Metadata-Version: 2.1
Name: pythresh
Version: 0.3.0
Summary: A Python Toolbox for Outlier Detection Thresholding
Home-page: https://github.com/KulikDM/pythresh
Author: D Kulik
License: UNKNOWN
Download-URL: https://github.com/KulikDM/pythresh/archive/master.zip
Project-URL: Documentation, https://pythresh.readthedocs.io/en/latest/
Keywords: outlier detection,anomaly detection,thresholding,cutoff,contamination level,data science,machine learning
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Description-Content-Type: text/x-rst
License-File: LICENSE

##################################################
 Python Outlier Detection Thresholding (PyThresh)
##################################################

**Deployment, Stats, & License**

.. image:: https://img.shields.io/pypi/v/pythresh.svg?color=brightgreen&logo=pypi&logoColor=white
   :target: https://pypi.org/project/pythresh/
   :alt: PyPI version

.. image:: https://img.shields.io/conda/vn/conda-forge/pythresh?color=brightgreen&logo=conda-forge&logoColor=white
   :target: https://anaconda.org/conda-forge/pythresh
   :alt: Anaconda version

.. image:: https://readthedocs.org/projects/pythresh/badge/?version=latest
   :target: http://pythresh.readthedocs.io/?badge=latest
   :alt: Documentation status

.. image:: https://github.com/KulikDM/pythresh/actions/workflows/python-package.yml/badge.svg
   :target: https://github.com/KulikDM/pythresh/actions/workflows/python-package.yml
   :alt: testing

.. image:: https://codecov.io/gh/KulikDM/pythresh/branch/main/graph/badge.svg?token=8ZAPXTLW9Y
   :target: https://codecov.io/gh/KulikDM/pythresh
   :alt: Codecov

.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white
   :target: https://github.com/KulikDM/pythresh/stargazers
   :alt: GitHub stars

.. image:: https://pepy.tech/badge/pythresh?
   :target: https://pepy.tech/project/pythresh
   :alt: Downloads

.. image:: https://img.shields.io/pypi/pyversions/pythresh.svg?logo=python&logoColor=white
   :target: https://pypi.org/project/pythresh/
   :alt: Python versions

.. image:: https://img.shields.io/github/license/KulikDM/pythresh.svg
   :target: https://github.com/KulikDM/pythresh/blob/master/LICENSE
   :alt: License

----

PyThresh is a comprehensive and scalable **Python toolkit** for
**thresholding outlier detection scores** in univariate/multivariate
data. It has been written to work in tandem with PyOD and has similar
syntax and data structures. However, it is not limited to this single
library. PyThresh is meant to threshold the scores generated by an
outlier detector. It thresholds scores without the need to set a
contamination level or have the user guess the number of outliers that
may exist in the dataset beforehand. These non-parametric methods were
written to reduce the user's input and guesswork and to rely on
statistics instead to threshold outlier scores. For thresholding to be
applied correctly, the outlier detection scores must follow this rule:
the higher the score, the higher the probability that it is an outlier
in the dataset. All threshold functions return a binary array where
inliers and outliers are represented by a 0 and 1 respectively.
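
As a minimal illustration of the score convention and the binary output
format, the sketch below flips the scores of a hypothetical detector
where lower means more anomalous, then labels them with a fixed z-score
cutoff. It uses only NumPy. The helper names and the cutoff of 2.0 are
assumptions made for illustration, and this is not how any PyThresh
thresholder is implemented.

.. code:: python

   import numpy as np


   def to_outlier_scores(raw, higher_is_outlier=True):
       """Return scores where a higher value means more outlying."""
       raw = np.asarray(raw, dtype=float)
       return raw if higher_is_outlier else -raw


   def label_by_zscore(scores, cutoff=2.0):
       """Label scores as 1 (outlier) when their z-score exceeds cutoff."""
       scores = np.asarray(scores, dtype=float)
       z = (scores - scores.mean()) / scores.std()
       return (z > cutoff).astype(int)


   # hypothetical detector output where LOWER values are more anomalous
   raw = np.array([0.9, 1.0, 1.1, 0.95, 1.05, 1.0, 0.98, 1.02, -5.0])

   # flip the sign so that the "higher is more outlying" rule holds
   scores = to_outlier_scores(raw, higher_is_outlier=False)

   # binary array: 0 for inliers, 1 for outliers
   labels = label_by_zscore(scores)

Only the last sample is flagged here, because it is the only one whose
flipped score sits far above the rest.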

PyThresh includes more than 30 thresholding algorithms. These algorithms
range from using simple statistical analysis like the Z-score to more
complex mathematical methods that involve graph theory and topology.

***************
 Documentation
***************

Visit `PyThresh Docs <https://pythresh.readthedocs.io/en/latest/?badge=latest>`_
for full documentation, or see below for a quickstart installation and usage example.

----

**Outlier Detection Thresholding with 7 Lines of Code**:

.. code:: python

   # train the KNN detector
   from pyod.models.knn import KNN
   from pythresh.thresholds.clust import CLUST

   clf = KNN()
   clf.fit(X_train)

   # get outlier scores
   decision_scores = clf.decision_scores_  # raw outlier scores on the train data

   # get outlier labels
   thres = CLUST()
   labels = thres.eval(decision_scores)
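
The snippet above expects an existing ``X_train`` array. As a sketch of
an end-to-end run, the block below builds a small synthetic dataset with
NumPy and wraps the same PyOD and PyThresh calls in a helper. The data
(200 Gaussian inliers plus 10 distant points) and the function name
``threshold_knn_scores`` are assumptions made for illustration. PyOD and
PyThresh are imported inside the function, so both must be installed
before it is called.

.. code:: python

   import numpy as np

   # synthetic data: a dense Gaussian cluster plus a few distant points
   rng = np.random.default_rng(42)
   inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
   outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
   X_train = np.vstack([inliers, outliers])


   def threshold_knn_scores(X):
       """Fit a KNN detector on X and threshold its scores with CLUST."""
       # imported lazily so the data setup above runs without PyOD/PyThresh
       from pyod.models.knn import KNN
       from pythresh.thresholds.clust import CLUST

       clf = KNN()
       clf.fit(X)

       thres = CLUST()
       # binary array: 0 for inliers, 1 for outliers
       return thres.eval(clf.decision_scores_)

With both libraries installed, ``labels = threshold_knn_scores(X_train)``
returns one label per row of ``X_train``.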

**************
 Installation
**************

It is recommended to use **pip** or **conda** for installation:

.. code:: bash

   pip install pythresh            # normal install
   pip install --upgrade pythresh  # or update if needed

.. code:: bash

   conda install -c conda-forge pythresh

Alternatively, you can get the version with the latest updates by
cloning the repo and installing it from source:

.. code:: bash

   git clone https://github.com/KulikDM/pythresh.git
   cd pythresh
   pip install .

Or with **pip**:

.. code:: bash

   pip install https://github.com/KulikDM/pythresh/archive/main.zip

**Required Dependencies**:

-  matplotlib
-  numpy>=1.13
-  pyclustering
-  pyod
-  scipy>=1.3.1
-  scikit_learn>=0.20.0
-  six

**Optional Dependencies**:

-  ruptures (used in the CPD thresholder)
-  geomstats (used in the KARCH thresholder)
-  scikit-lego (used in the META thresholder)
-  joblib>=0.14.1 (used in the META thresholder)
-  pandas (used in the META thresholder)
-  torch (used in the VAE thresholder)
-  tqdm (used in the VAE thresholder)

****************
 API Cheatsheet
****************

-  **eval(score)**: Evaluate the outlier scores and return the binary
   labels (0 for inliers, 1 for outliers).

Key attributes of a thresholder:

-  **thresh_**: Return the threshold value that separates inliers from
   outliers. Outliers are considered all values above this threshold
   value. Note the threshold value has been derived from normalized
   scores.

-  **confidence_interval_**: Return the lower and upper confidence
   interval of the contamination level. Only applies to the COMB
   thresholder.
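
Because ``thresh_`` is derived from normalized scores, it cannot be
compared directly with raw detector scores. The sketch below shows one
way to move between the two scales. It assumes min-max normalization to
the range [0, 1], which is an assumption to check against the
documentation of the thresholder in use, and it uses a hypothetical
value of 0.6 in place of a real ``thresh_``. The helper names are
illustrative and are not part of the PyThresh API.

.. code:: python

   import numpy as np


   def min_max_normalize(scores):
       """Rescale scores linearly to the range [0, 1]."""
       scores = np.asarray(scores, dtype=float)
       return (scores - scores.min()) / (scores.max() - scores.min())


   def to_raw_scale(thresh_norm, scores):
       """Map a threshold on the [0, 1] scale back to the raw scale."""
       scores = np.asarray(scores, dtype=float)
       return scores.min() + thresh_norm * (scores.max() - scores.min())


   decision_scores = np.array([2.0, 3.0, 2.5, 4.0, 12.0])
   thresh_norm = 0.6  # hypothetical stand-in for a thresholder's thresh_

   # outliers are the values above the threshold on the normalized scale
   norm_scores = min_max_normalize(decision_scores)
   labels = (norm_scores > thresh_norm).astype(int)

   # the equivalent cutoff expressed on the raw score scale
   raw_thresh = to_raw_scale(thresh_norm, decision_scores)

Under this assumption, comparing ``decision_scores`` with ``raw_thresh``
gives the same labels as comparing ``norm_scores`` with ``thresh_norm``.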

************************
 External Feature Cases
************************

**Towards Data Science**: `Thresholding Outlier Detection Scores with
PyThresh
<https://towardsdatascience.com/thresholding-outlier-detection-scores-with-pythresh-f26299d14fa>`_

**Towards Data Science**: `When Outliers are Significant: Weighted
Linear Regression
<https://towardsdatascience.com/when-outliers-are-significant-weighted-linear-regression-bcdc8389ab10>`_

**ArXiv**: `Estimating the Contamination Factor's Distribution in
Unsupervised Anomaly Detection <https://arxiv.org/abs/2210.10487>`_

***********************************
 Available Thresholding Algorithms
***********************************

+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Abbr      | Description                               | References         | Documentation                                                                                                                                          |
+===========+===========================================+====================+========================================================================================================================================================+
| AUCP      | Area Under Curve Percentage               | [#aucp1]_          | `pythresh.thresholds.aucp module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.aucp>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| BOOT      | Bootstrapping                             | [#boot1]_          | `pythresh.thresholds.boot module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.boot>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| CHAU      | Chauvenet's Criterion                     | [#chau1]_          | `pythresh.thresholds.chau module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.chau>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| CLF       | Trained Linear Classifier                 | [#clf1]_           | `pythresh.thresholds.clf module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.clf>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| CLUST     | Clustering Based                          | [#clust1]_         | `pythresh.thresholds.clust module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.clust>`_              |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| CPD       | Change Point Detection                    | [#cpd1]_           | `pythresh.thresholds.cpd module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.cpd>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| DECOMP    | Decomposition                             | [#decomp1]_        | `pythresh.thresholds.decomp module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.decomp>`_            |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| DSN       | Distance Shift from Normal                | [#dsn1]_           | `pythresh.thresholds.dsn module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.dsn>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| EB        | Elliptical Boundary                       | [#eb1]_            | `pythresh.thresholds.eb module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.eb>`_                    |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| FGD       | Fixed Gradient Descent                    | [#fgd1]_           | `pythresh.thresholds.fgd module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.fgd>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| FILTER    | Filtering Based                           | [#filter1]_        | `pythresh.thresholds.filter module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.filter>`_            |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| FWFM      | Full Width at Full Minimum                | [#fwfm1]_          | `pythresh.thresholds.fwfm module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.fwfm>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| GESD      | Generalized Extreme Studentized Deviate   | [#gesd1]_          | `pythresh.thresholds.gesd module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.gesd>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| HIST      | Histogram Based                           | [#hist1]_          | `pythresh.thresholds.hist module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.hist>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| IQR       | Inter-Quartile Region                     | [#iqr1]_           | `pythresh.thresholds.iqr module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.iqr>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| KARCH     | Karcher mean (Riemannian Center of Mass)  | [#karch1]_         | `pythresh.thresholds.karch module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.karch>`_              |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| MAD       | Median Absolute Deviation                 | [#mad1]_           | `pythresh.thresholds.mad module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mad>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| MCST      | Monte Carlo Shapiro Tests                 | [#mcst1]_          | `pythresh.thresholds.mcst module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mcst>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| META      | Meta-model Trained Classifier             | [#meta1]_          | `pythresh.thresholds.meta module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.meta>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| MOLL      | Friedrichs' Mollifier                     | [#moll1]_          | `pythresh.thresholds.moll module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.moll>`_                |
|           |                                           | [#moll2]_          |                                                                                                                                                        |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| MTT       | Modified Thompson Tau Test                | [#mtt1]_           | `pythresh.thresholds.mtt module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mtt>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| OCSVM     | One-Class Support Vector Machine          | [#ocsvm]_          | `pythresh.thresholds.ocsvm module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#pythresh-thresholds-ocsvm-module>`_              |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| QMCD      | Quasi-Monte Carlo Discrepancy             | [#qmcd1]_          | `pythresh.thresholds.qmcd module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.qmcd>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| REGR      | Regression Based                          | [#regr1]_          | `pythresh.thresholds.regr module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.regr>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| VAE       | Variational Autoencoder                   | [#vae1]_           | `pythresh.thresholds.vae module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.vae>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| WIND      | Topological Winding Number                | [#wind1]_          | `pythresh.thresholds.wind module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.wind>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| YJ        | Yeo-Johnson Transformation                | [#yj1]_            | `pythresh.thresholds.yj module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.yj>`_                    |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| ZSCORE    | Z-score                                   | [#zscore1]_        | `pythresh.thresholds.zscore module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.zscore>`_            |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| ALL       | All Thresholders Combined                 | None               | `pythresh.thresholds.all module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.all>`_                  |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| COMB      | Thresholder Combination                   | None               | `pythresh.thresholds.comb module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.comb>`_                |
+-----------+-------------------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+

******************************
 Implementations & Benchmarks
******************************

**A comparison of the implemented thresholders and general implementation**
is available below.

Additional `benchmarking <https://pythresh.readthedocs.io/en/latest/benchmark.html>`_
has been done on all the thresholders. It was found that the ``META``
thresholder performed best, while the ``CLF`` thresholder provided the
smallest uncertainty about its mean and is the most robust (its least
accurate prediction was the best among the thresholders). However, for
interpretability and general performance, the ``FILTER`` thresholder is
a good fit.

----

For Jupyter Notebooks, please navigate to `notebooks
<https://github.com/KulikDM/pythresh/tree/main/notebooks>`_.

A quick look at the performance of all the thresholders can be found at
**"/notebooks/Compare All Models.ipynb"**

.. image:: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
   :target: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
   :alt: Comparison_of_All

----


**************
 Contributing
**************

Anyone is welcome to contribute to PyThresh:

* Please share your ideas and ask questions by opening an issue.

* To contribute, first check the Issue list for the "help wanted" tag and comment 
  on the one that you are interested in. The issue will then be assigned to you.

* If the bug, feature, or documentation change is novel (not in the Issue list),
  you can either log a new issue or create a pull request for the new changes.

* To start, fork the main branch and add your improvement/modification/fix.

* To make sure the code has the same style and standard, please refer to qmcd.py for
  an example.

* Create a pull request to the **main branch** and follow the
  `PR template <https://github.com/KulikDM/pythresh/blob/main/.github/PULL_REQUEST_TEMPLATE.md>`_.

* Please make sure that all code changes are accompanied by new or updated test
  functions. Automatic tests will be triggered. Before the pull request can be merged,
  make sure that all the tests pass.

----

************
 References
************

**Please note**: not all of the references' exact methods have been
employed in PyThresh. Rather, the references serve to demonstrate the
validity of the threshold types available in PyThresh.

.. [#aucp1]

   `A Robust AUC Maximization Framework With Simultaneous Outlier Detection
   and Feature Selection for Positive-Unlabeled Classification
   <https://arxiv.org/abs/1803.06604>`_

.. [#boot1]

   `An evaluation of bootstrap methods for outlier detection in least
   squares regression
   <https://www.researchgate.net/publication/24083638_An_evaluation_of_bootstrap_methods_for_outlier_detection_in_least_squares_regression>`_

.. [#chau1]

   `Chauvenet’s Test in the Classical Theory of Errors
   <https://epubs.siam.org/doi/10.1137/1119078>`_

.. [#clf1]

   `Linear Models for Outlier Detection
   <https://link.springer.com/chapter/10.1007/978-3-319-47578-3_3>`_

.. [#clust1]

   `Cluster Analysis for Outlier Detection
   <https://www.researchgate.net/publication/224990195_Cluster_Analysis_for_Outlier_Detection>`_

.. [#cpd1]

   `Changepoint Detection in the Presence of Outliers
   <https://arxiv.org/abs/1609.07363>`_

.. [#decomp1]

   `Influence functions and outlier detection under the common principal
   components model: A robust approach
   <https://www.researchgate.net/publication/5207186_Influence_functions_and_outlier_detection_under_the_common_principal_components_model_A_robust_approach>`_

.. [#dsn1]

   `Fast and Exact Outlier Detection in Metric Spaces: A Proximity
   Graph-based Approach <https://arxiv.org/abs/2110.08959>`_

.. [#eb1]

   `Elliptical Insights: Understanding Statistical Methods through
   Elliptical Geometry <https://arxiv.org/abs/1302.4881>`_

.. [#fgd1]

   `Iterative gradient descent for outlier detection
   <https://www.worldscientific.com/doi/10.1142/S0219691321500041>`_

.. [#filter1]

   `Filtering Approaches for Dealing with Noise in Anomaly Detection
   <https://ieeexplore.ieee.org/document/9029258/>`_

.. [#fwfm1]

   `Sparse Auto-Regressive: Robust Estimation of AR Parameters
   <https://arxiv.org/abs/1306.3317>`_

.. [#gesd1]

   `An adjusted Grubbs' and generalized extreme studentized deviation
   <https://www.degruyter.com/document/doi/10.1515/dema-2021-0041/html?lang=en>`_

.. [#hist1]

   `Effective Histogram Thresholding Techniques for Natural Images Using
   Segmentation
   <http://www.joig.net/uploadfile/2015/0116/20150116042320548.pdf>`_

.. [#iqr1]

   `A new non-parametric detector of univariate outliers for distributions
   with unbounded support <https://arxiv.org/abs/1509.02473>`_

.. [#karch1]

   `Riemannian center of mass and mollifier smoothing
   <https://www.jstor.org/stable/41059320>`_

.. [#mad1]

   `Periodicity Detection of Outlier Sequences Using Constraint Based
   Pattern Tree with MAD <https://arxiv.org/abs/1507.01685>`_

.. [#mcst1]

   `Testing normality in the presence of outliers
   <https://www.researchgate.net/publication/24065017_Testing_normality_in_the_presence_of_outliers>`_

.. [#meta1]

   `Automating Outlier Detection via Meta-Learning
   <https://arxiv.org/abs/2009.10606>`_

.. [#moll1]

   `Riemannian center of mass and mollifier smoothing
   <https://www.jstor.org/stable/41059320>`_

.. [#moll2]

   `Using the mollifier method to characterize datasets and models: The
   case of the Universal Soil Loss Equation
   <https://www.researchgate.net/publication/286670128_Using_the_mollifier_method_to_characterize_datasets_and_models_The_case_of_the_Universal_Soil_Loss_Equation>`_

.. [#mtt1]

   `Towards a More Reliable Interpretation of Machine Learning Outputs for
   Safety-Critical Systems using Feature Importance Fusion
   <https://arxiv.org/abs/2009.05501>`_

.. [#ocsvm]

   `Rule extraction in unsupervised anomaly detection for model
   explainability: Application to OneClass SVM
   <https://arxiv.org/abs/1911.09315>`_

.. [#qmcd1]

   `Deterministic and quasi-random sampling of optimized Gaussian mixture
   distributions for vibronic Monte Carlo
   <https://arxiv.org/abs/1912.11594>`_

.. [#regr1]

   `Linear Models for Outlier Detection
   <https://link.springer.com/chapter/10.1007/978-3-319-47578-3_3>`_

.. [#vae1]

   `Likelihood Regret: An Out-of-Distribution Detection Score For
   Variational Auto-encoder <https://arxiv.org/abs/2003.02977>`_

.. [#wind1]

   `Robust Inside-Outside Segmentation Using Generalized Winding Numbers
   <https://www.researchgate.net/publication/262165781_Robust_Inside-Outside_Segmentation_Using_Generalized_Winding_Numbers>`_

.. [#yj1]

   `Transforming variables to central normality
   <https://arxiv.org/abs/2005.07946>`_

.. [#zscore1]

   `Multiple outlier detection tests for parametric models
   <https://arxiv.org/abs/1910.10426>`_


