Metadata-Version: 2.1
Name: pythresh
Version: 0.2.8
Summary: A Python Toolbox for Outlier Detection Thresholding
Home-page: https://github.com/KulikDM/pythresh
Author: D Kulik
License: UNKNOWN
Download-URL: https://github.com/KulikDM/pythresh/archive/master.zip
Project-URL: Documentation, https://pythresh.readthedocs.io/en/latest/
Keywords: outlier detection,anomaly detection,thresholding,cutoff,contamination level,data science,machine learning
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Description-Content-Type: text/x-rst
License-File: LICENSE

Python Outlier Detection Thresholding (PyThresh)
================================================

**Deployment, Stats, & License**

.. image:: https://img.shields.io/pypi/v/pythresh.svg?color=brightgreen&logo=pypi&logoColor=white
   :target: https://pypi.org/project/pythresh/
   :alt: PyPI version

.. image:: https://readthedocs.org/projects/pythresh/badge/?version=latest
   :target: http://pythresh.readthedocs.io/?badge=latest
   :alt: Documentation status

.. image:: https://github.com/KulikDM/pythresh/actions/workflows/python-package.yml/badge.svg
   :target: https://github.com/KulikDM/pythresh/actions/workflows/python-package.yml
   :alt: testing

.. image:: https://codecov.io/gh/KulikDM/pythresh/branch/main/graph/badge.svg?token=8ZAPXTLW9Y 
   :target: https://codecov.io/gh/KulikDM/pythresh
   :alt: Codecov

.. image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white
   :target: https://github.com/KulikDM/pythresh/stargazers
   :alt: GitHub stars

.. image:: https://pepy.tech/badge/pythresh?
   :target: https://pepy.tech/project/pythresh
   :alt: Downloads
   
.. image:: https://img.shields.io/pypi/pyversions/pythresh.svg?logo=python&logoColor=white
   :target: https://pypi.org/project/pythresh/
   :alt: Python versions
  
.. image:: https://img.shields.io/github/license/KulikDM/pythresh.svg
   :target: https://github.com/KulikDM/pythresh/blob/master/LICENSE
   :alt: License


-----

PyThresh is a comprehensive and scalable **Python toolkit** for **thresholding outlier detection scores** in univariate/multivariate data. It has been written to work in tandem with PyOD and has similar syntax and data structures. However, it is not limited to this single library. PyThresh is meant to threshold the scores generated by an outlier detector. It does so without the need to set a contamination level or to have the user guess beforehand how many outliers may exist in the dataset. These non-parametric methods were written to reduce the user's input and guesswork and to rely on statistics instead to threshold outlier scores. For thresholding to be applied correctly, the outlier detection scores must follow this rule: the higher the score, the higher the probability that it is an outlier in the dataset. All threshold functions return a binary array where inliers and outliers are represented by 0 and 1 respectively.

PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology. 
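
To make the simplest end of that range concrete, the following is a minimal sketch of a Z-score style rule written with only the standard library. It is an illustration rather than PyThresh's implementation: the function name ``zscore_labels`` and the cutoff of 3 standard deviations are assumptions chosen for this example.

.. code-block:: python

    from statistics import mean, pstdev


    def zscore_labels(scores, cutoff=3.0):
        """Flag scores more than `cutoff` standard deviations above the mean.

        Hypothetical helper for illustration only, not part of PyThresh.
        Returns a list of 0 (inlier) and 1 (outlier).
        """
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            # All scores are identical, so nothing stands out
            return [0] * len(scores)
        return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


    # Nineteen low scores and one clearly larger score
    scores = [0.1] * 10 + [0.2] * 9 + [5.0]
    labels = zscore_labels(scores)
    print(sum(labels), "outlier(s) flagged")

Only scores far above the mean are flagged, which matches the rule that higher scores indicate more likely outliers.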


**Outlier Detection Thresholding with 7 Lines of Code**\ :


.. code-block:: python


    # import the KNN detector and the CLUST thresholder
    from pyod.models.knn import KNN
    from pythresh.thresholds.clust import CLUST

    # train the KNN detector (X_train is your own training data)
    clf = KNN()
    clf.fit(X_train)

    # get outlier scores
    decision_scores = clf.decision_scores_  # raw outlier scores on the train data

    # get outlier labels (0 = inlier, 1 = outlier)
    thres = CLUST()
    labels = thres.eval(decision_scores)
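
Because the result is a binary array, follow-up steps need only ordinary indexing and counting. The sketch below uses a hypothetical ``labels`` list in place of the output of ``thres.eval`` and shows how the outlier indices and the implied contamination level could be derived. The helper name ``summarize_labels`` is an assumption made for this example.

.. code-block:: python

    def summarize_labels(labels):
        """Return the indices flagged as outliers and the outlier fraction.

        Hypothetical helper for illustration only, not part of PyThresh.
        """
        outlier_idx = [i for i, lab in enumerate(labels) if lab == 1]
        contamination = len(outlier_idx) / len(labels) if len(labels) else 0.0
        return outlier_idx, contamination


    # Hypothetical stand-in for the array returned by a thresholder
    labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    outlier_idx, contamination = summarize_labels(labels)
    print("outlier indices:", outlier_idx)
    print("contamination:", contamination)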


Installation
^^^^^^^^^^^^

It is recommended to use **pip** for installation:

.. code-block:: bash

   pip install pythresh            # normal install
   pip install --upgrade pythresh  # or update if needed

Alternatively, you can get the version with the latest updates by
cloning the repo and installing it from source:

.. code-block:: bash

   git clone https://github.com/KulikDM/pythresh.git
   cd pythresh
   pip install .

Or install the latest archive directly with **pip**:

.. code-block:: bash

   pip install https://github.com/KulikDM/pythresh/archive/main.zip

**Required Dependencies**\ :


* matplotlib
* numpy>=1.13
* pyclustering
* pyod
* scipy>=1.3.1
* scikit_learn>=0.20.0
* six

**Optional Dependencies**\ :

* geomstats (used in the KARCH thresholder)
* scikit-lego (used in the META thresholder)
* joblib>=0.14.1 (used in the META thresholder)
* pandas (used in the META thresholder)
* torch (used in the VAE thresholder)
* tqdm (used in the VAE thresholder)
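
Since the optional packages are only needed by specific thresholders, it can be useful to check which of them are importable before choosing KARCH, META, or VAE. The following is a minimal standard-library sketch. The helper name ``available_optional_deps`` is hypothetical, and the import names are assumptions based on how these packages are usually imported (scikit-lego, for example, is imported as ``sklego``).

.. code-block:: python

    from importlib.util import find_spec

    # Assumed import name -> thresholder that uses it, per the list above
    OPTIONAL_DEPS = {
        "geomstats": "KARCH",
        "sklego": "META",
        "joblib": "META",
        "pandas": "META",
        "torch": "VAE",
        "tqdm": "VAE",
    }


    def available_optional_deps():
        """Map each optional import name to True if it can be located."""
        # find_spec locates a top-level module without importing it
        return {name: find_spec(name) is not None for name in OPTIONAL_DEPS}


    status = available_optional_deps()
    for name, found in status.items():
        state = "found" if found else "missing"
        print(f"{name:<10} ({OPTIONAL_DEPS[name]}): {state}")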


API Cheatsheet
^^^^^^^^^^^^^^


* **eval(score)**\ : evaluate the outlier scores and return binary labels (0 for inliers, 1 for outliers).

Key attributes of a thresholder:


* **thresh_**\ : the threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from normalized scores.

* **confidence_interval_**\ : the lower and upper bounds of the confidence interval for the contamination level. Only applies to the ALL thresholder.
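
The note about normalized scores matters when ``thresh_`` is compared with raw detector output. As a sketch, and assuming min-max scaling to the range [0, 1] (an assumption for this example; consult the documentation for the normalization actually applied), a normalized threshold could be mapped back to the raw score scale as follows. The function names are hypothetical.

.. code-block:: python

    def minmax_normalize(scores):
        """Scale scores linearly so the minimum is 0 and the maximum is 1."""
        lo, hi = min(scores), max(scores)
        if hi == lo:
            # Constant scores carry no ranking information
            return [0.0 for _ in scores]
        return [(s - lo) / (hi - lo) for s in scores]


    def to_raw_scale(norm_thresh, scores):
        """Invert the min-max scaling for a single threshold value."""
        lo, hi = min(scores), max(scores)
        return lo + norm_thresh * (hi - lo)


    raw_scores = [2.0, 4.0, 6.0, 10.0]
    normalized = minmax_normalize(raw_scores)
    norm_thresh = 0.75  # hypothetical threshold on the normalized scale
    raw_thresh = to_raw_scale(norm_thresh, raw_scores)

    # Thresholding on either scale yields the same labels
    labels_norm = [1 if s > norm_thresh else 0 for s in normalized]
    labels_raw = [1 if s > raw_thresh else 0 for s in raw_scores]
    print(raw_thresh, labels_norm == labels_raw)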

External Feature Cases
^^^^^^^^^^^^^^^^^^^^^^

**Towards Data Science**: `Thresholding Outlier Detection Scores with PyThresh <https://towardsdatascience.com/thresholding-outlier-detection-scores-with-pythresh-f26299d14fa>`_

**Towards Data Science**: `When Outliers are Significant: Weighted Linear Regression <https://towardsdatascience.com/when-outliers-are-significant-weighted-linear-regression-bcdc8389ab10>`_

**ArXiv**: `Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection <https://arxiv.org/abs/2210.10487>`_

Available Thresholding Algorithms
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=========== =========================================== ==================== ==============================================================================
Abbr        Description                                 References           Documentation   
=========== =========================================== ==================== ==============================================================================
AUCP        Area Under Curve Percentage                 [#aucp1]_            `pythresh.thresholds.aucp module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.aucp>`_
BOOT        Bootstrapping                               [#boot1]_            `pythresh.thresholds.boot module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.boot>`_
CHAU        Chauvenet's Criterion                       [#chau1]_            `pythresh.thresholds.chau module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.chau>`_
CLF         Trained Linear Classifier                   [#clf1]_             `pythresh.thresholds.clf module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.clf>`_
CLUST       Clustering Based                            [#clust1]_           `pythresh.thresholds.clust module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.clust>`_
DECOMP      Decomposition                               [#decomp1]_          `pythresh.thresholds.decomp module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.decomp>`_
DSN         Distance Shift from Normal                  [#dsn1]_             `pythresh.thresholds.dsn module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.dsn>`_
EB          Elliptical Boundary                         [#eb1]_              `pythresh.thresholds.eb module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.eb>`_
FGD         Fixed Gradient Descent                      [#fgd1]_             `pythresh.thresholds.fgd module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.fgd>`_
FILTER      Filtering Based                             [#filter1]_          `pythresh.thresholds.filter module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.filter>`_
FWFM        Full Width at Full Minimum                  [#fwfm1]_            `pythresh.thresholds.fwfm module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.fwfm>`_
GESD        Generalized Extreme Studentized Deviate     [#gesd1]_            `pythresh.thresholds.gesd module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.gesd>`_
HIST        Histogram Based                             [#hist1]_            `pythresh.thresholds.hist module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.hist>`_
IQR         Inter-Quartile Region                       [#iqr1]_             `pythresh.thresholds.iqr module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.iqr>`_
KARCH       Karcher mean (Riemannian Center of Mass)    [#karch1]_           `pythresh.thresholds.karch module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.karch>`_
MAD         Median Absolute Deviation                   [#mad1]_             `pythresh.thresholds.mad module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mad>`_
MCST        Monte Carlo Shapiro Tests                   [#mcst1]_            `pythresh.thresholds.mcst module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mcst>`_
META        Meta-model Trained Classifier               [#meta1]_            `pythresh.thresholds.meta module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.meta>`_
MOLL        Friedrichs' Mollifier                       [#moll1]_ [#moll2]_  `pythresh.thresholds.moll module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.moll>`_
MTT         Modified Thompson Tau Test                  [#mtt1]_             `pythresh.thresholds.mtt module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.mtt>`_
OCSVM       One-Class Support Vector Machine            [#ocsvm]_            `pythresh.thresholds.ocsvm module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#pythresh-thresholds-ocsvm-module>`_
QMCD        Quasi-Monte Carlo Discrepancy               [#qmcd1]_            `pythresh.thresholds.qmcd module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.qmcd>`_
REGR        Regression Based                            [#regr1]_            `pythresh.thresholds.regr module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.regr>`_
VAE         Variational Autoencoder                     [#vae1]_             `pythresh.thresholds.vae module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.vae>`_ 
WIND        Topological Winding Number                  [#wind1]_            `pythresh.thresholds.wind module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.wind>`_
YJ          Yeo-Johnson Transformation                  [#yj1]_              `pythresh.thresholds.yj module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.yj>`_
ZSCORE      Z-score                                     [#zscore1]_          `pythresh.thresholds.zscore module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.zscore>`_
ALL         All Thresholders Combined                   None                 `pythresh.thresholds.all module <https://pythresh.readthedocs.io/en/latest/pythresh.thresholds.html#module-pythresh.thresholds.all>`_
=========== =========================================== ==================== ==============================================================================
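
To illustrate the kind of rule behind the simpler entries in the table, here is a minimal standard-library sketch of an inter-quartile cutoff using the common Tukey-style fence, ``Q3 + 1.5 * IQR``. It is not taken from the PyThresh source: the function name ``iqr_labels`` and the multiplier ``k`` are assumptions for this example, and the library's thresholder may differ in detail.

.. code-block:: python

    from statistics import quantiles


    def iqr_labels(scores, k=1.5):
        """Flag scores above Q3 + k * IQR as outliers.

        Hypothetical helper for illustration only, not part of PyThresh.
        Returns the 0/1 labels and the cutoff that was used.
        """
        # quantiles() sorts the data internally; n=4 yields Q1, Q2, Q3
        q1, _, q3 = quantiles(scores, n=4, method="inclusive")
        upper = q3 + k * (q3 - q1)
        labels = [1 if s > upper else 0 for s in scores]
        return labels, upper


    scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0]
    labels, upper = iqr_labels(scores)
    print("cutoff:", upper)
    print("labels:", labels)

Only the upper fence is used here because, under the score convention described earlier, low scores are never treated as outliers.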


Implementations & Benchmarks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**A comparison of the implemented thresholders and examples of their general implementation** are available below.

For Jupyter Notebooks, please navigate to `notebooks <https://github.com/KulikDM/pythresh/tree/main/notebooks>`_.

A quick look at the performance of all the thresholders can be found in **"/notebooks/Compare All Models.ipynb"**.

.. image:: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
   :target: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
   :alt: Comparison_of_All
   
   
References
^^^^^^^^^^

**Please note** that not all of the references' exact methods have been employed in PyThresh. Rather, the references serve to demonstrate the validity of the threshold types available in PyThresh.

.. [#aucp1] `A Robust AUC Maximization Framework With Simultaneous Outlier Detection and Feature Selection for Positive-Unlabeled Classification <https://arxiv.org/abs/1803.06604>`_

.. [#boot1] `An evaluation of bootstrap methods for outlier detection in least squares regression <https://www.researchgate.net/publication/24083638_An_evaluation_of_bootstrap_methods_for_outlier_detection_in_least_squares_regression>`_

.. [#chau1] `Chauvenet’s Test in the Classical Theory of Errors <https://epubs.siam.org/doi/10.1137/1119078>`_

.. [#clf1] `Linear Models for Outlier Detection <https://link.springer.com/chapter/10.1007/978-3-319-47578-3_3>`_

.. [#clust1] `Cluster Analysis for Outlier Detection <https://www.researchgate.net/publication/224990195_Cluster_Analysis_for_Outlier_Detection>`_

.. [#decomp1] `Influence functions and outlier detection under the common principal components model: A robust approach <https://www.researchgate.net/publication/5207186_Influence_functions_and_outlier_detection_under_the_common_principal_components_model_A_robust_approach>`_

.. [#dsn1] `Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach <https://arxiv.org/abs/2110.08959>`_

.. [#eb1] `Elliptical Insights: Understanding Statistical Methods through Elliptical Geometry <https://arxiv.org/abs/1302.4881>`_

.. [#fgd1] `Iterative gradient descent for outlier detection <https://www.worldscientific.com/doi/10.1142/S0219691321500041>`_

.. [#filter1] `Filtering Approaches for Dealing with Noise in Anomaly Detection <https://ieeexplore.ieee.org/document/9029258/>`_

.. [#fwfm1] `Sparse Auto-Regressive: Robust Estimation of AR Parameters <https://arxiv.org/abs/1306.3317>`_

.. [#gesd1] `An adjusted Grubbs' and generalized extreme studentized deviation <https://www.degruyter.com/document/doi/10.1515/dema-2021-0041/html?lang=en>`_

.. [#hist1] `Effective Histogram Thresholding Techniques for Natural Images Using Segmentation <http://www.joig.net/uploadfile/2015/0116/20150116042320548.pdf>`_

.. [#iqr1] `A new non-parametric detector of univariate outliers for distributions with unbounded support <https://arxiv.org/abs/1509.02473>`_

.. [#karch1] `Riemannian center of mass and mollifier smoothing <https://www.jstor.org/stable/41059320>`_

.. [#mad1] `Periodicity Detection of Outlier Sequences Using Constraint Based Pattern Tree with MAD <https://arxiv.org/abs/1507.01685>`_

.. [#mcst1] `Testing normality in the presence of outliers <https://www.researchgate.net/publication/24065017_Testing_normality_in_the_presence_of_outliers>`_

.. [#meta1] `Automating Outlier Detection via Meta-Learning <https://arxiv.org/abs/2009.10606>`_

.. [#moll1] `Riemannian center of mass and mollifier smoothing <https://www.jstor.org/stable/41059320>`_

.. [#moll2] `Using the mollifier method to characterize datasets and models: The case of the Universal Soil Loss Equation <https://www.researchgate.net/publication/286670128_Using_the_mollifier_method_to_characterize_datasets_and_models_The_case_of_the_Universal_Soil_Loss_Equation>`_

.. [#mtt1] `Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems using Feature Importance Fusion <https://arxiv.org/abs/2009.05501>`_

.. [#ocsvm] `Rule extraction in unsupervised anomaly detection for model explainability: Application to OneClass SVM <https://arxiv.org/abs/1911.09315>`_

.. [#qmcd1] `Deterministic and quasi-random sampling of optimized Gaussian mixture distributions for vibronic Monte Carlo <https://arxiv.org/abs/1912.11594>`_

.. [#regr1] `Linear Models for Outlier Detection <https://link.springer.com/chapter/10.1007/978-3-319-47578-3_3>`_

.. [#vae1] `Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder <https://arxiv.org/abs/2003.02977>`_

.. [#wind1] `Robust Inside-Outside Segmentation Using Generalized Winding Numbers <https://www.researchgate.net/publication/262165781_Robust_Inside-Outside_Segmentation_Using_Generalized_Winding_Numbers>`_

.. [#yj1] `Transforming variables to central normality <https://arxiv.org/abs/2005.07946>`_

.. [#zscore1] `Multiple outlier detection tests for parametric models <https://arxiv.org/abs/1910.10426>`_


