Metadata-Version: 2.1
Name: DPA
Version: 0.0.3
Summary: The Density Peak Advanced packages.
Home-page: UNKNOWN
Author: Maria d'Errico
License: new BSD
Platform: UNKNOWN
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Provides-Extra: tests
Provides-Extra: docs
License-File: LICENSE

Density Peaks Advanced clustering
=================================

Status of the `scikit-learn`_ compatibility test:

.. image:: https://github.com/mariaderrico/DPA/actions/workflows/runpytest.yml/badge.svg?branch=master
   :alt: scikit-learn compatibility test status on GitHub Actions
   :target: https://github.com/mariaderrico/DPA/actions/workflows/runpytest.yml



The DPA package implements the Density Peaks Advanced (DPA) clustering algorithm as introduced in the paper "Automatic topography of high-dimensional data sets by non-parametric Density Peak clustering", published on `M. d'Errico, E. Facco, A. Laio, A. Rodriguez, Information Sciences, Volume 560, June 2021, 476-492`_  (also available on `arXiv`_).


The package offers the following features:

* Intrinsic dimensionality estimation by means of the `TWO-NN` algorithm, published in the `Estimating the intrinsic dimension of datasets by a minimal neighborhood information`_ paper.
* Adaptive k-NN Density estimation by means of the `PAk` algorithm, published in the `Computing the free energy without collective variables`_ paper.
* Advanced version of the `DP` clustering algorithm, published in the `Clustering by fast search and find of density peaks`_ paper, which includes an automatic search of cluster centers and assessment of statistical significance of the clusters  

.. contents::

Top-level directory layout
------------------------------

::

    cd DPA
    ls -l

::

    .
    |-- DP/                              # Auxiliary package with the DP clustering implementation.
    |-- docs/                            # Documentation files.
    |-- Examples/                        # Auxiliary scripts for the examples generations.
    |-- DPA_analysis.ipynb               # Use-case example for DPA.
    |-- DPA_comparison-all.ipynb         # Performance comparison with other clustering methods.
    |-- README.rst
    |-- compile.sh
    |-- setup.py
    |-- src/                             # Source files for DPA, PAk and twoNN algorithms.


Source files
------------

The source Python codes are stored inside the ``src`` folder.

::

    .
    |-- ...
    |-- src/
    |   |-- Pipeline/
    |       |-- __init__.py
    |       |-- DPA.py           # Python module implementing the DPA
    |       |                    # clustering algorithm.
    |       |
    |       |-- _DPA.pyx         # Cython extension of the DPA module.
    |       |
    |       |-- PAk.py           # Python module implementing the PAk
    |       |                    # density estimator.
    |       |
    |       |-- _PAk.pyx         # Cython extension of the PAk module.
    |       |
    |       |-- twoNN.py         # Python module implementing the TWO-NN
    |                            # algorithm for the ID calculation.
    |
    |-- ...   

Documentation files
-------------------

Full documentation about the Python codes developed and the how-to instructions is created in the ``docs`` folder using `Sphinx`.
Complete documentation for DPA is available on the  `Read The Docs <https://dpaclustering.readthedocs.org>`_ website.

Jupyter notebooks
-----------------

Examples of how-to run the ``DPA``, ``PAk`` and ``twoNN`` modules are provided as Jupyter notebook in ``DPA_analysis.ipynb``. Additional useful use-cases are available in ``DPA_comparison-all.ipynb``, which include a performance comparison with the following clustering methods: Bayesian Gaussian Mixture, HDBSCAN, Spectral Clustering and Density Peaks.

Both jupyter notebooks are also available as Python script (saved using `jupytext`_) in the ``jupytext`` folder.
::

    .
    |-- ...
    |-- DPA_analysis.ipynb               # Use-case example for DPA.
    |-- DPA_comparison-all.ipynb         # Performance comparison with
    |                                    # other clustering methods.
    |    
    |-- ...
    |-- jupytext/
    |   |-- DPA_analysis.py              # DPA_analysis.ipynb saved as
    |   |                                # Python script.
    |   |-- DPA_comparison-all.py        # DPA_comparison-all.ipynb
    |                                    # saved as Python script.


Getting started
---------------

The source code of DPA is on `github DPA repository`_. 

You need the ``git`` command in order to be able to clone it, and we
suggest you to use Python virtual environment in order to create a
controlled environment in which you can install DPA as
normal user avoiding conflicts with system files or Python libraries.

The following section documents the steps required to install DPA on a Linux or Windows/Mac computer.


Debian/Ubuntu
^^^^^^^^^^^^^

Run the following commands to create and activate a Python virtual environment with *python virtualenv*::

    apt-get install git python-dev virtualenv*
    virtualenv -p python3 venvdpa
    . venvdpa/bin/activate


Windows
^^^^^^^


A possible setup makes use of `Anaconda`_.
It has preinstalled and configured packages for data analysis and it is available on all major platforms. It uses *conda* as package manager, in addition to the standard pip.

A versioning control can be installed by downloading `git`_.

Run the following commands to activate the conda virtual environment::

    conda create -n venvdpa
    conda activate venvdpa

to list the available environments you can type ``conda info --envs``, and to deactivate an active environment use ``source deactivate``.


Installation
------------

Install required dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The DPA package depends on ``easycython``, that can be installed using ``conda`` or ``pip``.
Note that it is possible to check which packages are installed with the ``pip freeze`` command.


Installing released code from GitHub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Install the latest version from the GitHub repository via::

    pip install git+https://github.com/mariaderrico/DPA

Installing development code from GitHub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run the following commands to download the DPA source code::

    git clone https://github.com/mariaderrico/DPA.git

Install DPA with the following commands::

    cd DPA
    . compile.sh



Citing
------

If you have used this codebase in a scientific publication and wish to cite the algorithm, please cite our paper in Information Sciences.


    `M. d'Errico, E. Facco, A. Laio, A. Rodriguez, Information Sciences, Volume 560, June 2021, 476-492`_


.. code:: bibtex

    @article{DERRICO2021476,
      title = {Automatic topography of high-dimensional data sets by non-parametric density peak clustering},
      journal = {Information Sciences},
      volume = {560},
      pages = {476-492},
      year = {2021},
      issn = {0020-0255},
      doi = {https://doi.org/10.1016/j.ins.2021.01.010},
      url = {https://www.sciencedirect.com/science/article/pii/S0020025521000116},
      author = {Maria d’Errico and Elena Facco and Alessandro Laio and Alex Rodriguez},
      }



.. References

.. _`scikit-learn`: https://scikit-learn.org/stable/
.. _`M. d'Errico, E. Facco, A. Laio, A. Rodriguez, Information Sciences, Volume 560, June 2021, 476-492`: https://www.sciencedirect.com/science/article/pii/S0020025521000116?dgcid=author
.. _`arXiv`: https://arxiv.org/abs/1802.10549v2
.. _`Computing the free energy without collective variables`: https://pubs.acs.org/doi/full/10.1021/acs.jctc.7b00916 
.. _`Estimating the intrinsic dimension of datasets by a minimal neighborhood information`: https://export.arxiv.org/pdf/1803.06992 
.. _`Clustering by fast search and find of density peaks`: http://science.sciencemag.org/content/344/6191/1492.full.pdf
.. _`github DPA repository`: https://github.com/mariaderrico/DPA.git
.. _`Anaconda`: https://www.anaconda.com/download/#windows
.. _`git`: https://git-scm.com
.. _`jupytext`: https://pypi.org/project/jupytext/


