Metadata-Version: 2.1
Name: diffpass
Version: 0.0.1
Summary: Differentiable Pairing using Soft Scores
Home-page: https://github.com/ulupo/DiffPaSS
Author: Umberto Lupo and Damiano Sgarbossa
Author-email: umberto.lupo@epfl.ch, damiano.sgarbossa@epfl.ch
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: matplotlib
Requires-Dist: biopython
Requires-Dist: tqdm
Requires-Dist: pandas
Provides-Extra: dev
Requires-Dist: nbdev; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: jupyter; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"

# DiffPaSS – Differentiable Pairing using Soft Scores

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Overview

DiffPaSS is a family of high-performance and scalable PyTorch modules
for finding optimal one-to-one pairings between two collections of
biological sequences, and for performing general graph alignment.

### Pairing multiple-sequence alignments (MSAs)

A typical example of the problem DiffPaSS is designed to solve is the
following: given two multiple sequence alignments (MSAs) A and B,
containing interacting biological sequences, find the optimal one-to-one
pairing between the sequences in A and B.

![](media/MSA_pairing_problem.svg) *Pairing problem for two multiple
sequence alignments, where pairings are restricted to be within the same
species*

To find an optimal pairing, we can maximize the average mutual
information between columns of the two paired MSAs
([`InformationPairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#informationpairing)),
or we can maximize the similarity between distance-based
([`MirrortreePairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#mirrortreepairing))
or orthology
([`BestHitsPairing`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#besthitspairing))
networks constructed from the two MSAs.
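
As a rough illustration of the first objective, the sketch below computes
a plug-in estimate of the mutual information between every column of one
MSA and every column of the other, and averages the result for a given
pairing. The function names and the toy MSAs are hypothetical, and the
estimator used inside `InformationPairing` may differ (for instance in
how it treats gaps or pseudocounts).

```python
import math
from collections import Counter


def column_mutual_information(col_a, col_b):
    """Plug-in estimate (in nats) of the mutual information of two columns."""
    n = len(col_a)
    joint = Counter(zip(col_a, col_b))
    marg_a = Counter(col_a)
    marg_b = Counter(col_b)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (count / n) * math.log(count * n / (marg_a[a] * marg_b[b]))
    return mi


def average_inter_msa_mi(msa_a, msa_b, pairing):
    """Average MI over all (column of A, column of B) pairs.

    `pairing[i]` is the index of the row of `msa_b` paired with row `i`
    of `msa_a` (an illustrative convention, not the library's).
    """
    paired_b = [msa_b[j] for j in pairing]
    cols_a = list(zip(*msa_a))
    cols_b = list(zip(*paired_b))
    total = sum(column_mutual_information(ca, cb) for ca in cols_a for cb in cols_b)
    return total / (len(cols_a) * len(cols_b))


# Toy MSAs (hypothetical data): four sequences of length two each.
msa_a = ["AC", "AC", "GT", "GT"]
msa_b = ["TT", "CC", "TT", "CC"]
score_identity = average_inter_msa_mi(msa_a, msa_b, [0, 1, 2, 3])
score_matched = average_inter_msa_mi(msa_a, msa_b, [0, 2, 1, 3])
```

In this toy case the second pairing makes the columns of the two MSAs
perfectly co-vary, so its score is higher than that of the identity
pairing.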

### Graph alignment and pairing unaligned sequence collections

DiffPaSS can be used for general graph alignment problems
([`GraphAlignment`](https://Bitbol-Lab.github.io/DiffPaSS/train.html#graphalignment)),
where the goal is to find the one-to-one pairing between the nodes of
two weighted graphs that maximizes the similarity between the two
graphs. The user can specify the (dis-)similarity measure to be
optimized, as an arbitrary differentiable function of the adjacency
matrices of the two graphs.
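
To make the objective concrete, here is a minimal NumPy sketch that
relabels the nodes of one graph and scores the result against the other
with a squared Frobenius distance. It finds the best permutation by
exhaustive search, which is only feasible for tiny graphs and is not how
DiffPaSS optimizes; all names below are hypothetical and not part of the
DiffPaSS API.

```python
import itertools

import numpy as np


def permute_graph(adj, perm):
    """Relabel nodes: entry (i, j) of the result is adj[perm[i], perm[j]]."""
    perm = np.asarray(perm)
    return adj[np.ix_(perm, perm)]


def squared_difference(adj_x, adj_y):
    """One possible dissimilarity: the squared Frobenius distance."""
    return float(np.sum((adj_x - adj_y) ** 2))


def brute_force_alignment(adj_a, adj_b, dissimilarity=squared_difference):
    """Try every permutation of the nodes of B and keep the best one."""
    n = adj_a.shape[0]
    return min(
        itertools.permutations(range(n)),
        key=lambda perm: dissimilarity(adj_a, permute_graph(adj_b, perm)),
    )


# Graph B is a relabelled copy of graph A (toy data).
adj_a = np.array([[0.0, 1.0, 5.0], [1.0, 0.0, 2.0], [5.0, 2.0, 0.0]])
adj_b = permute_graph(adj_a, [2, 0, 1])
best = brute_force_alignment(adj_a, adj_b)
```

Any other function of the two adjacency matrices could be passed as
`dissimilarity`; a gradient-based method additionally needs that
function to be differentiable.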

This capability also allows DiffPaSS to find the optimal one-to-one
pairing between two unaligned collections of sequences, provided that
weighted graphs are built in advance from the two collections (for
example, using pairwise Levenshtein distances). This is useful when
alignments are unavailable or unreliable.
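
A weighted graph of this kind could be built as in the following
pure-Python sketch, which computes pairwise Levenshtein distances
within one collection. The helper names are hypothetical; in practice a
dedicated edit-distance library would likely be faster.

```python
def levenshtein(s, t):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (cs != ct),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def distance_graph(sequences):
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(sequences)
    graph = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            graph[i][j] = graph[j][i] = levenshtein(sequences[i], sequences[j])
    return graph


# Toy collection of unaligned sequences (hypothetical data).
graph = distance_graph(["MKV", "MKVL", "AKV"])
```

Building one such matrix per collection yields the two weighted graphs
to be aligned.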

### Can I pair two collections with a different number of sequences?

DiffPaSS optimizes and returns permutation matrices. Hence, its inputs
are required to have the same number of sequences. However, DiffPaSS can
be used to pair two collections (e.g. MSAs) containing a different
number of sequences, by padding the smaller collection with dummy
sequences. For multiple sequence alignments, a simple choice is to add
dummy sequences consisting entirely of gap symbols. For general graphs,
dummy nodes, connected to the other nodes with arbitrary edge weights,
can be added to the smaller graph.
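
For MSAs stored as lists of equal-length strings, the padding step might
look like the sketch below. `pad_with_gaps` is a hypothetical helper,
not a DiffPaSS function, and it assumes `-` is the gap symbol.

```python
def pad_with_gaps(msa, target_depth, gap="-"):
    """Return a copy of `msa` padded with all-gap rows up to `target_depth`."""
    if not msa:
        raise ValueError("cannot infer the alignment length of an empty MSA")
    if target_depth < len(msa):
        raise ValueError("target_depth is smaller than the MSA depth")
    length = len(msa[0])
    return list(msa) + [gap * length] * (target_depth - len(msa))


# Toy MSAs of different depths (hypothetical data).
msa_a = ["ACD", "A-D", "CCD"]
msa_b = ["WYF", "WYY", "WFF", "YYF", "WYW"]
padded_a = pad_with_gaps(msa_a, len(msa_b))
```

Sequences that end up paired with a dummy row can then be treated as
unpaired when reading off the result.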

### How DiffPaSS works: soft scores, differentiable optimization, bootstrap

Check [our paper](https://openreview.net/forum?id=n5hO5seROB) for
details of the DiffPaSS and DiffPaSS-IPA algorithms. Briefly, the main
ingredients are as follows:

1.  Using “soft” scores that differentiably extend information-theoretic
    scores between two paired multiple sequence alignments (MSAs), or
    scores based on sequence similarity or graph similarity measures.

2.  The (truncated) Sinkhorn operator for smoothly parametrizing “soft
    permutations”, and the matching operator for parametrizing real
    permutations [\[Mena et al.,
    2018\]](https://openreview.net/forum?id=Byt3oJ-0W).

3.  A novel and efficient bootstrap technique, motivated by mathematical
    results and heuristic insights into this smooth optimization
    process. See the animation below for an illustration.

4.  A notion of “robust pairs” that can be used to identify pairs that
    are consistently found throughout a DiffPaSS bootstrap. These pairs
    can be used as ground truths in another DiffPaSS run, giving rise to
    the DiffPaSS-Iterative Pairing Algorithm (DiffPaSS-IPA).
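
To give a feel for the second ingredient, the following is a minimal
NumPy sketch of a truncated Sinkhorn operator: it alternately normalizes
the rows and columns of `exp(log_alpha / tau)` for a fixed number of
iterations, working in log space for numerical stability. The function
name, the argument names and the default values are assumptions made for
illustration; the DiffPaSS modules implement this in PyTorch and may
differ in detail.

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along `axis`, keeping dimensions."""
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))


def truncated_sinkhorn(log_alpha, n_iter=20, tau=1.0):
    """Approximately doubly stochastic matrix obtained from a square matrix.

    `tau` is a temperature: smaller values give matrices that are closer
    to hard permutations, but typically need more iterations.
    """
    log_p = np.asarray(log_alpha, dtype=float) / tau
    for _ in range(n_iter):
        log_p = log_p - _logsumexp(log_p, axis=1)  # normalize rows
        log_p = log_p - _logsumexp(log_p, axis=0)  # normalize columns
    return np.exp(log_p)


# Toy parameter matrix (hypothetical values).
log_alpha = np.array([[1.0, 0.0, 0.5], [0.2, 1.5, 0.0], [0.0, 0.3, 1.0]])
soft_perm = truncated_sinkhorn(log_alpha, n_iter=20, tau=1.0)
```

Every step is differentiable with respect to `log_alpha`, which is what
allows gradient-based optimization over soft permutations. A hard
permutation can then be obtained by solving a linear assignment problem
on the same matrix, for example with
`scipy.optimize.linear_sum_assignment`.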

<p>
<video src="https://github.com/Bitbol-Lab/DiffPaSS/assets/46537483/e411fe8c-2fed-4723-a25c-ff69a1abccec" width="432" height="243" controls>
</video>
<em>The DiffPaSS bootstrap technique and robust pairs</em>
</p>
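
One simple way to picture robust pairs is to count how often each pair
appears across the pairings produced during a bootstrap and keep the
pairs that appear often enough. The sketch below does exactly that; the
`min_fraction` threshold and the representation of a pairing as a list
of indices are assumptions made for illustration, and the criterion
described in the paper may differ.

```python
from collections import Counter


def robust_pairs(pairings, min_fraction=1.0):
    """Pairs (i, j) found in at least `min_fraction` of the given pairings.

    Each pairing is a list in which entry `i` is the index `j` of the
    partner assigned to item `i`.
    """
    counts = Counter(pair for pairing in pairings for pair in enumerate(pairing))
    threshold = min_fraction * len(pairings)
    return sorted(pair for pair, count in counts.items() if count >= threshold)


# Pairings from three bootstrap iterations (hypothetical data).
runs = [[0, 1, 2, 3], [0, 2, 1, 3], [0, 1, 3, 2]]
always_found = robust_pairs(runs)
mostly_found = robust_pairs(runs, min_fraction=0.6)
```

Pairs selected in this way could then be supplied as fixed pairs to a
subsequent run, in the spirit of DiffPaSS-IPA.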

## Install

Clone this repository on your local machine by running

``` sh
git clone git@github.com:Bitbol-Lab/DiffPaSS.git
```

and move inside the root folder. We recommend creating and activating a
dedicated conda or virtualenv Python virtual environment. Then, make an
editable install of the package:

``` sh
python -m pip install -e .
```

## Tutorials

See the
[`mutual_information_msa_pairing.ipynb`](https://github.com/Bitbol-Lab/DiffPaSS/blob/main/mutual_information_msa_pairing.ipynb)
notebook for an example of paired MSA optimization in the case of
well-known prokaryotic datasets, for which ground truth pairings are
given by genome proximity.

## Citation

To cite this work, please refer to the following publication:

``` bibtex
@inproceedings{
  lupo2024diffpass,
  title={DiffPa{SS} {\textendash} Differentiable and scalable pairing of biological sequences using soft scores},
  author={Umberto Lupo and Damiano Sgarbossa and Martina Milighetti and Anne-Florence Bitbol},
  booktitle={ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design},
  year={2024},
  url={https://openreview.net/forum?id=n5hO5seROB}
}
```

## nbdev

Project developed using [nbdev](https://nbdev.fast.ai/).
