Metadata-Version: 2.1
Name: hic3defdr
Version: 0.1.1
Summary: a genome-scale differential loop finder
Home-page: https://bitbucket.org/creminslab/hic3defdr
Author: Thomas Gilgenast
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 2.7
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Description-Content-Type: text/markdown
Requires-Dist: numpy (>=1.14.0)
Requires-Dist: scipy (>=1.2.0)
Requires-Dist: matplotlib (>=2.1.1)
Requires-Dist: mpl-scatter-density (>=0.6)
Requires-Dist: seaborn (>=0.8.0)
Requires-Dist: pandas (>=0.21.0)
Requires-Dist: lib5c (>=0.5.3)
Requires-Dist: dill (>=0.2.9)
Provides-Extra: complete
Requires-Dist: flake8 (>=3.4.1) ; extra == 'complete'
Requires-Dist: nose-exclude (>=0.5.0) ; extra == 'complete'
Requires-Dist: nose (>=1.3.7) ; extra == 'complete'
Requires-Dist: scikit-learn (>=0.20.3) ; extra == 'complete'
Requires-Dist: tqdm (>=4.32.2) ; extra == 'complete'
Provides-Extra: evaluation
Requires-Dist: scikit-learn (>=0.20.3) ; extra == 'evaluation'
Provides-Extra: progress
Requires-Dist: tqdm (>=4.32.2) ; extra == 'progress'
Provides-Extra: test
Requires-Dist: nose (>=1.3.7) ; extra == 'test'
Requires-Dist: nose-exclude (>=0.5.0) ; extra == 'test'
Requires-Dist: flake8 (>=3.4.1) ; extra == 'test'

hic3defdr
=========

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/hic3defdr.svg)
[![PyPI - Wheel](https://img.shields.io/pypi/wheel/hic3defdr.svg)](https://pypi.org/project/hic3defdr)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/hic3defdr.svg)](https://pypi.org/project/hic3defdr)
[![Bitbucket Pipelines](https://img.shields.io/bitbucket/pipelines/creminslab/3defdr-hic.svg)](https://bitbucket.org/creminslab/3defdr-hic/addon/pipelines/home)

A genome-scale differential loop finder.

For the latest source, discussion, etc, please visit the Bitbucket repository at
https://bitbucket.org/creminslab/hic3defdr

Installation
------------

We require Python 2.7 and the dependencies listed in `setup.py`.

A typical quick install process should be:

    $ virtualenv venv
    $ source venv/bin/activate
    (venv)$ python -m pip install -U pip
    (venv)$ pip install hic3defdr

A typical dev-mode install process should be:

    $ git clone https://bitbucket.org/creminslab/hic3defdr.git
    $ cd hic3defdr
    $ virtualenv venv
    $ source venv/bin/activate
    (venv)$ python -m pip install -U pip
    (venv)$ pip install -e .

If installation succeeded then `hic3defdr.HiC3DeFDR` should be importable from
an interactive shell started in some other directory:

    (venv)$ cd <some other directory>
    (venv)$ python
    >>> from hic3defdr import HiC3DeFDR

### Optional dependencies

Evaluating simulations requires scikit-learn:

    (venv)$ pip install scikit-learn

To display progress bars during selected steps of the analysis, install [tqdm](https://github.com/tqdm/tqdm):

    (venv)$ pip install tqdm

To execute tests, install the following:

    (venv)$ pip install nose nose-exclude flake8

Basic walkthrough
-----------------

Before we start, we'll seed numpy's random number generator for reproducibility:

    >>> import numpy as np
    >>> np.random.seed(42)

To analyze the ES_1, ES_3, NPC_1, and NPC_2 reps of the dataset dataset from
[Bonev et al. 2017](https://www.ncbi.nlm.nih.gov/pubmed/29053968) with default
parameters, we would first describe the dataset in terms of replicate names,
chromosome names, and a design matrix. We will just analyze chromosomes 18 and
19 here for illustrative purposes.

    >>> import pandas as pd
    >>>
    >>> repnames = ['ES_1', 'ES_3', 'NPC_1', 'NPC_2']
    >>> #chroms = ['chr%i' % i for i in range(1, 20)] + ['chrX']
    >>> chroms = ['chr18', 'chr19']
    >>> design = pd.DataFrame({'ES': [1, 1, 0, 0], 'NPC': [0, 0, 1, 1]},
    ...                       dtype=bool, index=repnames)

If you're following along, you can download the data like this:

    $ mkdir -p ~/data/bonev
    $ wget -qO- -O tmp.zip https://www.dropbox.com/sh/mq0fpnp4jz59wpo/AAD2FW1Tp_mVKCkxlJoZvxC8a?dl=1 && unzip tmp.zip -x / -d ~/data/bonev && rm tmp.zip

The required input files consist of:

 - upper triangular, raw contact matrices in `scipy.sparse` NPZ format,
 - bias vectors in plain-text `np.savetxt()` format, and
 - loop cluster files in sparse JSON format (see below for more details),
   specifying the locations of loops present in each condition

TODO: explain how to import data from common Hi-C analysis tools into this
format

We would next describe the location of the input data files and use those to
construct a `HiC3DeFDR` object:

    >>> import os.path
    >>> from hic3defdr import HiC3DeFDR
    >>>
    >>> base_path = os.path.expanduser('~/data/bonev/')
    >>> h = HiC3DeFDR(
    ...     raw_npz_patterns=[base_path + '<rep>/<chrom>_raw.npz'.replace('<rep>', repname) for repname in repnames],
    ...     bias_patterns=[base_path + '<rep>/<chrom>_kr.bias'.replace('<rep>', repname) for repname in repnames],
    ...     chroms=chroms,
    ...     design=design,
    ...     outdir='output',
    ...     loop_patterns={c: base_path + 'clusters/%s_<chrom>_clusters.json' % c for c in ['ES', 'NPC']},
    ...     res=10000
    ... )
    creating directory output

This object saves itself to disk, so it can be re-loaded at any time:

    >>> h = HiC3DeFDR.load('output')

To run the analysis for all chromosomes through q-values, run:

    >>> h.run_to_qvalues()

To threshold, cluster, and classify the significantly differential loops, and
collect all the results into a single TSV output file, run:

    >>> h.collect()

The output file will be written to `output/results_0.01_3.tsv`, where "output"
refers to the `outdir` we passed when constructing the `HiC3DeFDR` object,
"0.01" refers to the default FDR of 1%, and "3" refers to the default cluster
size threshold of 3.

    >>> import pandas as pd
    >>> pd.read_csv('output/results_0.05_3.tsv', sep='\t', index_col=0).head()
                                                us_chrom  ...  classification
    loop_id                                               ...                
    chr18:3480000-3500000_chr18:4680000-4710000    chr18  ...    constitutive
    chr18:3490000-3510000_chr18:3790000-3810000    chr18  ...              ES
    chr18:3490000-3510000_chr18:3970000-3990000    chr18  ...    constitutive
    chr18:3490000-3510000_chr18:4170000-4200000    chr18  ...    constitutive
    chr18:3490000-3520000_chr18:4120000-4150000    chr18  ...    constitutive
    <BLANKLINE>
    [5 rows x 9 columns]

See the section "TSV output format" below for more details about the output
format.

Step-by-step walkthrough
------------------------

Calling `h.run_to_qvalues()` runs the four main steps of the analysis in
sequence. These four steps are described in further detail below. Any kwargs
passed to `h.run_to_qvalues()` will be passed along to the appropriate step; see
`help(HiC3DeFDR.run_to_qvalues)` for details.

### Step 1: Preparing input data

The function call `h.prepare_data()` prepares the input raw contact matrices and
bias vectors specified by `h.raw_npz_patterns` and `h.bias_patterns` for all
chromosomes specified in `h.chroms`, performs library size normalization, and
determines what points should be considered for dispersion estimation. This
creates intermediate files in the output directory `h.outdir` that represent the
raw and scaled values, as well as the estimated size factors, and a boolean
vector `disp_idx` indicating which points will be used for dispersion
estimation. If `loop_patterns` was passed to the constructor, an additional
boolean vector `loop_idx` is created to mark which points that pass `disp_idx`
lie within looping interactions specified by `h.loop_patterns`. The raw and
scaled data are stored in a rectangular matrix format where each row is a pixel
of the contact matrix and each column is a replicate. If the size factors are
estimated as a function of distance, the estimated size factors are also stored
in this format. Two separate vectors called `row` and `col` are used to store
the row and column index of the pixel represented by each row of the rectangular
matrices. Together, the `row` and `col` vectors plus any of the rectangular
matrices represent a generalization of a COO format sparse matrix to multiple
replicates (in the standard COO format the `row` and `col` vectors are
complemented by a single `value` vector).

The size factors can be estimated with a variety of methods defined in the
`hic3defdr.scaling` module. The method to use is specified by the `norm` kwarg
passed to `h.prepare_data()`. Some of these methods estimate size factors as a
function of interaction distance, instead of simply estimating one size factor
for each replicate as is common in RNA-seq differential expression analysis.
When these methods are used, the number of bins to use when binning distances
for distance-based estimation of the size factors can be specified with the
`n_bins` kwarg. The defaults for this function use the conditional median of
ratios approach (`hic3defdr.scaling.conditional_mor`) and an
automatically-selected number of bins.

### Step 2: Estimating dispersions

The function call `h.estimate_disp()` estimates the dispersion parameters at
each distance scale in the data and fits a lowess curve through the graph of
distance versus dispersion to obtain final smoothed dispersion estimates for
each pixel.

The `estimator` kwarg on `h.estimate_disp()` specifies which dispersion
estimation method to use, out of a selection of options defined in the
`hic3defdr.dispersion` module. The default is to use quantile-adjusted
conditional maximum likelihood (qCML).

### Step 3: Likelihood ratio test

The function call `h.lrt()` performs a likelihood ratio test (LRT) for each
pixel. This LRT compares a null model of no differential expression (fitting one
true mean parameter shared by all replicates irrespective of condition) to an
alternative model in which the two conditions have different true mean
parameters.

### Step 4: False discovery rate (FDR) control

The function call `h.bh()` performs Benjamini-Hochberg FDR correction on the
p-values called via the LRT in the previous step, considering only a subset of
pixels that are involved in looping interactions (as specified by
`h.loop_patterns`; if `loop_patterns` was not passed to the constructor then all
pixels are included in the BH-FDR correction). This results in final q-values
for each loop pixel.

### Thresholding, clustering, and classification

We can threshold, cluster, classify, and collect the significantly differential
loops:

    >>> h.collect(fdr=0.05, cluster_size=3)

We can also sweep across FDR and/or cluster size thresholds:

    >>> h.collect(fdr=[0.01, 0.05], cluster_size=[3, 4])

The final output file for each combination of thresholding parameters will be
written to `<h.outdir>/results_<fdr>_<cluster_size>.tsv`.

This example walkthrough should take less than 5 minutes for the two chromosomes
included in the demo data. Run time for a whole-genome analysis will depend on
parallelization as discussed in the next section.

Parallelization
---------------

The functions `run_to_qvalues()`, `prepare_data()`, `lrt()`, `threshold()`,
`classify()`, and `simulate()` operate in a per-chromosome manner. By default,
each chromosome in the dataset will be processed in series. If multiple cores
and sufficient memory are available, you can use the `n_threads` kwarg on these
functions to use multiple subprocesses to process multiple chromosomes in
parallel. Pass either a desired number of subprocesses to use, or pass
`n_threads=-1` to use all available cores. The output logged to stderr will be
truncated to reduce clutter from multiple subprocesses printing to stderr at the
same time. This truncation is controlled by the `verbose` kwarg which is
available on some of these parallelized functions.

The function `estimate_disp()` also accepts an `n_threads` kwarg, using it to
parallelize itself across distance scales.

The function `run_to_qvalues()` passes the `n_threads` kwarg through to all the
steps it calls.

Intermediates and final output files
------------------------------------

All intermediates used in the computation will be saved to the disk inside the
`outdir` folder as `<intermediate>_<chrom>.npy` (most intermediates),
`<intermediate>_<chrom>.json`/`<intermediate>_<chrom>.tsv` (thresholded or
classified clusters).

Two intermediates are not generated per chromosome. These are
`disp_fn_<c>.pickle` (dispersion function estimated across all chromosomes for
condition `<c>`) and `results_<f>_<s>.tsv` (final combined results from all
chromosomes).

| Step              | Intermediate      | Shape                               | Description                                 |
|-------------------|-------------------|-------------------------------------|---------------------------------------------|
| `prepare_data()`  | `row`             | `(n_pixels,)`                       | Top-level row index                         |
| `prepare_data()`  | `col`             | `(n_pixels,)`                       | Top-level column index                      |
| `prepare_data()`  | `bias`            | `(n_bins, n_reps)`                  | Bias vectors                                |
| `prepare_data()`  | `raw`             | `(n_pixels, n_reps)`                | Raw count values                            |
| `prepare_data()`  | `size_factors`    | `(n_reps,)` or `(n_pixels, n_reps)` | Size factors                                |
| `prepare_data()`  | `scaled`          | `(n_pixels, n_reps)`                | Normalized count values                     |
| `prepare_data()`  | `disp_idx`        | `(n_pixels,)`                       | Marks pixels for which dispersion is fitted |
| `prepare_data()`  | `loop_idx`        | `(disp_idx.sum(),)`                 | Marks pixels which lie in loops             |
| `estimate_disp()` | `cov_per_bin`     | `(n_bins, n_conds)`                 | Average mean count or distance in each bin  |
| `estimate_disp()` | `disp_per_bin`    | `(n_bins, n_conds)`                 | Pooled dispersion estimates in each bin     |
| `estimate_disp()` | `disp_fn_<c>`     | pickled function                    | Fitted dispersion function                  |
| `estimate_disp()` | `disp`            | `(disp_idx.sum(), n_conds)`         | Smoothed dispersion estimates               |
| `lrt()`           | `mu_hat_null`     | `(disp_idx.sum(),)`                 | Null model mean parameters                  |
| `lrt()`           | `mu_hat_alt`      | `(disp_idx.sum(), n_conds)`         | Alternative model mean parameters           |
| `lrt()`           | `llr`             | `(disp_idx.sum(),)`                 | Log-likelihood ratio                        |
| `lrt()`           | `pvalues`         | `(disp_idx.sum(),)`                 | LRT-based p-value                           |
| `bh()`            | `qvalues`         | `(loop_idx.sum(),)`                 | BH-corrected q-values                       |
| `threshold()`     | `sig_<f>_<s>`     | JSON/TSV                            | Significantly differential clusters         |
| `threshold()`     | `insig_<f>_<s>`   | JSON/TSV                            | Constitutive clusters                       |
| `classify()`      | `<c>_<f>_<s>`     | JSON/TSV                            | Classified differential clusters            |
| `collect()`       | `results_<f>_<s>` | TSV                                 | Final results table                         |

The table uses these abbreviations to refer to variable parts of certain
intermediate names:

 - `<f>`: FDR threshold
 - `<s>`: cluster size threshold
 - `<c>`: condition/class label

Sparse JSON cluster format
--------------------------

This is the format used both for supplying pre-identified, per-condition loop
clusters as input to HiC3DeFDR as well as the format in which differential and
constitutive interaction clusters are reported as output.

The format describes the clusters on each chromosome in a separate JSON file.
This JSON file contains a single JSON object, which is a list of list of list of
integers. The inner lists are all of length 2 and represent specific pixels of
the heatmap for that chromosome in terms of there row and column coordinates in
zero-indexed bin units. The outer middle lists can be of any length and
represent groups of pixels that belong to the same cluster. The length of the
outer list corresponds to the number of clusters on that chromosome.

These clusters can be loaded into and written from the corresponding plain
Python objects using `hic3defdr.clusters.load_clusters()` and
`hic3defdr.clusters.save_clusters()`, respectively. The plain Python objects can
be converted to and from scipy sparse matrix objects using
`hic3defdr.clusters.clusters_to_coo()` and
`hic3defdr.clusters.convert_cluster_array_to_sparse()`, respectively.

TSV output format
-----------------

The final TSV output file `results_<f>_<s>.tsv` has as its first column a
`loop_id`, a string of the form
`<us_chrom>:<us_start>-<us_end>_<ds_chrom>:<ds_start>-<ds_end>` specifying a
rectangle that surrounds the cluster of pixels that make up the loop. The next
six columns are the six individual pieces of this `loop_id`. This is followed
by:
 - `cluster_size`: the number of pixels in the cluster
 - `cluster`: a list of the exact (row, col) indices of the pixels in the
   cluster
 - `classification`: this will be "constitutive" if the loop is not
   differential, or the name of the condition the loop is specific to (i.e.,
   strongest in) if it is differential

Visualizations
--------------

The `HiC3DeFDR` object can be used to draw visualizations of the analysis.

The visualization functions are wrapped with the [`@plotter` decorator](https://lib5c.readthedocs.io/en/latest/plotting/)
and therefore all support the convenience kwargs provided by that decorator
(such as `outfile`).

### Distance dependence curves before and after scaling

    >>> _ = h.plot_dd_curves('chr18', outfile='dd.png')

![](images/dd.png)

### Simple heatmap plotting

    >>> _ = h.plot_heatmap('chr18', slice(1000, 1100), slice(1000, 1100), rep='ES_1',
    ...                    outfile='heatmap.png')

![](images/heatmap.png)

By default, this plots data at the "scaled" stage (normalized for bias and
sequencing depth differences), but you can plot any stage of the data by passing
a `stage` kwarg.

### Condition-average heatmap plotting

To plot a within-condition average heatmap, pass a `stage` name with a '_mean'
suffix appended and `cond` to specify the condition to average within:

    >>> _ = h.plot_heatmap('chr18', slice(1000, 1100), slice(1000, 1100),
    ...                    stage='scaled_mean', cond='ES',
    ...                    outfile='heatmap_mean.png')

![](images/heatmap_mean.png)

### Loop zoomin heatmap plotting

We can combine `h.plot_heatmap()` with the `load_clusters()` and
`cluster_to_slices()` utility functions in `hic3defdr.util.clusters` to plot
zoomins around specific loops:

    >>> from hic3defdr.util.clusters import load_clusters, cluster_to_slices
    >>> chrom = 'chr18'
    >>> clusters = load_clusters(h.loop_patterns['ES'].replace('<chrom>', chrom))
    >>> slices = cluster_to_slices(clusters[23])
    >>> _ = h.plot_heatmap(chrom, *slices, rep='ES_1', outfile='zoomin.png')

![](images/zoomin.png)

### Per-pixel significance plotting

We can pass `stage='qvalues'` to `h.plot_heatmap()` to draw heatmaps showing the
significance of each pixel:

    >>> _ = h.plot_heatmap('chr18', slice(1310, 1370), slice(1310, 1370),
    ...                    stage='qvalues', cmap='bwr_r', vmin=0.099, vmax=0.101,
    ...                    outfile='heatmap_sig.png')

![](images/heatmap_sig.png)

By passing `cmap='bwr_r'` we ensure that significant, low q-value pixels will be
red while insignificant, high q-value pixels will be blue. By passing
`vmin=0.099, vmax=0.101`, we ensure that the colorbar is focused on a narrow
range around an FDR threshold of 10%, allowing us to more easily tell the
difference between significant and insignificant pixels.

### Correlation matrices

    >>> _ = h.plot_correlation_matrix(outfile='correlation.png')

![](images/correlation.png)

### Dispersion fitting

    >>> _ = h.plot_dispersion_fit('ES', outfile='ddr.png')

![](images/ddr.png)

It's possible to use the the one-dimensional distance dependence curve to
convert distances to means. Doing so allows plotting the y-axis in terms of
variance. You can do this by passing `yaxis='var'`:

    >>> _ = h.plot_dispersion_fit('ES', yaxis='var', outfile='dvr.png')

![](images/dvr.png)

Using the same trick, you can plot the x-axis in terms of mean by passing
`xaxis='mean'`:

    >>> _ = h.plot_dispersion_fit('ES', xaxis='mean', yaxis='var',
    ...                           outfile='mvr.png')

![](images/mvr.png)

At low mean and high distance, the distance dependence curve flattens out and
the data become more noisy, making this conversion unreliable.

It's also possible to show the dispersion fitted at just one distance scale,
overlaying the sample mean and sample variance across replicates for each pixel
as a blue hexbin plot:

    >>> _ = h.plot_dispersion_fit('ES', distance=25, hexbin=True, xaxis='mean',
    ...                           yaxis='var', logx=True, logy=True,
    ...                           outfile='mvr_25.png')

![](images/mvr_25.png)

If dispersion was fitted against distance rather than mean, pass `xaxis='dist'`
to plot dispersion/variance versus distance.

### Comparing dispersion fits

It's possible to compare different dispersion fits using the function
`compare_disp_fits()` as shown here:

    >>> from hic3defdr import compare_disp_fits
    >>>
    >>> _ = compare_disp_fits(
    ...     [h.load_disp_fn(cond) for cond in h.design.columns],
    ...     h.design.columns,
    ...     max_dist=100,
    ...     outfile='disp_comparison.png'
    ... )

![](images/disp_comparison.png)

### P-value distribution

    >>> _ = h.plot_pvalue_distribution(outfile='pvalue_dist.png')

![](images/pvalue_dist.png)

By default, this plots the p-value distribution over all pixels for which
dispersion was estimated. To plot the p-value distribution only over points in
loops, pass `idx='loop'`.

### Q-value distribution

    >>> _ = h.plot_qvalue_distribution(outfile='qvalue_dist.png')

![](images/qvalue_dist.png)

### MA plot

    >>> _ = h.plot_ma(outfile='ma.png')

![](images/ma.png)

### Pixel detail grid

    >>> _ = h.plot_grid('chr18', 2218, 2236, 20, outfile='grid.png')

![](images/grid.png)

The upper right heatmaps show the balanced and scaled values in each replicate,
with each condition on its own row.

The upper left heatmaps show the alternative model mean parameter estimates for
each condition. Significantly differential clusters are purple while
constitutive ones are gray.

The lower left heatmap shows the q-values. Significantly differential clusters
are orange while constitutive ones are gray.

The stripplots in the lower left show details information about the specific
pixel in the center of the heatmaps (in this example `(2218, 2236)`). The dots
show the values at that pixel for each replicate in normalized and raw space,
repsectively. The solid and dashed lines represent the mean parameters under the
alt and null models, repsectively.

Green points in the heatmaps represent points that have been filtered out. For
the per-replicate heatmaps in the upper right of the grid, the only filters
applied are the zero filter, bias filter, and distance filter. For the alt model
mean heatmaps in the upper left, this additionally includes the dispersion
filter. For the q-value heatmap in the lower left, it additionally includes the
loop filter if loop locations were supplied.

### Interactive thresholding

In a Jupyter notebook environment with `ipywidgets` installed, you can play with
thresholds on a live-updating plot by running:

    %matplotlib notebook

    from ipywidgets import interact
    from hic3defdr import HiC3DeFDR

    h = HiC3DeFDR.load('output')
    _, _, outline_clusters = h.plot_grid('chr18', 2218, 2236, 50)
    _ = interact(outline_clusters, fdr=[0.01, 0.05, 0.1, 0.2],
                 cluster_size=[3, 4])

Simulation
----------

After the `estimate_disp()` step has been run, a HiC3DeFDR object can be used to
generate simulations of differential looping.

### Generating simulations

To create an ES-based simulation over all chromosomes listed in `h.chroms`, we
run

    >>> from hic3defdr import HiC3DeFDR
    >>>
    >>> h = HiC3DeFDR.load('output')
    >>> h.simulate('ES')

If we passed `trend='dist'` to `h.estimate_disp()`, we need to pass it to
`h.simulate()` as well to ensure that the simulation function knows to treat the
previously-fitted dispersion function as a function of distance.

This takes the mean of the real scaled data across the ES replicates and
perturbs the loops specified in `h.loop_patterns['ES']` up or down at random to
generate two new conditions called "A" and "B". The scaled mean matrices for
these conditions are then biased and scaled by the bias vectors and size factors
taken from the real experimental replicates, and the ES dispersion function
fitted to the real ES data is applied to the biased and scaled means to obtain
dispersion values. These means and dispersions are used to draw an NB random
variable for each pixel of each simulated replicate. The number of replicates in
each of the simulated conditions "A" and "B" will match the design of the real
analysis.

The simulated raw contact matrices will be written to disk in CSR format as
`<cond><rep>_<chrom>_raw.npz` where `<cond>` is "A" or "B" and `<rep>` is the
rep number within the condition. The design matrix will also be written to disk
as `design.csv`.

The true labels used to perturb the loops will also be written to disk as
`labels_<chrom>.txt`. This file contains as many lines as there are clusters in
`h.loop_patterns['ES']`, with the `i`th line providing the label for the `i`th
cluster. This file can be loaded with `np.loadtxt(..., dtype='|S7')`.

### Evaluating simulations

After generating simulated data, HiC3DeFDR can be run on the simulated data.
Then, the true labels can be used to evaluate the performance of HiC3DeFDR on
the simulated data.

Evaluation of simulated data requires scikit-learn. To install this package, run

    (venv)$ pip install scikit-learn

In order to run HiC3DeFDR on the simulated data, we first need to balance the
simulated raw contact matrices to obtain bias vectors for each simulated
replicate and chromosome. We will assume are saved next to the raw contact
matrices and named `<rep>_<chrom>_kr.bias`. One example of how this can be done
is to use the [hiclite library](https://bitbucket.org/creminslab/hiclite) and
the following script:

    >>> import sys
    >>>
    >>> import numpy as np
    >>> import scipy.sparse as sparse
    >>>
    >>> from hiclite.steps.filter import filter_sparse_rows_count
    >>> from hiclite.steps.balance import kr_balance
    >>>
    >>>
    >>> infile_pattern = 'sim/<rep>_<chrom>_raw.npz'
    >>> repnames = ['A1', 'A2', 'B1', 'B2']
    >>> chroms = ['chr18', 'chr19']
    >>>
    >>> for repname in repnames:
    ...     for chrom in chroms:
    ...         sys.stderr.write('balancing rep %s chrom %s\n' % (repname, chrom))
    ...         infile = infile_pattern.replace('<rep>', repname)\
    ...             .replace('<chrom>', chrom)
    ...         outfile = infile.replace('_raw.npz', '_kr.bias')
    ...         _, bias, _ = kr_balance(
    ...             filter_sparse_rows_count(sparse.load_npz(infile)), fl=0)
    ...         np.savetxt(outfile, bias)

Next, we create a new HiC3DeFDR object to analyze the simulated data and run the
analysis through to q-values:

    >>> import os.path
    >>> from hic3defdr import HiC3DeFDR
    >>>
    >>> repnames = ['A1', 'A2', 'B1', 'B2']
    >>> chroms = ['chr18', 'chr19']
    >>> sim_path = 'sim/'
    >>> base_path = os.path.expanduser('~/data/bonev/')
    >>> h_sim = HiC3DeFDR(
    ...     raw_npz_patterns=[sim_path + '<rep>_<chrom>_raw.npz'.replace('<rep>', repname) for repname in repnames],
    ...     bias_patterns=[sim_path + '<rep>_<chrom>_kr.bias'.replace('<rep>', repname) for repname in repnames],
    ...     chroms=chroms,
    ...     design=sim_path + 'design.csv',
    ...     outdir='output-sim',
    ...     loop_patterns={'ES': base_path + 'clusters/ES_<chrom>_clusters.json'}
    ... )
    creating directory output-sim
    >>> h_sim.run_to_qvalues()

Next, we can evaluate the simulation against the clusters in
`h_sim.loop_patterns['ES']` with true labels from `sim/labels_<chrom>.txt`:

    >>> h_sim.evaluate('ES', 'sim/labels_<chrom>.txt')

This writes a file in `h_sim`'s output directory called `eval.npz`. This file
can be loaded with `np.load()` and has four keys whose values are all one
dimensional vectors:

 - `'thresh'`: the thresholds (in `1 - qvalue` space) which make up the convex
   edge of the ROC curve; all other vectors are parallel to this one
 - `'fdr'`: the observed false discovery rate at each threshold
 - `'tpr'`: the observed true positive rate at each threshold
 - `'fpr'`: the observed false positive rate at each threshold

`eval.npz` files (possibly across many runs) can be visualized as ROC curves and
FDR control curves by running:

    >>> import numpy as np
    >>> from hic3defdr import plot_roc, plot_fdr
    >>>
    >>> _ = plot_roc([np.load('output-sim/eval.npz')], ['hic3defdr'], outfile='roc.png')
    >>> _ = plot_fdr([np.load('output-sim/eval.npz')], ['hic3defdr'], outfile='fdr.png')

![](images/roc.png)
![](images/fdr.png)

Multiple `eval.npz` files can be compared in the same plot by simply adding
elements to the lists in these function calls.

The ROC plot shows FPR versus TPR, with the gray diagonal line representing the
performance of random guessing. The AUROC for each curve is shown in the legend.
If only one curve is plotted, selected thresholds (in units of FDR threshold)
are annotated with black arrows.

The FDR control plot shows the observed FDR as a function of the FDR threshold.
Points below the gray diagonal line represent points at which FDR is
successfully controlled.

As an added bonus, it's also possible to evaluate the performance on specific
subsets of distance scales by using the `min_dist` and `max_dist` kwargs on
`HiC3DeFDR.evaluate()` as illustrated below:

    >>> dist_bins = [
    ...    ('short', (None, 15)),  # distances in bin units
    ...    ('mid', (16, 30)),
    ...    ('long', (31, None))
    ... ]
    >>> for _, (min_dist, max_dist) in dist_bins:
    ...     h_sim.evaluate('ES', 'sim/labels_<chrom>.txt', min_dist=min_dist,
    ...                    max_dist=max_dist)
    >>> _ = plot_roc([np.load('output-sim/eval_%s_%s.npz' % (min_dist, max_dist))
    ...               for _, (min_dist, max_dist) in dist_bins],
    ...              [label for label, _ in dist_bins],
    ...              outfile='roc_by_dist.png')
    >>> _ = plot_fdr([np.load('output-sim/eval_%s_%s.npz' % (min_dist, max_dist))
    ...               for _, (min_dist, max_dist) in dist_bins],
    ...              [label for label, _ in dist_bins],
    ...              outfile='fdr_by_dist.png')

![](images/roc_by_dist.png)
![](images/fdr_by_dist.png)

It's also possible to compare the FPR and FNR at the different subsets:

    >>> from hic3defdr import plot_fn_vs_fp
    >>>
    >>> _ = plot_fn_vs_fp([np.load('output-sim/eval_%s_%s.npz' % (min_dist, max_dist))
    ...                    for _, (min_dist, max_dist) in dist_bins],
    ...                   [label for label, _ in dist_bins], xlabel='distance subset',
    ...                   outfile='fn_vs_fp.png')

![](images/fn_vs_fp.png)

Other visualizations
--------------------

### Distance bias

We can visualize and compare the distance bias of different simulations (the
degree to which low p-values are enriched or depleted in different distance
scales) using the `plot_distance_bias()` function as shown below:

    >>> from hic3defdr import plot_distance_bias
    >>>
    >>> dist_bins = [
    ...    ('short', (None, 15)),  # distances in bin units
    ...    ('mid', (16, 30)),
    ...    ('long', (31, None))
    ... ]
    >>> _ = plot_distance_bias([h, h_sim], [b for _, b in dist_bins], labels=['real', 'sim'], outfile='distance_bias.png')

![](images/distance_bias.png)

Package structure
-----------------

The `hic3defdr` package is layed out in three parts:

 1. `hic3defdr.util`: library of re-usable functions for performing computations
    related to differential loop calling
 2. `hic3defdr.plotting`: library of re-usable functions to plotting
    visualizations related to differential loop calling
 3. `hic3defdr.analysis`: a module that defines the HiC3DeFDR class, which
    provides an implementation of stitching together all the computational steps
    and visualizations in an easy-to-use way

The HiC3DeFDR class includes many methods, so to keep things organized these
methods are defined in four separate mixin classes which are combined to form
the full HiC3DeFDR class in `hic3defdr/analysis/__init__.py`:

 - `hic3defdr.analysis.core.CoreHiC3DeFDR`
 - `hic3defdr.analysis.analysis.AnalyzingHiC3DeFDR`
 - `hic3defdr.analysis.simulation.SimulatingHiC3DeFDR`
 - `hic3defdr.analysis.plotting.PlottingHiC3DeFDR`

We recommend that most users simply import the HiC3DeFDR class and interact with
this package through that interface, but the functions defined in
`hic3efdr.util` and `hic3defdr.plotting` may also be useful to some users and
are designed to be somewhat re-usable.

The complete layout of the package is summarized below:

```
hic3defdr/                      # package root
├── _version.py                 # manages version information using versioneer
├── analysis/                   # HiC3DeFDR class and alternatives
│   ├── __init__.py             # defines HiC3DeFDR class
│   ├── alternatives.py         # defines alternative analysis models
│   ├── analysis.py             # HiC3DeFDR's pipeline methods
│   ├── core.py                 # HiC3DeFDR's core save/load methods
│   ├── plotting.py             # HiC3DeFDR's plotting methods
│   └── simulation.py           # HiC3DeFDR's simulation/evaluation methods
├── plotting/                   # library of plotting functions
│   ├── dispersion.py           # dispersion/variance fit visualizations
│   ├── distance_bias.py        # distance bias comparison barplots
│   ├── distance_dependence.py  # distance dependence curve comparison
│   ├── fdr.py                  # FDR control curve plotting
│   ├── fn_vs_fp.py             # FN vs FP tradeoff barplots
│   ├── grid.py                 # "pixel detail grid" combination visualization
│   ├── heatmap.py              # simple contact matrix heatmap plotting
│   ├── histograms.py           # p-value/q-value histograms
│   ├── ma.py                   # MA plots
│   └── roc.py                  # ROC curve plots
└── util/                       # library of utility functions 
    ├── binning.py              # creating groups of points
    ├── classification.py       # classifying differential loop pixels
    ├── cluster_table.py        # creating tables summarizing cluster info
    ├── clusters.py             # interacting with called loop clusters
    ├── dispersion.py           # estimating dispersions in NB data
    ├── evaluation.py           # evaluating results of simulations
    ├── printing.py             # printing updates and info to the console
    ├── lowess.py               # lowess fitting
    ├── lrt.py                  # NB likelihood ratio testing
    ├── matrices.py             # interacting with sparse matrices
    ├── parallelization.py      # parallelizing work across cores
    ├── progress.py             # showing progress bars
    ├── scaled_nb.py            # dealing with scaling factors in NB
    ├── scaling.py              # scaling reps to account for sequencing depth
    ├── simulation.py           # simulating pseudoreplicates
    └── thresholding.py         # thresholding differential pixels/clusters
```

Additional options
------------------

Additional options are exposed as kwargs on the functions in this library. Use
`help(<function>)` to get detailed information about the options available for
any function and what these options may be used for.


