Metadata-Version: 2.1
Name: pyPINTS
Version: 0.0.1.post0.dev7
Summary: Peak Identifier for Nascent Transcripts Starts (PINTS)
Home-page: https://pints.yulab.org
Author: Li Yao
Author-email: regulatorygenome@gmail.com
License: GPL
Platform: UNKNOWN
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy (>=1.19.2)
Requires-Dist: pandas (>=1.1.5)
Requires-Dist: scipy (>=1.5.2)
Requires-Dist: pysam (>=0.16.0.1)
Requires-Dist: pybedtools (>=0.8.1)
Requires-Dist: statsmodels (>=0.12.1)
Requires-Dist: pyBigWig (==0.3.18)
Requires-Dist: biopython
Requires-Dist: matplotlib

# PINTS: Peak Identifier for Nascent Transcripts Starts
![](https://img.shields.io/badge/platform-linux%20%7C%20osx-lightgrey.svg)
![](https://img.shields.io/badge/python-3.x-blue.svg)
[![PyPI](https://github.com/liyao001/PINTS/actions/workflows/python-publish.yml/badge.svg)](https://github.com/liyao001/PINTS/actions/workflows/python-publish.yml)

## Installation
PINTS is available on PyPI, which means you can install it with the following command:
```shell
pip install pyPINTS
```
Alternatively, clone this repo to a local directory and run the following command inside it:
```shell
python setup.py install
```

## Prerequisites
Python packages
* biopython
* matplotlib
* numpy
* pandas
* pybedtools
* pyBigWig
* pysam
* requests
* scipy
* statsmodels

## Get started
PINTS can call peaks directly from BAM files. To do so,
provide the path to the BAM file and the type of experiment that generated it.
If it's from a standard protocol, like [PROcap](https://doi.org/10.1038/nprot.2016.086), then you can set `--exp-type PROcap`.  
Other supported experiments include [GROcap](https://doi.org/10.7554/eLife.00808)/
[CoPRO](https://doi.org/10.1038/s41588-018-0234-5)/
[csRNAseq](https://doi.org/10.1101/gr.253492.119)/
[NETCAGE](https://doi.org/10.1038/s41588-019-0485-9)/
[CAGE](https://doi.org/10.1038/nmeth0306-211)/
[RAMPAGE](https://doi.org/10.1101/gr.139618.112)/
[STRIPEseq](https://doi.org/10.1101/gr.261545.120). For a comprehensive list of directly supported assays, run:
```shell
pints_caller --help
```

If the data was generated by another method, you need to tell the tool where to find the ends of the RNAs you are interested in.
For example, `--exp-type R_5` tells the tool that:
   1. this alignment is from a single-end library;
   2. the tool should look at the 5' end of reads.

Other supported values are `R_3`, `R1_5`, `R1_3`, `R2_5`, `R2_3`.
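The naming convention of these custom values can be read mechanically. The sketch below only illustrates that convention; `decode_exp_type` is a hypothetical helper written for this README and is not part of PINTS.

```python
import re

# Hypothetical helper (not part of PINTS): decodes the custom --exp-type
# values into library layout, read, and read end, following the naming
# convention described in this README.
_CUSTOM_EXP_TYPE = re.compile(r"^R([12]?)_([53])$")


def decode_exp_type(value):
    match = _CUSTOM_EXP_TYPE.match(value)
    if match is None:
        raise ValueError(f"not a custom --exp-type value: {value!r}")
    read, end = match.groups()
    return {
        # No read number means a single-end library.
        "layout": "paired-end" if read else "single-end",
        "read": f"read{read}" if read else "read",
        "end": f"{end}'",
    }


for value in ("R_5", "R_3", "R1_5", "R1_3", "R2_5", "R2_3"):
    print(value, decode_exp_type(value))
```

Assay names such as `PROcap` are deliberately rejected by this helper, because their read layout is defined by PINTS itself.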

If reads represent the reverse complement of the original RNAs, as in PROseq, use `--reverse-complement`
(not necessary for standard protocols).

An example of calling peaks from a BAM file:
```shell
pints_caller --bam-file input.bam --save-to output_dir --file-prefix output_prefix --thread 16 --exp-type PROcap
```
Or you can call peaks from BigWig files:
```shell
pints_caller --save-to output_dir --file-prefix output_prefix --bw-pl path_to_pl.bw --bw-mn path_to_mn.bw --thread 16
```
If you want to call peaks from experiments with replicates:
```shell
pints_caller --bam-file input1.bam input2.bam --save-to output_dir --file-prefix output_prefix --thread 16 --exp-type PROcap
```
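If you drive PINTS from a Python pipeline, the replicate command above can be assembled as an argument list. This is a minimal sketch: `build_pints_command` is a hypothetical helper, and it only uses flags that appear in this README.

```python
import shlex


def build_pints_command(bam_files, save_to, file_prefix, exp_type, threads=16):
    """Hypothetical helper: assemble a pints_caller call for BAM inputs."""
    if not bam_files:
        raise ValueError("at least one BAM file is required")
    return [
        "pints_caller",
        "--bam-file", *bam_files,  # one or more replicates
        "--save-to", save_to,
        "--file-prefix", file_prefix,
        "--thread", str(threads),
        "--exp-type", exp_type,
    ]


command = build_pints_command(
    ["input1.bam", "input2.bam"], "output_dir", "output_prefix", "PROcap"
)
print(shlex.join(command))
# To run it for real (requires pyPINTS to be installed):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.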

## Outputs
* prefix+`_{SID}_divergent_peaks.bed`: Divergent TREs;
* prefix+`_{SID}_bidirectional_peaks.bed`: Bidirectional TREs (divergent + convergent);
* prefix+`_{SID}_unidirectional_peaks.bed`: Unidirectional TREs, which may be lncRNAs transcribed from enhancers (e-lncRNAs) as suggested [here](http://www.nature.com/articles/s41576-019-0184-5).

`{SID}` is replaced with the number of the sample that the peaks were called from.
If you provide PINTS with only one sample, `{SID}` is replaced with **1**.
If you provide three replicates (`--bam-file A.bam B.bam C.bam`), `{SID}` for peaks identified from `A.bam` is replaced with 1.
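The naming rule can be sketched in Python to list the files a run is expected to produce. `expected_outputs` is a hypothetical helper, and it assumes that `{SID}` counts up from 1 in the order the BAM files were given; only the value for the first sample is stated above.

```python
from pathlib import Path

KINDS = ("divergent", "bidirectional", "unidirectional")


def expected_outputs(save_to, file_prefix, n_samples):
    """Hypothetical helper: map each sample ID to its three peak files.

    Assumption: {SID} runs from 1 to n_samples, following input order.
    """
    return {
        sid: [Path(save_to) / f"{file_prefix}_{sid}_{kind}_peaks.bed" for kind in KINDS]
        for sid in range(1, n_samples + 1)
    }


for sid, paths in expected_outputs("output_dir", "output_prefix", 3).items():
    print(sid, [p.name for p in paths])
```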

For divergent or bidirectional TREs, there will be 6 columns in the outputs:
1. Chromosome
2. Start site: 0-based
3. End site: 0-based 
4. Confidence about the peak pair. Can be: 
    * `Stringent(qval)`, which means the two peaks on both forward and reverse strands are significant based on their *q*-values; 
    * `Stringent(pval)`, which means one peak is significant according to *q*-value while the other one is significant according to *p*-value; 
    * `Relaxed`, which means only one peak is significant in the pair.
    * A combination of the three types above, because of overlap for nearby elements.
    * If epigenomic annotation is enabled by `--epig-annotation <biosample>`, then peaks that are less significant (`--relaxed-fdr-target`, default is 2*`fdr_target`) but overlap with epigenomic annotations from the PINTS web server will be listed with the confidence level `Marginal`.
5. Major TSSs on the forward strand. Multiple major TSSs are separated by a comma `,`
6. Major TSSs on the reverse strand. Multiple major TSSs are separated by a comma `,`
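A minimal sketch of reading these six columns in Python follows. It assumes the file is tab-delimited without a header line, as is usual for BED; the field names and the example rows are made up for illustration and are not real PINTS output.

```python
import csv
import io
from collections import namedtuple

# Field names are illustrative; PINTS does not write a header line
# (assumption based on the BED convention).
PeakPair = namedtuple("PeakPair", "chrom start end confidence fwd_tss rev_tss")


def read_peak_pairs(handle):
    """Yield one PeakPair per row of a divergent/bidirectional peak file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        # Only the first six columns are used, so an extra epigenomic
        # annotation column does not break parsing.
        chrom, start, end, confidence, fwd, rev = row[:6]
        yield PeakPair(chrom, int(start), int(end), confidence,
                       fwd.split(","), rev.split(","))


# Made-up rows for illustration only.
example = io.StringIO(
    "chr1\t1000\t1300\tStringent(qval)\t1210\t1050\n"
    "chr1\t5000\t5400\tRelaxed\t5300,5310\t5100\n"
)
pairs = list(read_peak_pairs(example))
stringent = [p for p in pairs if p.confidence.startswith("Stringent")]
print(len(pairs), len(stringent))
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.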


For unidirectional TREs, there will be 6 columns in the output:
1. Chromosome
2. Start
3. End
4. Peak ID
5. Q-value
6. Strand
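The layout above can be filtered by *q*-value with a few lines of Python. This is a sketch under assumptions: the file is taken to be tab-delimited without a header, the *q*-value is taken to be a plain number, and `filter_by_qvalue` and the example rows are invented for illustration.

```python
import csv
import io


def filter_by_qvalue(handle, max_qvalue):
    """Hypothetical helper: keep unidirectional peaks with q-value <= max_qvalue."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        chrom, start, end, peak_id, qvalue, strand = row[:6]
        if float(qvalue) <= max_qvalue:
            kept.append((chrom, int(start), int(end), peak_id, float(qvalue), strand))
    return kept


# Made-up rows for illustration only.
example = io.StringIO(
    "chr2\t100\t250\tpeak_a\t0.001\t+\n"
    "chr2\t900\t990\tpeak_b\t0.08\t-\n"
)
for peak in filter_by_qvalue(example, max_qvalue=0.01):
    print(peak)
```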

For all three types of TREs, if a valid biosample name for `--epig-annotation` is provided, then an additional column with epigenomic annotation for each TRE will show up in the final output.

## Parameters
### Input & Output
* If you want to use BAM files as inputs:
   * `--bam-file`: input bam file(s);
   * `--exp-type`: Type of experiment. If the experiment is not listed as a choice, or you know the position of RNA ends on the reads and you want to override the defaults, you can specify: 
     * `R_5` (5' of the read for single-end lib), 
     * `R_3` (3' of the read for single-end lib), 
     * `R1_5` (5' of the read1 for paired-end lib), 
     * `R1_3` (3' of the read1 for paired-end lib), 
     * `R2_5` (5' of the read2 for paired-end lib), 
     * or `R2_3` (3' of the read2 for paired-end lib)
   * `--reverse-complement`: Set this switch if 1) `exp-type` is `Rx_x` and 2) reads in this library represent the reverse complement of RNAs, like PROseq;
   * `--ct-bam`: Bam file for input/control (optional);
* If you want to use bigwig files as inputs:
  * `--bw-pl`: Bigwig for signals on the forward strand;
  * `--bw-mn`: Bigwig for signals on the reverse strand;
  * `--ct-bw-pl`: Bigwig for input/control signals on the forward strand (optional);
  * `--ct-bw-mn`: Bigwig for input/control signals on the reverse strand (optional);
* `--save-to`: save peaks to this path (a folder); defaults to the current folder
* `--file-prefix`: prefix to all outputs

### Optional parameters
* `--epig-annotation <biosample>`: Use this option together with the name of the biosample that the library was derived from, for example K562; then epigenomic annotations will be downloaded from the PINTS web server and used for annotating and augmenting TREs identified by PINTS **(for hg38 only)**;
* `--relaxed-fdr-target <relaxed fdr>`: In the presence of `--epig-annotation`, peaks that do not pass the original FDR cutoff but pass this relaxed cutoff and have support from DNase-seq and H3K27ac ChIP-seq will also be included in final outputs. By default, 2*fdr;
* `--mapq-threshold <min mapq>`: Minimum mapping quality, by default: 30 or `None`;
* `--close-threshold <close distance>`: Distance threshold for two peaks (on opposite strands) to be merged, by default: 300;
* `--fdr-target <fdr>`: FDR target for multiple testing, by default: 0.1;
* `--chromosome-start-with <chromosome prefix>`: Only keep reads mapped to chromosomes with this prefix. If it's set to `None`, all reads will be analyzed;
* `--thread <n thread>`: Max number of threads the tool can create;
* `--borrow-info-reps`: Borrow information from reps to refine calling of divergent elements;
* `--output-diagnostic-plot`: Save diagnostic plots (independent filtering and *p*-value distribution) to a local folder

More parameters can be seen by running `pints_caller -h`.
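How `--fdr-target` and `--relaxed-fdr-target` interact can be sketched as follows. This is an illustration of the rule described above, not PINTS's actual code: `classify_peak` and its labels `significant` and `rejected` are hypothetical, and the use of a strict `<` comparison is an assumption.

```python
def classify_peak(qvalue, fdr_target=0.1, relaxed_fdr_target=None, has_epig_support=False):
    """Hypothetical sketch of the two-tier FDR rule.

    has_epig_support stands for overlap with the epigenomic annotations
    that --epig-annotation downloads.
    """
    if relaxed_fdr_target is None:
        relaxed_fdr_target = 2 * fdr_target  # documented default
    if qvalue < fdr_target:
        return "significant"
    if has_epig_support and qvalue < relaxed_fdr_target:
        return "Marginal"
    return "rejected"


for q, support in [(0.05, False), (0.15, False), (0.15, True), (0.25, True)]:
    print(q, support, classify_peak(q, has_epig_support=support))
```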

## Other tools
* `pints_boundary_extender`: Extend peaks from summits.
* `pints_visualizer`: Generate bigwig files for the inputs.
* `pints_normalizer`: Normalize inputs.

## Tips
1. Be cautious with reads mapped to scaffolds instead of the main chromosomes (for example, the notorious `chrUn_gl000220` in `hg19`); they may be rRNA contamination!

## Contact
Please submit an issue if you have any questions or run into any bugs.


