Metadata-Version: 2.1
Name: cpdseqer
Version: 0.1.3
Summary: CPDseq data analysis
Home-page: https://github.com/shengqh/cpdseqer
Author: Quanhu Sheng
Author-email: quanhu.sheng.1@vumc.org
License: UNKNOWN
Download-URL: https://github.com/shengqh/cpdseqer/archive/v0.1.3.tar.gz
Description: # CPDSeqer
        
        This package is used to do CPD sequence data analysis.
        
        # Prerequisites
        
        Install tabix
        
        ```
        apt-get install -y tabix
        ```
        
        # Installation
        
        Install python main package
        
        ```
        pip install git+git://github.com/shengqh/cpdseqer.git
        ```
        
        # Usage
        
        ## Demultiplex fastq file
        
        ```
        usage: cpdseqer demultiplex [-h] -i [INPUT] -o [OUTPUT] -b [BARCODEFILE]
        
        optional arguments:
          -h, --help            show this help message and exit
          -i [INPUT], --input [INPUT]
                                Input fastq gzipped file
          -o [OUTPUT], --output [OUTPUT]
                                Output folder
          -b [BARCODEFILE], --barcodeFile [BARCODEFILE]
                                Tab-delimited file, first column is barcode, second
                                column is sample name
        ```
        for example:
        ```
        cpdseqer demultiplex -i sample.fastq.gz -o . -b barcode.txt
        ```
        
        Example of [barcode.txt](https://github.com/shengqh/cpdseqer/raw/master/data/barcode.txt), columns are separated by tab
        ```
        ATCGCGAT        Control
        GAACTGAT        UV
        ```
        
        ## Align reads to genome using bowtie2
        
        ```
        bowtie2 -p 8  -x hg38/bowtie2_index_2.3.5.1/GRCh38.p12.genome -U Control.fastq.gz \
          --sam-RG ID:Control --sam-RG LB:Control --sam-RG SM:Control --sam-RG PL:ILLUMINA \
          -S Control.sam 2> Control.log
          
        samtools view -Shu -F 256 Control.sam | samtools sort -o Control.bam -T Control -
        
        samtools index Control.bam
        
        rm Control.sam
        ```
        
        You can download hg38 bowtie2 index files:
        
        ```
        mkdir hg38
        cd hg38
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/hg38_bowtie2.tar.gz
        tar -xzvf hg38_bowtie2.tar.gz
        cd ..
        ```
        
        ## Extract dinucleotide from bam file
        
        ```
        usage: cpdseqer bam2dinucleotide [-h] -i [INPUT] -g [GENOME_SEQ_FILE]
                                         [-q [MAPPING_QUALITY]] -o [OUTPUT]
        
        optional arguments:
          -h, --help            show this help message and exit
          -i [INPUT], --input [INPUT]
                                Input BAM file
          -g [GENOME_SEQ_FILE], --genome_seq_file [GENOME_SEQ_FILE]
                                Input genome seq file
          -q [MAPPING_QUALITY], --mapping_quality [MAPPING_QUALITY]
                                Minimum mapping quality of read (default 20)
          -o [OUTPUT], --output [OUTPUT]
                                Output file name
        ```
        
        for example:
        
        ```
        cpdseqer bam2dinucleotide \
          -i Control.bam \
          -g hg38/bowtie2_index_2.3.4.1/GRCh38.p12.genome.fa \
          -o Control.dinucleotide.bed.bgz
        ```
        
        You can download bam files:
        
        ```
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam.bai
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam.bai
        ```
        
        ## Get statistic based on bed file
        
        ```
        usage: cpdseqer statistic [-h] -i [INPUT] -c [COORDINATE_LIST_FILE] [-s]
                                  [--category_index CATEGORY_INDEX] [--add_chr] -o
                                  [OUTPUT]
        
        optional arguments:
          -h, --help            show this help message and exit
          -i [INPUT], --input [INPUT]
                                Input dinucleotide list file, first column is file
                                location, second column is file name
          -c [COORDINATE_LIST_FILE], --coordinate_list_file [COORDINATE_LIST_FILE]
                                Input coordinate list file, first column is file
                                location, second column is file name
          -s, --space           Use space rather than tab in coordinate files
          --category_index CATEGORY_INDEX
                                Zero-based category column index in coordinate file
          --add_chr             Add chr in chromosome name in coordinate file
          -o [OUTPUT], --output [OUTPUT]
                                Output file name
        
        ```
        
        for example:
        
        ```
        cpdseqer statistic --category_index 3 \
          -i cpd__fileList1.list \
          -c cpd__fileList2.list \
          -o cpd.dinucleotide.statistic.tsv
        ```
        
        cpd__fileList1.list (separated by tab)
        
        ```
        Control.dinucleotide.bed.bgz      Control
        UV.dinucleotide.bed.bgz           UV
        ```
        
        cpd__fileList2.list (separated by tab)
        
        ```
        hg38_promoter.bed      Promoter
        hg38_tf.bed            TFBinding
        ```
        
        You can download example files as following scripts. The hg38_promoter.bed contains three columns only. So, Promoter (from  cpd__fileList2.list definition) will be used as category name for all entries in the hg38_promoter.bed. The hg38_tf.bed contians four columns. The forth column in hg38_tf.bed indicates TF name which will be used as category name (--category_index 3) instead of TFBinding.
        
        ```
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz.tbi
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz.tbi
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList1.list
        wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList2.list
        wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_promoter.bed
        wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_tf.bed
        
        
        ```
        
        # Running cpdseqer using singularity
        
        We also build docker container for cpdseqer which can be used by singularity.
        
        ## Running directly
        
        ```
        singularity exec -e docker://shengqh/cpdseqer bowtie2 -h
        singularity exec -e docker://shengqh/cpdseqer cpdseqer -h
        ```
        
        ## Convert docker image to singularity image first
        
        ```
        singularity build cpdseqer.simg docker://shengqh/cpdseqer
        singularity exec -e cpdseqer.simg bowtie2 -h
        singularity exec -e cpdseqer.simg cpdseqer -h
        ```
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
