Metadata-Version: 1.1
Name: RiboCode
Version: 1.2.3
Summary: A package for identifying the translated ORFs using ribosome-profiling data
Home-page: https://github.com/xzt41/RiboCode
Author: Zhengtao Xiao
Author-email: xzt13@mails.tsinghua.edu.cn
License: MIT
Description: Detect translated ORFs using ribosome-profiling data
        ====================================================
        
        *RiboCode* is a very simple but high-quality computational algorithm to
        identify genome-wide translated ORFs using ribosome-profiling data.
        
        Dependencies:
        -------------
        
        - pysam
        
        - pyfasta
        
        - h5py
        
        - Biopython
        
        - Numpy
        
        - Scipy
        
        - matplotlib
        
        - setuptools
        
        Installation
        ------------
        
        *RiboCode* can be installed like any other Python package. Here are some
        popular ways:
        
        * Install from PyPI:
        
        .. code-block:: bash
        
           pip install RiboCode
        
        * Install from a local archive:
        
        .. code-block:: bash
        
           pip install RiboCode-*.tar.gz
        
        If you do not have administrator permissions, install *RiboCode* in your own directory by adding the
        option ``--user`` to the installation commands. Then add ``~/.local/bin/`` to the ``PATH`` variable
        and ``~/.local/lib/`` to the ``PYTHONPATH`` variable. For example, if you are using the bash shell,
        add the following lines to your ``~/.bashrc`` file:
        
        .. code-block:: bash
        
           export PATH=$PATH:$HOME/.local/bin/
           export PYTHONPATH=$HOME/.local/lib/python2.7
        
        Then source your ``~/.bashrc`` file with this command:
        
        .. code-block:: bash
        
           source ~/.bashrc
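        
        The ``python2.7`` part of the path above depends on the interpreter in use. As a small
        sketch (not part of *RiboCode*), the standard-library ``site`` module reports where
        ``--user`` installs go for the interpreter that runs it. The ``bin`` sub-directory is
        an assumption that holds on typical Linux systems:
        
        .. code-block:: python
        
           import os
           import site
        
           # Base directory of per-user installs (typically ~/.local on Linux).
           user_base = site.getuserbase()
           # Directory where "pip install --user" places importable packages.
           user_site = site.getusersitepackages()
        
           # Assumption: console scripts are placed in <user_base>/bin (Linux layout).
           print("scripts directory:", os.path.join(user_base, "bin"))
           print("packages directory:", user_site)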
        
        Tutorial to analyze ribosome-profiling data and run *RiboCode*
        --------------------------------------------------------------
        
        Here, we use the `HEK293 dataset`_ as an example to illustrate the use of *RiboCode*.
        Please make sure the file paths are correct.
        
        1. **Required files**
        
           The genome FASTA file and the GTF annotation file can be downloaded from:
        
        
           http://www.gencodegenes.org
        
           or from:
        
           http://asia.ensembl.org/info/data/ftp/index.html
        
           http://useast.ensembl.org/info/data/ftp/index.html
        
           For example, the required files in this tutorial can be downloaded from the following URLs:
        
           GTF: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
        
           FASTA: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz
        
           The raw Ribo-seq FASTQ file can be downloaded using the fastq-dump tool from `SRA_Toolkit`_:
        
           .. code-block:: bash
        
              fastq-dump -A <SRR1630831>
        
        2. **Trimming adapter sequence for ribo-seq data**
        
           Using cutadapt program https://cutadapt.readthedocs.io/en/stable/installation.html
        
           Example:
        
           .. code-block:: bash
        
              cutadapt -m 20 --match-read-wildcards -a <Adapter sequence> -o <Trimmed fastq file> <Input fastq file>
        
        
           Here, the adapter sequences for this dataset have already been trimmed off, so we can skip this step.
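        
           One way to check whether reads look trimmed is to tabulate their lengths. The sketch
           below is only an illustration and not part of *RiboCode*: the function name and the
           toy reads are made up, and it assumes an uncompressed FASTQ stream with exactly four
           lines per record, read with a Python 3 interpreter.
        
           .. code-block:: python
        
              from collections import Counter
              import io
        
              def read_length_counts(handle):
                  """Count read lengths in a FASTQ stream with 4 lines per record."""
                  counts = Counter()
                  for i, line in enumerate(handle):
                      if i % 4 == 1:  # the sequence is the second line of each record
                          counts[len(line.strip())] += 1
                  return counts
        
              # Tiny in-memory FASTQ standing in for a real file (made-up reads).
              reads = ["ACGT" * 7, "ACGT" * 7 + "A", "ACGT" * 7]
              demo = io.StringIO("".join(
                  "@r%d\n%s\n+\n%s\n" % (i, seq, "I" * len(seq))
                  for i, seq in enumerate(reads)
              ))
              for length, n in sorted(read_length_counts(demo).items()):
                  print(length, n)
        
           For a real file, pass ``open("<Input fastq file>")`` instead of the in-memory stream.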
        
        3. **Removing ribosomal RNA (rRNA) derived reads**
        
           Align the trimmed reads to rRNA sequences using Bowtie, then select unaligned reads for the next step.
        
           Bowtie program http://bowtie-bio.sourceforge.net/index.shtml
        
           rRNA sequences: We provide an `rRNA.fa`_ file in the data folder of this package.
        
           Example:
        
           .. code-block:: bash
        
              bowtie-build <rRNA.fa> rRNA
              bowtie -p 8 --norc --un un_aligned.fastq rRNA -q <SRR1630831.fastq> <HEK293_rRNA.align>
        
        4. **Aligning the clean reads to reference genome**
        
           Using STAR program: https://github.com/alexdobin/STAR
        
           Example:
        
           (1). Build index
        
           .. code-block:: bash
        
              STAR --runThreadN 8 --runMode genomeGenerate --genomeDir <hg19_STARindex> \
                --genomeFastaFiles <hg19_genome.fa> --sjdbGTFfile <gencode.v19.annotation.gtf>
        
           (2). Alignment:
        
           .. code-block:: bash
        
              STAR --outFilterType BySJout --runThreadN 8 --outFilterMismatchNmax 2 --genomeDir <hg19_STARindex> \
                --readFilesIn <un_aligned.fastq> --outFileNamePrefix <HEK293> \
                --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts \
                --outFilterMultimapNmax 1 --outFilterMatchNmin 16
        
        5. **Running RiboCode to identify translated ORFs**
        
           (1). Preparing the transcripts annotation files:
        
           .. code-block:: bash
        
              prepare_transcripts -g <gencode.v19.annotation.gtf> -f <hg19_genome.fa> -o <RiboCode_annot>
        
           (2). Selecting the length range of the RPF reads and identifying the P-site locations:
        
           .. code-block:: bash
        
              metaplots -a <RiboCode_annot> -r <HEK293Aligned.toTranscriptome.out.bam>
        
        
           This step generates a PDF file and a predefined P-site parameters file. The PDF file plots the aggregate profiles
           of the distance between the 5'-end of reads and the annotated start codons or stop codons. The P-site parameters file
           defines the read lengths that show strong 3-nt periodicity and the P-site location for each length. Users can modify
           this file according to the plots in the PDF file.
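        
           As a rough illustration of what "strong 3-nt periodicity" means, the sketch below
           computes the fraction of P-site counts that fall in each reading frame. It is not
           *RiboCode*'s implementation: the function name and the toy counts are hypothetical,
           and positions are assumed to start at the first base of the start codon.
        
           .. code-block:: python
        
              def frame_fractions(psite_counts):
                  """Return the fraction of P-site counts in frames 0, 1 and 2.
        
                  ``psite_counts`` lists counts per nucleotide, starting at the first
                  base of the start codon, so position ``i`` belongs to frame ``i % 3``.
                  """
                  totals = [0, 0, 0]
                  for i, count in enumerate(psite_counts):
                      totals[i % 3] += count
                  total = sum(totals)
                  if total == 0:
                      return [0.0, 0.0, 0.0]
                  return [t / float(total) for t in totals]
        
              # Made-up counts for four codons; most reads sit on frame 0.
              toy = [8, 1, 1, 6, 2, 0, 9, 0, 1, 7, 1, 0]
              print(frame_fractions(toy))
        
           A read length whose frame 0 fraction clearly dominates the other two is the kind of
           length one would expect to keep in the P-site parameters file.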
        
           (3). Detecting translated ORFs using the ribosome-profiling data:
        
           .. code-block:: bash
        
              RiboCode -a <RiboCode_annot> -c <config.txt> -l no -o <RiboCode_ORFs_result>
        
        
           Users can use or modify the config file generated by the previous step to specify the BAM file and the P-site parameters.
           Please refer to the example file `config.txt`_ in the data folder.
        
           **Explanation of final result files**
        
           *RiboCode* generates two text files:
           "(output file name).txt" contains information on the predicted ORFs in each
           transcript; "(output file name)_collapsed.txt" combines the ORFs that share the
           same stop codon across different transcript isoforms, keeping the one with the most
           upstream in-frame ATG.
           Some column names of the result file::
        
            - ORF_ID: The identifier of the predicted ORF.
            - ORF_type: The type of ORF. The following ORF categories are reported:
        
             "annotated" (overlapping annotated CDS, with the same stop codon as the annotated CDS)
        
             "uORF" (upstream of annotated CDS, not overlapping annotated CDS)
        
             "dORF" (downstream of annotated CDS, not overlapping annotated CDS)
        
             "Overlap_uORF" (upstream of annotated CDS, overlapping annotated CDS)
        
             "Overlap_dORF" (downstream of annotated CDS, overlapping annotated CDS)
        
             "Internal" (inside annotated CDS, but in a different frame relative to the annotated CDS)
        
             "novel" (in non-coding genes or non-coding transcripts of coding genes).
        
            - ORF_tstart, ORF_tstop: the start and end of the ORF in the RNA transcript (1-based coordinates)
            - ORF_gstart, ORF_gstop: the start and end of the ORF in the genome (1-based coordinates)
            - pval_frame0_vs_frame1: significance level of the P-site density of frame0 being greater than that of frame1
            - pval_frame0_vs_frame2: significance level of the P-site density of frame0 being greater than that of frame2
            - pval_combined: integrated P-value
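        
           The columns listed above can be used to filter the result file. The sketch below is
           only an illustration: it assumes the result file is tab-separated with a header row,
           the function name and the ``alpha`` cutoff are hypothetical, the rows are made up, and
           it is written for a Python 3 interpreter.
        
           .. code-block:: python
        
              import csv
              import io
        
              def significant_orfs(handle, alpha=0.05):
                  """Yield (ORF_ID, ORF_type, pval_combined) for rows with p < alpha."""
                  for row in csv.DictReader(handle, delimiter="\t"):
                      pval = float(row["pval_combined"])
                      if pval < alpha:
                          yield row["ORF_ID"], row["ORF_type"], pval
        
              # Made-up rows standing in for "(output file name)_collapsed.txt".
              demo = io.StringIO(
                  "ORF_ID\tORF_type\tpval_combined\n"
                  "orf_a\tannotated\t1e-12\n"
                  "orf_b\tuORF\t0.2\n"
                  "orf_c\tnovel\t0.003\n"
              )
              for orf in significant_orfs(demo):
                  print(orf)
        
           For a real run, pass an open file handle for the result file instead of ``demo``.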
        
           (4). (Optional) Plotting the P-site densities of predicted ORFs:
        
           Users can plot the P-site density of predicted ORFs using the "parsing_plot_orf_density" command, as in the example below:
        
           .. code-block:: bash
        
              parsing_plot_orf_density -a <RiboCode_annot> -c <config.txt> -t <transcript_id> \
                -s <ORF_gstart> -e <ORF_gstop>
        
        
        For any questions, please contact:
        ----------------------------------
        
           Zhengtao Xiao (xzt13@mails.tsinghua.edu.cn)
        
           Rongyao Huang (THUhry12@163.com)
        
           Xudong Xing (xudonxing_bioinf@sina.com)
        
        .. _SRA_Toolkit: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
        .. _HEK293 dataset: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1630831
        .. _config.txt: https://github.com/xzt41/RiboCode/blob/master/data/config.txt
        .. _rRNA.fa: https://github.com/xzt41/RiboCode/blob/master/data/rRNA.fa
        
Keywords: ribo-seq ribosome-profiling ORF
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Environment :: Console
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
