Metadata-Version: 1.1
Name: pepdata
Version: 0.6.4
Summary: Python interface to IEDB and other immune epitope data
Home-page: https://github.com/hammerlab/pepdata
Author: Alex Rubinsteyn
Author-email: alex {dot} rubinsteyn {at} mssm {dot} edu
License: http://www.apache.org/licenses/LICENSE-2.0.html
Description: pepdata
        =======
        
        An important aspect of computational immunology is modeling the
        properties of `peptides <http://en.wikipedia.org/wiki/Peptide>`__ (short
        strings of amino acids). Peptides can arise as substrings
        `cut <http://en.wikipedia.org/wiki/Proteolysis>`__ out of a larger
        protein, occur naturally as `small
        proteins <http://en.wikipedia.org/wiki/Alpha-Amanitin>`__, or be
        `synthesized <http://en.wikipedia.org/wiki/Peptide_synthesis>`__
        for therapeutic purposes. To make useful clinical and research
        predictions (e.g. "which peptides should go in this vaccine?") we need
        to partition the combinatorial space of peptides into classes such as
        `T-cell epitopes <http://en.wikipedia.org/wiki/Epitope>`__ or
        `MHC <http://en.wikipedia.org/wiki/Major_histocompatibility_complex>`__
        ligands. One way to capture such distinctions is to collect large
        volumes of data about peptides and use that data to build statistical
        models of their immune properties. This library helps you build such
        models by providing simple Python/NumPy/Pandas interfaces to commonly
        used immunology and bioinformatics datasets.
        
        **Data Sources**
        
        -  ``iedb``: `Immune Epitope Database <http://www.iedb.org>`__, a large
           collection of epitope assay results for MHC binding as well as
           T-cell/B-cell responses
        -  ``tcga``: Variant peptide substrings extracted from
           `TCGA <http://en.wikipedia.org/wiki/The_Cancer_Genome_Atlas>`__
           mutations across all cancer types
        -  ``reference``: Peptide substrings from the `human reference protein
           sequence <ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/pep/>`__
        -  ``imma2``: IMMA2 epitope immunogenic vs. non-immunogenic data set
           used by Tung et al. for evaluating the
           `POPISK <http://www.biomedcentral.com/1471-2105/12/446>`__
           immunogenicity predictor
        -  ``calis``: Two datasets used in Calis et al.'s `Properties of MHC
           Class I Presented Peptides That Enhance
           Immunogenicity <http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003266#pcbi.1003266.s005>`__
        -  ``hpv``: `Human Papillomavirus T cell Antigen
           Database <http://cvc.dfci.harvard.edu/cvccgi/hpv/>`__
        -  ``toxin``: Toxic protein sequences from `Animal Toxin
           Database <http://protchem.hunnu.edu.cn/toxin/>`__
        -  ``danafarber``: `Dana Farber Repository for Machine Learning in
           Immunology <http://bio.dfci.harvard.edu/DFRMLI/>`__
        -  ``tantigen``: `Tumor T-cell Antigen
           Database <http://cvc.dfci.harvard.edu/tadb/>`__
        -  ``hiv_frahm``: Reactions to HIV epitopes across different ethnicities
           (from `LANL HIV
           Databases <http://www.hiv.lanl.gov/content/immunology/hlatem/study1/index.html>`__)
        -  ``cri_tumor_antigens``: Tumor associated peptides from `Cancer
           Immunity <http://cancerimmunity.org/peptide/mutations/>`__
        -  ``fritsch_neoepitopes``: Mutated and wildtype tumor epitopes from
           Fritsch et al. `HLA-binding properties of tumor neoepitopes in
           humans <http://cancerimmunolres.aacrjournals.org/content/early/2014/03/01/2326-6066.CIR-13-0227.abstract>`__
        
        Planned:
        
        -  ``bcipep``: `B-cell
           epitopes <http://www.imtech.res.in/raghava/bcipep/data.html>`__
        
        **Dataset API**
        
        When a dataset consists only of an unlabeled list of epitopes, it
        needs only two functions (here ``wuzzle`` is a placeholder for the
        dataset name):
        
        -  ``load_wuzzle``: Returns a set of amino acid strings
        -  ``load_wuzzle_ngrams``: Returns an array whose rows are amino acid
           strings transformed into n-gram vector space
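        
        As a rough illustration of what "n-gram vector space" means here, the
        sketch below counts overlapping n-grams in each peptide and lays the
        counts out over a fixed vocabulary. It is plain Python written for this
        example and is not the library's implementation: the names
        ``ngram_counts`` and ``ngram_matrix`` are hypothetical, and it returns
        nested lists rather than an array.
        
        ```python
        from collections import Counter
        from itertools import product
        
        AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
        
        
        def ngram_counts(peptide, n=2):
            """Count every overlapping n-gram in a peptide string."""
            return Counter(peptide[i:i + n] for i in range(len(peptide) - n + 1))
        
        
        def ngram_matrix(peptides, n=2):
            """Return (rows, vocabulary) with one count vector per peptide."""
            vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
            rows = []
            for peptide in peptides:
                counts = ngram_counts(peptide, n)
                # A Counter yields 0 for n-grams the peptide does not contain.
                rows.append([counts[gram] for gram in vocabulary])
            return rows, vocabulary
        
        
        rows, vocabulary = ngram_matrix(["SIINFEKL", "AAAA"], n=2)
        ```
        
        With ``n=2`` the vocabulary has 400 entries (20 x 20), so each peptide
        becomes a 400-dimensional count vector.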
        
        If the dataset contains additional data about the epitopes (such as HLA
        type or source protein):
        
        -  ``load_wuzzle``: Returns a data frame with epitope strings and
           additional properties
        -  ``load_wuzzle_set``: Returns a set of epitope amino acid strings
        -  ``load_wuzzle_ngrams``: Returns an array whose rows are amino acid
           strings transformed into n-gram vector space
        
        If the dataset is labeled (contains positive and negative assay
        results), then the following functions should be available:
        
        -  ``load_wuzzle``: Load all available data from the "wuzzle" dataset
           (filtered by options such as ``mhc_class``).
        -  ``load_wuzzle_values``: Group the dataset by epitope string and
           associate each epitope with the positive and negative counts, along
           with percentage of positive results (in a column called "value").
        -  ``load_wuzzle_classes``: Split the epitopes into positive and
           negative classes, return a set of strings for each.
        -  ``load_wuzzle_ngrams``: Transform the amino acid string
           representation (or some reduced alphabet) into vectors of n-gram
           frequencies, return a sklearn-compatible ``(samples, labels)`` pair
           of arrays.
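        
        The grouping and class-splitting steps described for labeled datasets
        can be sketched in plain Python as follows. This is only an
        illustration under stated assumptions: the records are invented, the
        helper names ``group_values`` and ``split_classes`` are hypothetical,
        "value" is stored as a fraction between 0 and 1, and the 0.5 cutoff
        between classes is a guess rather than the library's rule.
        
        ```python
        from collections import defaultdict
        
        # Hypothetical assay records: (epitope, assay_was_positive)
        records = [
            ("SIINFEKL", True),
            ("SIINFEKL", True),
            ("SIINFEKL", False),
            ("GILGFVFTL", True),
            ("AAAAAAAAA", False),
        ]
        
        
        def group_values(records):
            """Map each epitope to its pos/neg counts and positive fraction."""
            counts = defaultdict(lambda: {"pos": 0, "neg": 0})
            for epitope, positive in records:
                counts[epitope]["pos" if positive else "neg"] += 1
            return {
                epitope: dict(c, value=c["pos"] / (c["pos"] + c["neg"]))
                for epitope, c in counts.items()
            }
        
        
        def split_classes(values, threshold=0.5):
            """Split epitopes into positive and negative sets (assumed cutoff)."""
            positive = {e for e, v in values.items() if v["value"] >= threshold}
            negative = set(values) - positive
            return positive, negative
        
        
        values = group_values(records)
        positive, negative = split_classes(values)
        ```
        
        Here ``values["SIINFEKL"]`` holds two positive results and one negative
        result, so that epitope lands in the positive set under the assumed
        cutoff.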
        
        **Amino Acid Properties**
        
        The ``amino_acid`` module contains a variety of physical/chemical
        properties for both single amino acid residues and interactions between
        pairs of residues.
        
        Single residue feature tables are parsed into ``StringTransformer``
        objects, which can be used as dictionaries and can also vectorize a
        string through their ``transform_string`` method.
        
        Examples of single residue features:
        
        -  ``hydropathy``
        -  ``volume``
        -  ``polarity``
        -  ``pK_side_chain``
        -  ``prct_exposed_residues``
        -  ``hydrophilicity``
        -  ``accessible_surface_area``
        -  ``refractivity``
        -  ``local_flexibility``
        -  ``accessible_surface_area_folded``
        -  ``alpha_helix_score`` (Chou-Fasman)
        -  ``beta_sheet_score`` (Chou-Fasman)
        -  ``turn_score`` (Chou-Fasman)
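        
        To make the "dictionary that can also vectorize a string" idea
        concrete, here is a minimal stand-in. ``ResidueFeature`` is a
        hypothetical class written for this example, not the library's
        ``StringTransformer``; only the method name ``transform_string`` comes
        from the description above. The numbers are Kyte-Doolittle hydropathy
        values for a handful of residues, used purely for illustration, and
        may differ from the table the library ships.
        
        ```python
        class ResidueFeature(dict):
            """Hypothetical stand-in for a per-residue feature table."""
        
            def transform_string(self, peptide):
                # Look up each residue in turn; unknown residues raise KeyError.
                return [self[residue] for residue in peptide]
        
        
        # Kyte-Doolittle hydropathy for a few residues (illustration only).
        hydropathy = ResidueFeature({
            "A": 1.8, "I": 4.5, "L": 3.8, "K": -3.9, "E": -3.5, "S": -0.8,
        })
        
        single = hydropathy["I"]                       # dictionary-style lookup
        vector = hydropathy.transform_string("ALKES")  # one value per residue
        ```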
        
        Pairwise interaction tables are parsed into nested dictionaries, so that
        the interaction between amino acids ``x`` and ``y`` can be determined
        from ``d[x][y]``.
        
        Pairwise interaction dictionaries:
        
        -  ``strand_vs_coil`` (and its transpose ``coil_vs_strand``)
        -  ``helix_vs_strand`` (and its transpose ``strand_vs_helix``)
        -  ``helix_vs_coil`` (and its transpose ``coil_vs_helix``)
        -  ``blosum30``
        -  ``blosum50``
        -  ``blosum62``
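        
        The relationship between a nested pairwise dictionary and its
        transpose can be sketched as below. The ``transpose`` helper and the
        toy scores are assumptions made for this example; they are not taken
        from the library or from any real interaction table.
        
        ```python
        def transpose(table):
            """Swap the two key levels of a nested dictionary."""
            flipped = {}
            for x, row in table.items():
                for y, value in row.items():
                    flipped.setdefault(y, {})[x] = value
            return flipped
        
        
        # Made-up, asymmetric scores for two residues (not real data).
        strand_vs_coil_toy = {
            "A": {"A": 0.1, "G": -0.4},
            "G": {"A": 0.7, "G": 0.2},
        }
        coil_vs_strand_toy = transpose(strand_vs_coil_toy)
        ```
        
        After the call, ``coil_vs_strand_toy[y][x]`` equals
        ``strand_vs_coil_toy[x][y]`` for every pair of residues in the toy
        table.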
        
        There is also a function to parse the coefficients of the `PMBEC
        similarity matrix <http://www.biomedcentral.com/1471-2105/10/394>`__,
        though this currently lives in the separate ``pmbec`` module.
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
