Metadata-Version: 2.1
Name: mmproteo
Version: 0.2.2
Summary: Mirko meets Proteomics: A PRIDE downloader on steroids for deep learning-based DeNovo Sequencing
Home-page: https://gitlab.com/dacs-hpi/pride-downloader
Author: Mirko Krause
Author-email: krause@codebase.one
License: GPLv3+
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: requests (>=2.22.0)
Requires-Dist: pandas (~=1.1.3)
Requires-Dist: pyteomics (>=4.4.0)
Requires-Dist: wget (>=3.2)
Requires-Dist: pyarrow (>=2.0.0)
Requires-Dist: lxml (>=4.5.0)

# MMProteo

Mirko meets Proteomics




## Getting started

### Requirements

* Python 3.8 (other versions haven't been tested yet)
* preferably Linux (other OSes haven't been tested yet)

If you want, set up a virtual environment to encapsulate this software and its dependencies:

```
virtualenv --python=/usr/bin/python3 venv
source venv/bin/activate
```

### Installation
#### ...from Pip

`pip install mmproteo`

#### ...from Source Code

```
cd mmproteo
python setup.py install
```

#### ...as editable Development Installation from Source Code

```
cd mmproteo
pip install -e .
```

### Running MMProteo

The installations will make `mmproteo` available on your command line (and in your PATH),
so preferably use this anywhere you want.

Alternatively, you can run the `mmproteo-runner.py` script in the mmproteo directory.

If you are crazy, you can also run the `mmproteo/mmproteo/mmproteo.py` script, but this one is 
already part of the installed mmproteo package and shouldn't really be used directly.

### Usage

Simply entering `mmproteo` will already show you the available parameters.
Using `mmproteo --help` you can get a better image of the functionality.
Currently, the following parameters are available:

```
$ mmproteo -h
usage: mmproteo [-h] [-p PRIDE_PROJECT] [-n MAX_NUM_FILES]
                [--count-failed-files] [-d STORAGE_DIR] [-l LOG_FILE]
                [--log-to-stdout] [-t VALID_FILE_EXTENSIONS] [-e] [-x] [-v]
                [--shown-columns SHOWN_COLUMNS]
                {download,info,ls} [{download,info,ls} ...]

positional arguments:
  {download,info,ls}    the list of actions to be performed on the repository.
                        Every action can only occur once. Duplicates are
                        dropped after the first occurrence.

optional arguments:
  -h, --help            show this help message and exit
  -p PRIDE_PROJECT, --pride-project PRIDE_PROJECT
                        the name of the PRIDE project, e.g. 'PXD010000' from '
                        https://www.ebi.ac.uk/pride/ws/archive/peptide/list/pr
                        oject/PXD010000'.For some commands, this parameter is
                        required. (default: None)
  -n MAX_NUM_FILES, --max-num-files MAX_NUM_FILES
                        the maximum number of files to be downloaded. Set it
                        to '0' to download all files. (default: 0)
  --count-failed-files  Count failed files and do not just skip them. This is
                        relevant for the max-num-files parameter. (default:
                        True)
  -d STORAGE_DIR, --storage-dir STORAGE_DIR
                        the name of the directory, in which the downloaded
                        files and the log file will be stored. (default:
                        ./pride)
  -l LOG_FILE, --log-file LOG_FILE
                        the name of the log file, relative to the download
                        directory. (default: downloader.log)
  --log-to-stdout       Log to stdout instead of stderr. (default: False)
  -t VALID_FILE_EXTENSIONS, --valid-file-extensions VALID_FILE_EXTENSIONS
                        a list of comma-separated allowed file extensions to
                        filter files for. An empty list deactivates filtering.
                        Capitalization does not matter. (default: )
  -e, --no-skip-existing
                        Do not skip existing files. (default: False)
  -x, --no-extract      Do not try to extract downloaded files. (default:
                        False)
  -v, --verbose         Increase output verbosity to debug level. (default:
                        False)
  --shown-columns SHOWN_COLUMNS
                        a list of comma-separated column names. Some commands
                        show their results as tables, so their output columns
                        will be limited to those in this list. An empty list
                        deactivates filtering. Capitalization matters.
                        (default: )
```

There are multiple commands (positional arguments) as well as several optional arguments available.
A single command doesn't require or use all the available arguments. However, one can provide multiple
commands (separated by spaces), which will all use their share of the given arguments.

### Examples

#### Request Project Information for the Project PXD010000

```
$ mmproteo -p PXD010000 info
2021-01-21 23:02:14,035 - MMProteo: Logging to <stderr> and to file ./pride/downloader.log
2021-01-21 23:02:14,035 - MMProteo: Requesting project summary from https://www.ebi.ac.uk:443/pride/ws/archive/project/PXD010000
2021-01-21 23:02:15,847 - MMProteo: Received project summary for project "PXD010000"
{
    "accession": "PXD010000",
    "title": "DeNovo Peptide Identification Deep Learning Training Set",
    "projectDescription": "A benchmark set of bottom-up proteomics data for training deep learning networks. It has data from 51 organisms and includes nearly 1 million peptides.",
    "publicationDate": "2018-06-13",
    "submissionType": "COMPLETE",
    "numAssays": 235,
    "species": [
        "Delftia acidovorans (strain DSM 14801 / SPH-1)",
        "Francisella tularensis subsp. novicida (strain U112)",
        "Cupriavidus necator (strain ATCC 43291 / DSM 13513 / N-1) (Ralstonia eutropha)",
        "Rhizobium radiobacter (Agrobacterium tumefaciens) (Agrobacterium radiobacter)",
        "Shewanella oneidensis (strain MR-1)",
        "Bacteroides thetaiotaomicron (strain ATCC 29148 / DSM 2079 / NCTC 10582 / E50 / VPI-5482)",
        "Synechococcus elongatus (strain PCC 7942) (Anacystis nidulans R2)",
        "bacteria",
        "Micrococcus luteus (Micrococcus lysodeikticus)",
        "Methylomicrobium alcaliphilum (strain DSM 19304 / NCIMB 14124 / VKM B-2133 / 20Z)",
        "Mycobacterium smegmatis",
        "Rhodopseudomonas palustris",
        "Listeria monocytogenes serotype 1/2a (strain 10403S)",
        "Alcaligenes faecalis",
        "Legionella pneumophila",
        "Lactobacillus casei subsp. casei ATCC 393",
        "Acidiphilium cryptum (strain JF-5)",
        "Bacillus cereus (strain ATCC 14579 / DSM 31)",
        "Citrobacter freundii",
        "Chryseobacterium indologenes",
        "Myxococcus xanthus DZ2",
        "Fibrobacter succinogenes subsp. succinogenes S85",
        "Bacteroides fragilis (strain 638R)",
        "Rhodococcus sp. (strain RHA1)",
        "Clostridium ljungdahlii (strain ATCC 55383 / DSM 13528 / PETC)",
        "Coprococcus comes ATCC 27758",
        "Bacillus subtilis subsp. subtilis str. NCIB 3610",
        "Bacillus subtilis subsp. subtilis str. 168",
        "Anaerococcus hydrogenalis DSM 7454",
        "Algoriphagus marincola HL-49",
        "Stigmatella aurantiaca (strain DW4/3-1)",
        "Streptococcus agalactiae",
        "Bifidobacterium longum subsp. infantis (strain ATCC 15697 / DSM 20088 / JCM 1222 / NCTC 11817 / S12)",
        "Sulfobacillus thermosulfidooxidans",
        "Paenibacillus polymyxa ATCC 842",
        "Faecalibacterium prausnitzii SL3/3",
        "Cyanobacterium stanieri",
        "Cellvibrio gilvus (strain ATCC 13127 / NRRL B-14078)",
        "Dorea formicigenerans",
        "Streptomyces griseorubens",
        "Streptomyces sp.",
        "Bifidobacterium bifidum DSM 20456 = JCM 1255",
        "Paracoccus denitrificans",
        "Prevotella ruminicola (strain ATCC 19189 / JCM 8958 / 23)",
        "Campylobacter jejuni",
        "Ruminococcus gnavus ATCC 29149",
        "Pseudomonas putida KT2440",
        "Cellulophaga baltica 18"
    ],
    "tissues": [],
    "ptmNames": [
        "Oxidation"
    ],
    "instrumentNames": [
        "Q Exactive"
    ],
    "projectTags": [],
    "doi": "10.6019/PXD010000",
    "submitter": {
        "title": "Dr",
        "firstName": "Matthew",
        "lastName": "Monroe",
        "email": "matthew.monroe@pnnl.gov",
        "affiliation": "Pacific Northwest National Laboratory"
    },
    "labHeads": [
        {
            "title": "Dr",
            "firstName": "Samuel",
            "lastName": "Payne",
            "email": "samuel.payne@pnnl.gov",
            "affiliation": "Pacific Northwest National Laboratory"
        }
    ],
    "submissionDate": "2018-06-12",
    "reanalysis": "PXD005851",
    "experimentTypes": [
        "Shotgun proteomics"
    ],
    "quantificationMethods": [],
    "keywords": "machine learning, deep learning, bacterial diversity",
    "sampleProcessingProtocol": "Samples were digested with trypsin then analyzed by LC-MS/MS",
    "dataProcessingProtocol": "Data was searched with MSGF+ using PNNL's DMS Processing pipeline",
    "otherOmicsLink": null,
    "numProteins": 1494431,
    "numPeptides": 10324593,
    "numSpectra": 11774613,
    "numUniquePeptides": 7496530,
    "numIdentifiedSpectra": 0,
    "references": []
}

```

#### List all Information about 2 Files of Project PXD010000

```
$ mmproteo -p PXD010000 -n 2 ls
2021-01-21 23:06:08,280 - MMProteo: Logging to <stderr> and to file ./pride/downloader.log
2021-01-21 23:06:08,280 - MMProteo: Requesting list of project files from https://www.ebi.ac.uk/pride/ws/archive/file/list/project/PXD010000
2021-01-21 23:06:08,852 - MMProteo: Received list of 1175 project files
2021-01-21 23:06:08,852 - MMProteo: Showing only the top 2 entries because of the max_num_files parameter
  projectAccession assayAccession fileType fileSource   fileSize                                                                             fileName                                                                                                                                        downloadLink                                                                                                                                asperaDownloadLink
0        PXD010000          93137   RESULT  SUBMITTED   98085358  Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39_msgfplus.mzid.gz  ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39_msgfplus.mzid.gz  prd_ascp@fasp.ebi.ac.uk:pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39_msgfplus.mzid.gz
1        PXD010000          93137     PEAK  SUBMITTED  340204025           Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz           ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz           prd_ascp@fasp.ebi.ac.uk:pride/data/archive/2018/06/PXD010000/Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz
```

#### Show selected Information about 5 RAW or mzML files of Project PXD010000

```
$ mmproteo -p PXD010000 -n 5 --shown-columns "projectAccession,fileName,fileSize" -t "raw,mzml" ls
2021-01-21 23:13:57,067 - MMProteo: Logging to <stderr> and to file ./pride/downloader.log
2021-01-21 23:13:57,067 - MMProteo: Requesting list of project files from https://www.ebi.ac.uk/pride/ws/archive/file/list/project/PXD010000
2021-01-21 23:13:57,679 - MMProteo: Received list of 1175 project files
2021-01-21 23:13:57,679 - MMProteo: Filtering repository files based on the following required file extensions ["mzml", "raw"] and the following optional file extensions ["zip", "gz"]
2021-01-21 23:13:57,680 - MMProteo: Showing only the top 5 entries because of the max_num_files parameter
2021-01-21 23:13:57,681 - MMProteo: Limiting the shown columns according to the shown_columns parameter
   projectAccession                                                                    fileName    fileSize
1         PXD010000  Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.mzML.gz   340204025
3         PXD010000      Biodiversity_S_thermosulf_FeYE_anaerobic_2_01Jun16_Pippin_16-03-39.raw   479591415
6         PXD010000                               Cj_media_MH_R5_23Feb15_Arwen_14-12-03.mzML.gz   578988435
8         PXD010000                                   Cj_media_MH_R5_23Feb15_Arwen_14-12-03.raw  1842140687
11        PXD010000                               Cj_media_MH_R4_23Feb15_Arwen_14-12-03.mzML.gz   698723177
```

Recognize, that the archive extension `.gz` of the mzML files as well as the capitalization of `mzML` 
is ignored while filtering the files. The same would state for the `.zip` extension. Other archive formats
are not supported yet.

## Tests



