Metadata-Version: 2.1
Name: str-analysis
Version: 1.2.9
Summary: Utilities for analyzing short tandem repeats (STRs)
Home-page: https://github.com/broadinstitute/str-analysis
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: google-cloud-storage>=2.14.0
Requires-Dist: hail>=0.2.127
Requires-Dist: ijson>=3.2.2
Requires-Dist: intervaltree>=3.1.0
Requires-Dist: lxml>=4.9.3
Requires-Dist: numpy>=1.20.3
Requires-Dist: pandas>=1.1.4
Requires-Dist: pyfaidx>=0.6.4
Requires-Dist: pybedtools>=0.9.0
Requires-Dist: pyBigWig>=0.3.22
Requires-Dist: pysam>=0.16.0.1
Requires-Dist: requests>=2.25.1
Requires-Dist: simplejson>=3.19.2
Requires-Dist: tqdm>=4.62.3

# str-analysis
---
This repo contains scripts and utilities for analyzing tandem repeats (TRs). 

* Tools:
  * **call_non_ref_motifs** ([docs](https://github.com/broadinstitute/str-analysis/blob/main/docs/call_non_ref_motifs.md)) - takes a bam/cram file and, optionally, an ExpansionHunter variant catalog. Then, for each 
    locus, it determines which STR motifs are supported by reads overlapping that locus before running ExpansionHunter on the motif(s) it detected. 
  * **filter_vcf_to_STR_variants** - takes a single-sample VCF file and filters it to the INS/DEL variants that represent
    tandem repeat expansions or contractions by performing a brute-force k-mer search on each variant's inserted or deleted 
    bases. This tool was a core part of [Weisburd, B., Tiao, G. & Rehm, H. L. Insights from a genome-wide truth set of tandem repeat variation. (2023)](https://www.biorxiv.org/content/10.1101/2023.05.05.539588v1).
  * **merge_loci** - takes one or more STR catalogs and combines them into a single catalog while removing
    duplicates based on overlap and repeat motif. 
  * **annotate_and_filter_str_catalog** - takes an STR catalog and annotates the loci based on their overlap with genes  
    and known disease-associated STRs. It then allows filtering by motif size, gene region, and various other criteria.
  * **compute_catalog_stats** - takes an annotated catalog output by the *annotate_and_filter_str_catalog* script and 
    computes various summary statistics about it.
  * **add_offtarget_regions** - takes an ExpansionHunter variant catalog and adds a list of off-target regions to each
    locus definition by querying a database of off-target regions that have been precomputed for each TR motif.
    This database was generated by using wgsim to simulate fully-repetitive reads for each motif, and then recording
    where these reads mapped on hg19 and hg38 after aligning them using bwa. 
  * **add_adjacent_loci_to_expansion_hunter_catalog** - takes an ExpansionHunter variant catalog and a bed file containing 
    all simple repeats in the reference genome. Outputs a new catalog with updated LocusStructures and ReferenceRegions 
    that include any adjacent repeats found near each locus in the input catalog.   
  * **check_trios_for_mendelian_violations** - takes a table of combined ExpansionHunter calls generated by the 
    **combine_str_json_to_tsv** script (see below) as well as a FAM file. Outputs a new table indicating which calls 
    were transmitted without expansion or contraction, and which were Mendelian violations. 
  * **simulate_str_expansions** - uses wgsim to generate .bam files with simulated read data containing STR expansions 
    at a given locus, and having a given number of repeats, motif, zygosity, etc.
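
To illustrate the kind of brute-force k-mer search mentioned for **filter_vcf_to_STR_variants** above, here is a minimal sketch. It is not the tool's actual implementation: the function name `find_repeat_unit` and the `min_repeats` parameter are hypothetical, and the sketch only recognizes sequences that are an exact whole number of copies of a motif. It does not handle interruptions, partial repeats, or the reference sequence flanking the variant.

```python
def find_repeat_unit(sequence, min_repeats=2):
    """Return (motif, repeat_count) if `sequence` consists entirely of
    repeated copies of a shorter motif, otherwise None.

    Hypothetical helper for illustration only.
    """
    sequence = sequence.upper()
    n = len(sequence)
    # Try every motif length k, shortest first, that could fit at least
    # `min_repeats` whole copies into the sequence.
    for k in range(1, n // min_repeats + 1):
        if n % k != 0:
            continue  # k must divide the sequence length exactly
        motif = sequence[:k]
        if motif * (n // k) == sequence:
            return motif, n // k
    return None


# Inserted or deleted bases from a hypothetical INS/DEL variant
cag_result = find_repeat_unit("CAGCAGCAG")
non_repeat_result = find_repeat_unit("ACGT")
```

Trying the shortest motif first means a homopolymer run such as `AAAA` is reported as four copies of `A` rather than two copies of `AA`.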


* ExpansionHunterDenovo output post-processing:
  * **annotate_EHdn_locus_outliers** - takes an ExpansionHunterDenovo outlier result table (locus outliers or case-control)
    as well as a bed file containing all simple repeats in the reference genome and, optionally, a gene models GTF file, 
    a variant catalog of known disease-associated loci, and/or other bed files with genomic regions of interest. 
    Outputs a new table where each EHdn outlier is annotated with multiple columns related to the provided reference data.
  * **convert_annotated_EHdn_locus_outliers_to_expansion_hunter_catalog** - takes the output table from 
    **annotate_EHdn_locus_outliers** and lets the user apply a range of filters before 
    writing out the passing loci to an ExpansionHunter variant catalog.


* gnomAD STR calls:
  * **generate_gnomad_json** - was used to combine the gnomAD STR calls into the files
    available for [download on the gnomAD website](https://gnomad.broadinstitute.org/downloads#v3-short-tandem-repeats).


* Post-process and combine ExpansionHunter outputs:
  * **combine_str_json_to_tsv** - takes a set of ExpansionHunter json output files and combines them into a single tsv table.
  * **combine_json_to_tsv** - takes a set of arbitrary json files that share the same schema and combines their top-level fields into a single tsv file.
  * **copy_EH_vcf_fields_to_json** - takes the ExpansionHunter output vcf and json file for a given sample and copies fields that are only present in the vcf to the json file.
  * **run_reviewer** - takes ExpansionHunter output files for a single sample and runs REViewer on the subset of loci where the genotypes exceed locus-specific thresholds specified in the variant catalog. 
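
The idea behind **combine_json_to_tsv**, flattening the top-level fields of same-schema json records into one table, can be sketched as follows. This is an illustration under stated assumptions rather than the script's own code: the function name `combine_json_records_to_tsv` and the field names `sample`, `locus`, and `repeats` are hypothetical, and the records are inlined as strings instead of being read from files.

```python
import csv
import io
import json


def combine_json_records_to_tsv(records, out):
    """Write one TSV row per record, using the union of top-level keys
    (in first-seen order) as the columns. Hypothetical helper."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Nested values are serialized back to JSON text so that each one
        # stays within a single cell.
        row = {
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in record.items()
        }
        writer.writerow(row)


# Stand-ins for the contents of two json files that share a schema
records = [
    json.loads(text)
    for text in (
        '{"sample": "s1", "locus": "HTT", "repeats": 19}',
        '{"sample": "s2", "locus": "HTT", "repeats": 42}',
    )
]
buffer = io.StringIO()
combine_json_records_to_tsv(records, buffer)
tsv_text = buffer.getvalue()
```

Records that lack a column get an empty cell (`restval=""`), which keeps the table rectangular even if the schemas differ slightly.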

* Format converters:
  * **convert_bed_to_expansion_hunter_variant_catalog** 
  * **convert_expansion_hunter_variant_catalog_to_gangstr_spec** 
  * **convert_gangstr_spec_to_expansion_hunter_variant_catalog**
  * **convert_expansion_hunter_denovo_locus_tsv_to_bed**
  * **convert_gangstr_vcf_to_expansion_hunter_json** 
  * **convert_hipstr_vcf_to_expansion_hunter_json**
  * **convert_strling_calls_to_expansion_hunter_json** 
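
As a rough sketch of what a bed-to-catalog conversion involves, the snippet below turns bed rows into ExpansionHunter variant catalog entries. It is not the code of **convert_bed_to_expansion_hunter_variant_catalog**. It assumes that the fourth bed column holds the repeat motif, that coordinates carry over unchanged, and that a `LocusId` built from the coordinates and motif is acceptable. The function name `bed_line_to_catalog_entry` and the example row are illustrative.

```python
import json


def bed_line_to_catalog_entry(line):
    """Convert one tab-separated bed row (chrom, start, end, motif) into a
    variant catalog entry. Hypothetical helper for illustration."""
    chrom, start, end, motif = line.rstrip("\n").split("\t")[:4]
    return {
        # Assumption: an id derived from coordinates and motif is good enough
        "LocusId": f"{chrom}-{start}-{end}-{motif}",
        # Assumption: bed coordinates are passed through unchanged
        "ReferenceRegion": f"{chrom}:{start}-{end}",
        "LocusStructure": f"({motif})*",
        "VariantType": "Repeat",
    }


# Illustrative bed row; the fourth column is assumed to be the motif
bed_lines = ["chr4\t3074876\t3074933\tCAG"]
catalog = [bed_line_to_catalog_entry(line) for line in bed_lines]
catalog_json = json.dumps(catalog, indent=2)
```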



## Installation

To install using pip, run:

```
python3 -m pip install --upgrade str_analysis
```

or use the docker image:

```
docker run -it weisburd/str-analysis:latest
```
