Metadata-Version: 2.3
Name: mimeo
Version: 1.1.2
Summary: Scan genomes for internally repeated sequences, elements which are repetitive in another species, or high-identity HGT candidate regions between species.
Project-URL: homepage, https://github.com/adamtaranto/mimeo
Project-URL: documentation, https://github.com/adamtaranto/mimeo
Project-URL: repository, https://github.com/adamtaranto/mimeo
Author-email: Adam Taranto <adam.p.taranto@gmail.com>
License: MIT
License-File: LICENSE
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Requires-Dist: argparse-tui
Requires-Dist: biopython>=1.70
Requires-Dist: pandas>=0.20.3
Provides-Extra: tests
Requires-Dist: pytest; extra == 'tests'
Description-Content-Type: text/markdown

# Mimeo

# Table of contents

* [Modules](#modules)
* [Installing Mimeo](#installing-mimeo)
* [Example usage](#example-usage)
* [Standard options](#standard-options)
  *  [mimeo-self](#mimeo-self)
  *  [mimeo-x](#mimeo-x)
  *  [mimeo-map](#mimeo-map)
  *  [mimeo-filter](#mimeo-filter)
* [Alternative alignment engines](#importing-alignments)
* [License](#license)

# Modules

Mimeo comprises three tools for parsing repeats from whole-genome alignments:

## mimeo-self  

**Internal repeat finder.** Mimeo-self aligns a genome to itself and extracts high-identity segments above
a coverage threshold. This method is less sensitive to disruption by indels and repeat-directed point mutations than
kmer-based methods such as RepeatScout. Reported annotations indicate overlapping segments above the coverage threshold,
mimeo-self does not attempt to separate nested repeats. Use this tool to identify candidate repeat regions for curated annotation.

## mimeo-x  

**Cross-species repeat finder.** A newly acquired or low-copy transposon may slip past copy-number based annotation tools. Mimeo-x searches for features which are abundant in an external reference genome, allowing for
annotation of complete elements as they occur in a horizontal-transfer donor species, or of conserved coding segments
of related transposon families.

## mimeo-map  

**Find all high-identity segments shared between genomes.** Mimeo-map identifies candidate horizontally
transferred segments between sufficiently diverged species. When comparing isolates of a single species, aligned segments correspond to directly homologous sequences and internally repetitive features.  


Intra/Inter-genomic alignments from Mimeo-self or Mimeo-x can be reprocessed with Mimeo-map to generate annotations of
unfiltered/uncollapsed alignments. These raw alignment annotations can be used to interrogate repetitive-segments for coverage breakpoints corresponding to nested transposons with differing abundances across the genome.  

## mimeo-filter

An additional tool **mimeo-filter** is now included to allow post-filtering of SSR-rich sequences from FASTA formatted
candidate-repeat libraries.  


# Installing Mimeo

Requirements: 
  * [LASTZ](http://www.bx.psu.edu/~rsharris/lastz/) genome alignment tool from the Miller Lab, Penn State.
  * [bedtools](http://bedtools.readthedocs.io/en/latest/content/installation.html)
  * [trf](https://tandem.bu.edu/trf/trf.html)


Install from Bioconda:
```bash
conda install mimeo
```

Install from PyPi:
```bash
pip install mimeo
```

Clone and install from this repository:
```bash
git clone https://github.com/Adamtaranto/mimeo.git && cd mimeo && pip install -e .
```

# Example usage 

### mimeo-self

Annotate features in genome A which are > 100bp and occur with >=
80% identity at least 3 times on other scaffolds OR at least 4 times
on the same scaffold.

```bash
mimeo-self --adir data/A_genome_Split --afasta data/A_genome.fasta \
-d MS_outdir --gffout A_genome_Inter3_Intra4_id80_len_100.gff3 \
--outfile A_genome_Self_Align.tab --label A_Rep3 --prefix A_Self --minIdt 80 \
--minLen 100 --minCov 3 --intraCov 4 --strictSelf
```

Output: 
  - MS_outdir/A_genome_Inter3_Intra4_id80_len_100.gff3
  - MS_outdir/A_genome_Self_Align.tab
  - data/A_genome_Split/*.fa

### mimeo-x

Annotate features in genome A which are > 100bp and occur with >=
80% identity at least 5 times in genome B.

```bash
mimeo-x --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \
-d MX_outdir --gffout B_Rep5_in_A.gff3 --outfile B_Reps_in_A_id80_len100.tab \
--label B_Rep5 --prefix B_Rep5 --minIdt 80 --minLen 100 --minCov 5
```

Output: 
  - MX_outdir/B_Rep5_in_A.gff3
  - MX_outdir/B_Reps_in_A_id80_len100.tab

### mimeo-map

Annotate features in genome A which are > 100bp and occur with >=
90% identity in genome B. No coverage filter, all alignments are reported.

```bash
mimeo-map --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \
-d MM_outdir --gffout B_in_A_id90.gff3 --outfile B_in_A_id90.tab \
--label B_90 --prefix B_90 --minIdt 90 --minLen 100 
```

Output: 
  - MM_outdir/B_in_A_id90.gff3
  - MM_outdir/B_in_A_id90.tab

### mimeo-map + SSR filter

Annotate features in genome A which are > 100bp and occur with >=
98% identity in genome B. Reuse B to A-genome alignment from the previous run.

Filter out hits which are >= 40% tandem repeats. Write filtered hits
as tab file and GFF3 annotation.

```bash
mimeo-map --afasta data/A_genome.fasta --bfasta data/B_genome.fasta \
-d MM_outdir --gffout B_in_A_id98_maxSSR40.gff3 --outfile B_in_A_id98.tab \
--label B_98 --prefix B_98 --minIdt 98 --minLen 100 \
--recycle --maxtandem 40 --writeTRF
```

Output: 
  - MM_outdir/B_in_A_id98_maxSSR40.gff3
  - MM_outdir/B_in_A_id98.tab.trf

### mimeo-filter

Filter sequences comprised of >= 40% short tandem repeats from a multifasta
library of candidate transposons.

```bash
mimeo-filter --infile data/candidate_TEs.fa
```

Output:
  - candidate_TEs_filtered.fa


# Standard options

### mimeo-self

```
Usage: mimeo-self [-h] [--adir ADIR] [--afasta AFASTA] [-r] [-d OUTDIR]
                  [--gffout GFFOUT] [--outfile OUTFILE] [--verbose]
                  [--label LABEL] [--prefix PREFIX] [--lzpath LZPATH]
                  [--bedtools BEDTOOLS] [--minIdt MINIDT] [--minLen MINLEN]
                  [--minCov MINCOV] [--hspthresh HSPTHRESH]
                  [--intraCov INTRACOV] [--strictSelf]

Internal repeat finder. Mimeo-self aligns a genome to itself and extracts
high-identity segments above a coverage threshold.

Optional arguments:
  -h, --help      Show this help message and exit.
  --adir          Name of the directory containing sequences from the genome.
                  Write split files here if providing genome as
                  multifasta.
  --afasta        Genome as multifasta.
  -r, --recycle   Use existing alignment "--outfile" if found.
  -d , --outdir   Write output files to this directory. (Default: cwd)
  --gffout        Name of GFF3 annotation file.
  --outfile       Name of alignment result file.
  --verbose       If set report LASTZ progress.
  --label         Set annotation TYPE field in gff.
  --prefix        ID prefix for internal repeats.
  --lzpath        Custom path to LASTZ executable if not in $PATH.
  --bedtools      Custom path to bedtools executable if not in $PATH.
  --minIdt        Minimum alignment identity to report.
  --minLen        Minimum alignment length to report.
  --minCov        Minimum depth of aligned segments to report repeat
                  feature.
  --hspthresh     Set HSP min score threshold for LASTZ.
  --intraCov      Minimum depth of aligned segments from the same scaffold
                  to report feature. Used if "--strictSelf" mode is
                  selected.
  --strictSelf    If set process same-scaffold alignments separately
                  with the option to use higher "--intraCov" threshold.
                  Sometimes useful to avoid false repeat calls from
                  staggered alignments over SSRs or short tandem
                  duplication.
```

### mimeo-x

```
Usage: mimeo-x [-h] [--adir ADIR] [--bdir BDIR] [--afasta AFASTA]
               [--bfasta BFASTA] [-r] [-d OUTDIR] [--gffout GFFOUT]
               [--outfile OUTFILE] [--verbose] [--label LABEL]
               [--prefix PREFIX] [--lzpath LZPATH] [--bedtools BEDTOOLS]
               [--minIdt MINIDT] [--minLen MINLEN] [--minCov MINCOV]
               [--hspthresh HSPTHRESH]

Cross-species repeat finder. Mimeo-x searches for features which are abundant
in an external reference genome.

Optional arguments:
  -h, --help      Show this help message and exit.
  --adir          Name of the directory containing sequences from A genome.
  --bdir          Name of the directory containing sequences from B genome.
  --afasta        A genome as multifasta.
  --bfasta        B genome as multifasta.
  -r, --recycle   Use existing alignment "--outfile" if found.
  -d , --outdir   Write output files to this directory. (Default: cwd)
  --gffout        Name of GFF3 annotation file.
  --outfile       Name of alignment result file.
  --verbose       If set report LASTZ progress.
  --label         Set annotation TYPE field in GFF.
  --prefix        ID prefix for B-genome repeats annotated in A-genome.
  --lzpath        Custom path to LASTZ executable if not in $PATH.
  --bedtools      Custom path to bedtools executable if not in $PATH.
  --minIdt        Minimum alignment identity to report.
  --minLen        Minimum alignment length to report.
  --minCov        Minimum depth of B-genome hits to report feature in
                  A-genome.
  --hspthresh     Set HSP min score threshold for LASTZ.
```

### mimeo-map

```
Usage: mimeo-map [-h] [--adir ADIR] [--bdir BDIR] [--afasta AFASTA]
                 [--bfasta BFASTA] [-r] [-d OUTDIR] [--gffout GFFOUT]
                 [--outfile OUTFILE] [--verbose] [--label LABEL]
                 [--prefix PREFIX] [--keeptemp] [--lzpath LZPATH]
                 [--minIdt MINIDT] [--minLen MINLEN] [--hspthresh HSPTHRESH]
                 [--TRFpath TRFPATH] [--tmatch TMATCH] [--tmismatch TMISMATCH]
                 [--tdelta TDELTA] [--tPM TPM] [--tPI TPI]
                 [--tminscore TMINSCORE] [--tmaxperiod TMAXPERIOD]
                 [--maxtandem MAXTANDEM] [--writeTRF]

Find all high-identity segments shared between genomes.

Optional arguments:
  -h, --help      Show this help message and exit.
  --adir          Name of the directory containing sequences from A genome.
  --bdir          Name of the directory containing sequences from B genome.
  --afasta        A genome as multifasta.
  --bfasta        B genome as multifasta.
  -r, --recycle   Use existing alignment "--outfile" if found.
  -d, --outdir    Write output files to this directory. (Default: cwd)
  --gffout        Name of GFF3 annotation file. If not set, suppress
                  output.
  --outfile       Name of alignment result file.
  --verbose       If set report LASTZ progress.
  --label         Set annotation TYPE field in GFF.
  --prefix        ID prefix for B-genome hits annotated in A-genome.
  --keeptemp      If set does not remove temp files.
  --lzpath        Custom path to LASTZ executable if not in $PATH.
  --minIdt        Minimum alignment identity to report.
  --minLen        Minimum alignment length to report.
  --hspthresh     Set HSP min score threshold for LASTZ.
  --TRFpath       Custom path to TRF executable if not in $PATH.
  --tmatch        TRF matching weight.
  --tmismatch     TRF mismatching penalty.
  --tdelta        TRF indel penalty.
  --tPM           TRF match probability.
  --tPI           TRF indel probability.
  --tminscore     TRF minimum alignment score to report.
  --tmaxperiod    TRF maximum period size to report.
  --maxtandem     Max percentage of an A-genome alignment which may be masked by TRF. 
                  If exceeded, the alignment will be discarded.
  --writeTRF      If set write TRF filtered alignment file for use with
                  other mimeo modules.
```


### mimeo-filter

```
Usage: mimeo-filter [-h] --infile INFILE [-d OUTDIR] [--outfile OUTFILE]
                    [--keeptemp] [--verbose] [--TRFpath TRFPATH]
                    [--tmatch TMATCH] [--tmismatch TMISMATCH]
                    [--tdelta TDELTA] [--tPM TPM] [--tPI TPI]
                    [--tminscore TMINSCORE] [--tmaxperiod TMAXPERIOD]
                    [--maxtandem MAXTANDEM]

Filter SSR containing sequences from FASTA library of repeats.

Optional arguments:
  -h, --help            Show this help message and exit.
  --infile              Name of the directory containing sequences from A genome.
  -d, --outdir          Write output files to this directory. (Default: cwd)
  --outfile             Name of alignment result file.
  --keeptemp            If set does not remove temp files.
  --verbose             If set report LASTZ progress.
  --TRFpath             Custom path to TRF executable if not in $PATH.
  --tmatch              TRF matching weight
  --tmismatch           TRF mismatching penalty.
  --tdelta              TRF indel penalty.
  --tPM                 TRF match probability.
  --tPI                 TRF indel probability.
  --tminscore           TRF minimum alignment score to report.
  --tmaxperiod          TRF maximum period size to report. Note: Setting this
                        score too high may exclude some LTR retrotransposons.
                        Optimal len to exclude only SSRs is 10-50bp.
  --maxtandem           Max percentage of a sequence which may be masked by
                        TRF. If exceeded, the element will be discarded.

```


# Importing alignments

Whole genome alignments generated by alternative tools (i.e. BLAT) can be provided to any of the Mimeo modules
as a tab-delimited file with the columns:

```
[1]   name1     = Name of target sequence in genome A
[2]   strand1   = Strand of alignment in target sequence
[3]   start1    = 5-prime position of alignment in target (lower value irrespective of strand)
[4]   end1      = 3-prime position of alignment in target (higher value irrespective of strand)
[5]   name2     = Name of source sequence in genome B
[6]   strand2   = Strand of alignment in source
[7]   start2+   = 5-prime position of alignment in source (lower value irrespective of strand)
[8]   end2+     = 3-prime position of alignment in source (higher value irrespective of strand)
[9]   score     = Alignment score as int
[10]  identity  = Identity of alignment as float
```

File should be sorted by columns 1,3,4

# License

Software provided under MIT license.