Metadata-Version: 2.1
Name: gcgc
Version: 0.9.2.dev1
Summary: GCGC is a preprocessing library for biological sequence model development.
Home-page: https://github.com/tshauck/gcgc
Author: Trent Hauck
Author-email: trent@trenthauck.com
License: MIT
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# GCGC

> GCGC is a Python package for feature processing on biological sequences.

[![](https://img.shields.io/pypi/v/gcgc.svg)](https://pypi.python.org/pypi/gcgc)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2329966.svg)](https://doi.org/10.5281/zenodo.2329966)

## Installation

Install GCGC via pip:

```sh
$ pip install gcgc
```

## Documentation

The GCGC documentation is hosted at [gcgc.trenthauck.com](http://gcgc.trenthauck.com).
See it for a usage example.

## Citing GCGC

If you use GCGC in your research, cite it with the following:

```bibtex
@misc{trent_hauck_2018_2329966,
  author       = {Trent Hauck},
  title        = {GCGC},
  month        = dec,
  year         = 2018,
  doi          = {10.5281/zenodo.2329966},
  url          = {https://doi.org/10.5281/zenodo.2329966}
}
```


# Changelog

## 0.10.0 (2019-11-09)

`gcgc` has been revamped quite a bit to better support existing processing
pipelines for NLP without trying to do too much. See the docs for more
information about how this works.

## 0.9.0 (2019-08-05)

### Added

- Parser now outputs the length of the tensor not including padding. This is
  useful for packing and length based iteration.
- Generating masked output from the parse_record method is now available.
- Alphabet can include an optional mask token.
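The first item above (reporting the length without padding) can be illustrated with a minimal, dependency-free sketch. It is not GCGC's implementation: `pad_encoded`, `PAD_ID`, and the choice to truncate over-long inputs are all assumptions made for illustration.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical integer id for the padding token


def pad_encoded(
    encoded: List[int], target_len: int, pad_id: int = PAD_ID
) -> Tuple[List[int], int]:
    """Right-pad an integer-encoded sequence and report its unpadded length.

    Inputs longer than target_len are truncated (an assumption of this sketch).
    """
    trimmed = encoded[:target_len]
    unpadded_len = len(trimmed)
    padded = trimmed + [pad_id] * (target_len - unpadded_len)
    return padded, unpadded_len


padded, length = pad_encoded([4, 5, 6], target_len=5)
print(padded, length)  # [4, 5, 6, 0, 0] 3
```

Keeping the unpadded length next to the padded sequence is what allows downstream code to pack batches or iterate only over the real positions.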

### Changed

- A kmer step size can now be specified when supplying a kmer value.
- Renamed `EncodedSeq.integer_encoded` to `EncodedSeq.get_integer_encoding`,
  which takes a `kmer_step_size` to specify the step between kmers when encoding.
- Add parsed_seq_len to the SequenceParser object to control how much padding to
  apply to the end of the integer encoded sequence. This is useful since a batch
  of tensors is expected to have the same size.
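To make the kmer step size concrete, here is a small sketch of a sliding window over a sequence. The `kmers` function and its parameters are hypothetical and only illustrate the idea; they are not the signature of `EncodedSeq.get_integer_encoding`.

```python
from typing import Iterator


def kmers(seq: str, k: int, step: int = 1) -> Iterator[str]:
    """Yield length-k substrings of seq, advancing the window by step each time."""
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive")
    # The last valid window starts at len(seq) - k.
    for start in range(0, len(seq) - k + 1, step):
        yield seq[start:start + k]


print(list(kmers("ATCGAT", k=3, step=1)))  # ['ATC', 'TCG', 'CGA', 'GAT']
print(list(kmers("ATCGAT", k=3, step=3)))  # ['ATC', 'GAT']
```

A step of 1 produces overlapping kmers, while a step equal to `k` produces non-overlapping ones.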

## 0.8.0 (2019-07-04)

### Fixed

- Broken test due to platform differences in `Path.glob` sorting.

### Added

- Users can optionally specify whether to use start or end tokens.

### Removed

- Removed `one_hot_encoding`. Users can one-hot encode the integer output
  themselves if needed, e.g. with `scatter` in PyTorch.
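As a rough illustration of how little code the removed feature needs, the sketch below one-hot encodes a list of integer ids in plain Python. The `one_hot` helper is hypothetical and written without dependencies so it runs anywhere; with PyTorch the same "write a 1 at each indexed column" step is what `Tensor.scatter_` does.

```python
from typing import List


def one_hot(encoded: List[int], vocab_size: int) -> List[List[int]]:
    """Return one row per position, with a 1 in the column given by that position's id."""
    rows = [[0] * vocab_size for _ in encoded]
    for row, token_id in zip(rows, encoded):
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token id {token_id} is outside the vocabulary")
        row[token_id] = 1  # the "scatter" step: write 1 at the indexed column
    return rows


print(one_hot([2, 0, 1], vocab_size=4))
# [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
```

For tensors, something along the lines of `torch.zeros(n, vocab_size).scatter_(1, ids.unsqueeze(1), 1.0)` gives the equivalent result, assuming `ids` is a 1-D integer tensor of length `n`.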

## 0.7.0 (2019-06-22)

### Added

- Properties to access the integer encodings of special tokens. (35cae2a)
  - `Alphabet.encoded_start`
  - `Alphabet.encoded_end`
  - `Alphabet.encoded_padding`
- Remove uniprot dataset creation. (e233162)
- Simplify index handling for GenomicDataset. (3213a9e)

## 0.6.1 (2019-06-10)

### Added

- Updated package management so gcgc is easier to use with other versions of
  torch.

## 0.6.0 (2019-04-04)

### Added

- Ability for kmer size to be passed to an alphabet.

## 0.5.2 (2019-03-21)

### Added

- Add Dockerfile and docker-compose.yml for development.
- `EncodedSeq.shift`, which will shift sequence by an offset integer.
- `EncodedSeq.from_integer_encoded_seq` will take a list of integers and an
  alphabet and return an EncodedSeq object.
- Add the ability to apply a function to the rollout_kmers yielded values.

### Changed

- Alphabet special characters are now located at the start, rather than the end,
  of the letters and token sequence.
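The effect of this ordering can be sketched as follows. The `build_vocab` helper and the special characters `|`, `>`, and `<` are assumptions chosen for illustration, not necessarily the tokens GCGC uses.

```python
from typing import Dict


def build_vocab(letters: str, special_tokens: str = "|><") -> Dict[str, int]:
    """Map each token to an integer id, placing special tokens before the letters."""
    return {token: idx for idx, token in enumerate(special_tokens + letters)}


dna = build_vocab("ATCG")
protein = build_vocab("ACDEFGHIKLMNPQRSTVWY")
print(dna)  # {'|': 0, '>': 1, '<': 2, 'A': 3, 'T': 4, 'C': 5, 'G': 6}
print(dna["|"] == protein["|"])  # True
```

In this sketch, putting the special tokens first means their ids do not depend on how many letters the alphabet contains.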

## 0.5.1 (2019-01-09)

### Added

- Add extra CSS to underline links in articles.
- Exit if the download directory doesn't exist in the call to download organism.
- Wording improvements in docs.

## 0.5.0 (2018-12-31)

### Added

- Include `seq_tensor_one_hot` in the PyTorch Parser.
- Added a `GCGCRecord.encoded_seq` property.
- New `gcgc.random` module to start holding sequence data.
- New `gcgc.rollout` module to handle working through chunks of sequences.
  - `rollout_kmers` will roll out [kmers][1].
  - `rollout_seq_features` will roll out the `SeqFeatures` from a `SeqRecord`.
- `EncodingAlphabet` now can optionally take a `gap_characters` set of characters to add to the
  alphabet letters. It also takes `add_lower_case_for_inserts` which will duplicate the alphabet,
  but convert the letters to lowercase.
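The two `EncodingAlphabet` options in the last item can be pictured with the sketch below. `extend_letters` is a hypothetical stand-in, and the ordering it produces (original letters, then lowercase copies, then gap characters) is an assumption rather than GCGC's documented behavior.

```python
from typing import Iterable


def extend_letters(
    letters: str,
    gap_characters: Iterable[str] = (),
    add_lower_case_for_inserts: bool = False,
) -> str:
    """Return the alphabet letters extended with lowercase copies and gap characters."""
    extended = letters
    if add_lower_case_for_inserts:
        # Duplicate the alphabet in lowercase, e.g. to mark inserted residues.
        extended += letters.lower()
    # Sort so the result is deterministic even when a set is passed in.
    for char in sorted(gap_characters):
        if char not in extended:
            extended += char
    return extended


print(extend_letters("ATCG", gap_characters={"-", "."}, add_lower_case_for_inserts=True))
# ATCGatcg-.
```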

### Changed

### Fixed

- Fixed bug in `GenomicDataset.from_path` where it still referred to `init_from_path_generator`.

## 0.4.0

### Added

- `EncodedSeq` now supports iterating through kmers; see `EncodedSeq.rollout_kmers` for options.
- GCGC is citable.
- GCGC now has a CHANGELOG.md.

[1]: https://en.wikipedia.org/wiki/K-mer


