Metadata-Version: 1.0
Name: pytexttool
Version: 0.3.1
Summary: Text utilities and datasets for generic deep learning. A fork of torchtext that uses numpy to store datasets for more generic use.
Home-page: https://github.com/iclementine/text
Author: Clementine
Author-email: iclementine@outlook.com
License: BSD
Description-Content-Type: UNKNOWN
Description: 
        pytext
        +++++++++
        
        This repository consists of:
        
        * `pytext.data <#data>`_: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
        * `pytext.datasets <#datasets>`_: Pre-built loaders for common NLP datasets
        
        It is a fork of torchtext that stores datasets as numpy ``ndarray`` objects instead of ``torch.Tensor`` or ``Variable``, which makes it a more generic toolbox for NLP users.
        
        Installation
        ============
        
        
        Make sure you have Python 2.7 or 3.5+ and PyTorch 0.4.0 or newer. You can then install pytexttool using pip::
        
            pip install pytexttool
            
        
        Optional requirements
        ---------------------
        
        If you want to use the English tokenizer from `SpaCy <http://spacy.io/>`_, you need to install SpaCy and download its English model::
        
            pip install spacy
            python -m spacy download en
        
        Alternatively, you might want to use the Moses tokenizer from `NLTK <http://nltk.org/>`_. In that case, install NLTK and download the data it needs::
        
            pip install nltk
            python -m nltk.downloader perluniprops nonbreaking_prefixes
        
        Data
        ====
        
        The data module provides the following:
        
        * Ability to describe declaratively how to load a custom NLP dataset that's in a "normal" format:
        
          .. code-block:: python
        
              >>> pos = data.TabularDataset(
              ...    path='data/pos/pos_wsj_train.tsv', format='tsv',
              ...    fields=[('text', data.Field()),
              ...            ('labels', data.Field())])
              >>>
              >>> sentiment = data.TabularDataset(
              ...    path='data/sentiment/train.json', format='json',
              ...    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
              ...            'sentiment_gold': ('labels', data.Field(sequential=False))})
        
        * Ability to define a preprocessing pipeline:
        
          .. code-block:: python
        
              >>> src = data.Field(tokenize=my_custom_tokenizer)
              >>> trg = data.Field(tokenize=my_custom_tokenizer)
              >>> mt_train = datasets.TranslationDataset(
              ...     path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
              ...     fields=(src, trg))
        
        * Batching, padding, and numericalizing (including building a vocabulary object):
        
          .. code-block:: python
        
              >>> # continuing from above
              >>> mt_dev = datasets.TranslationDataset(
              ...     path='data/mt/newstest2014', exts=('.en', '.de'),
              ...     fields=(src, trg))
              >>> src.build_vocab(mt_train, max_size=80000)
              >>> trg.build_vocab(mt_train, max_size=40000)
              >>> # mt_dev shares the fields, so it shares their vocab objects
              >>>
              >>> train_iter = data.BucketIterator(
              ...     dataset=mt_train, batch_size=32,
              ...     sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
              >>> # usage
              >>> next(iter(train_iter))
              <data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
        
        * Wrapper for dataset splits (train, validation, test):
        
          .. code-block:: python
        
              >>> TEXT = data.Field()
              >>> LABELS = data.Field()
              >>>
              >>> train, val, test = data.TabularDataset.splits(
              ...     path='/data/pos_wsj/pos_wsj', train='_train.tsv',
              ...     validation='_dev.tsv', test='_test.tsv', format='tsv',
              ...     fields=[('text', TEXT), ('labels', LABELS)])
              >>>
              >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(
              ...     (train, val, test), batch_sizes=(16, 256, 256),
              ...     sort_key=lambda x: len(x.text), device=0)
              >>>
              >>> TEXT.build_vocab(train)
              >>> LABELS.build_vocab(train)
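
        As a rough illustration of what the tokenizing, vocabulary-building, numericalizing, and padding steps above do, here is a minimal pure-Python sketch. It does not use the pytext API: ``build_vocab`` and ``numericalize_and_pad`` are hypothetical helpers written for this example, ``my_custom_tokenizer`` stands in for any callable that maps a string to a list of tokens, and the special-token names and their indices are assumptions.

        .. code-block:: python

            from collections import Counter

            # Special tokens; the names and their positions are assumptions.
            UNK, PAD = "<unk>", "<pad>"

            def my_custom_tokenizer(text):
                """Hypothetical tokenizer: lowercase, then split on whitespace."""
                return text.lower().split()

            def build_vocab(token_lists, max_size=None):
                """Map the most frequent tokens to integer ids, specials first."""
                counts = Counter(tok for toks in token_lists for tok in toks)
                # Sort by descending frequency, then alphabetically for determinism.
                ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
                itos = [UNK, PAD] + [tok for tok, _ in ordered[:max_size]]
                return {tok: i for i, tok in enumerate(itos)}

            def numericalize_and_pad(token_lists, stoi):
                """Convert tokens to ids and right-pad each row to the longest row."""
                width = max(len(toks) for toks in token_lists)
                return [
                    [stoi.get(tok, stoi[UNK]) for tok in toks]
                    + [stoi[PAD]] * (width - len(toks))
                    for toks in token_lists
                ]

            sentences = ["The cat sat", "the dog sat down quietly"]
            tokens = [my_custom_tokenizer(s) for s in sentences]
            stoi = build_vocab(tokens, max_size=80000)
            batch = numericalize_and_pad(tokens, stoi)
            print(batch)  # [[3, 4, 2, 1, 1], [3, 5, 2, 6, 7]]

        Every row in ``batch`` has the same length, so the nested list can be converted with ``numpy.asarray(batch)`` when an ``ndarray`` is needed.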
        
        Datasets
        ========
        
        The datasets module currently contains:
        
        * Sentiment analysis: SST and IMDb
        * Question classification: TREC
        * Entailment: SNLI
        * Language modeling: abstract class + WikiText-2
        * Machine translation: abstract class + Multi30k, IWSLT, WMT14
        * Sequence tagging (e.g. POS/NER): abstract class + UDPOS
        
        Others are planned or a work in progress:
        
        * Question answering: SQuAD
        
        See the ``test`` directory for examples of dataset usage.
        
Platform: UNKNOWN
