Metadata-Version: 2.1
Name: textaugment
Version: 1.0
Summary: A library for augmenting text for natural language processing applications.
Home-page: https://github.com/dsfsi/textaugment
Author: Joseph Sefara
Author-email: sefaratj@gmail.com
License: MIT
Description: # TextAugment: Improving short text classification through global augmentation methods
        
        TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), and [TextBlob](https://textblob.readthedocs.io/) and plays nicely with them.
        
        ## Citation Paper
        
        **Improving short text classification through global augmentation methods**, published in MLDM.
        
        ![alt text](https://raw.githubusercontent.com/dsfsi/textaugment/master/augment.png "Augmentation methods")
        
        ### Requirements
        
        * Python 3
        
        The following software packages are dependencies and will be installed automatically.
        
        ```sh
        $ pip install numpy nltk gensim textblob googletrans
        ```
        The following code downloads the NLTK [wordnet](http://www.nltk.org/howto/wordnet.html) corpus.
        ```python
        import nltk
        nltk.download('wordnet')
        ```
        The following code downloads the [NLTK Punkt tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) models. This tokenizer divides text into a list of sentences using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.
        ```python
        nltk.download('punkt')
        ```
        The following code downloads the default [NLTK part-of-speech tagger](https://www.nltk.org/_modules/nltk/tag.html) model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.
        ```python
        nltk.download('averaged_perceptron_tagger')
        ```
        Use gensim to load a pre-trained word2vec model, such as [Google News vectors from Google Drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).
        ```python
        import gensim
        model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
        ```
        Or train one from scratch using your own data or one of the following public datasets:
        
        - [Text8 Wiki](http://mattmahoney.net/dc/text8.zip)
        
        - [Dataset from "One Billion Word Language Modeling Benchmark"](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)
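        The train-from-scratch option can be sketched as follows. This is a generic gensim example, not code from this library; it assumes gensim 4.x, where the dimensionality parameter is `vector_size` (older releases call it `size`). In practice you would stream sentences from Text8 or your own corpus instead of the toy list shown here.
        ```python
        from gensim.models import Word2Vec

        # A toy corpus: in practice, stream tokenized sentences from your data.
        sentences = [
            ["the", "stories", "are", "good"],
            ["the", "films", "are", "good"],
            ["john", "is", "going", "to", "town"],
        ]

        # Train a small word2vec model (gensim 4.x API).
        model = Word2Vec(sentences=sentences, vector_size=50, window=3,
                         min_count=1, workers=1, seed=1)

        # Save in the word2vec binary format so it can be reloaded later
        # with KeyedVectors.load_word2vec_format, as shown above.
        model.wv.save_word2vec_format("my_model.bin", binary=True)

        print(model.wv["good"].shape)  # each word now has a 50-dimensional vector
        ```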
        
        ### Installation
        
        Install from pip [recommended]:
        ```sh
        $ pip install textaugment
        ```
        Or install the latest version from GitHub:
        ```sh
        $ pip install git+git@github.com:dsfsi/textaugment.git
        ```
        
        Install from source
        ```sh
        $ git clone git@github.com:dsfsi/textaugment.git
        $ cd textaugment
        $ python setup.py install
        ```
        
        ### How to use
        
        There are three types of augmentation that can be used:
        
        - word2vec 
        
        ```python
        from textaugment import Word2vec
        ```
        
        - wordnet 
        ```python
        from textaugment import Wordnet
        ```
        - translate (This will require internet access)
        ```python
        from textaugment import Translate
        ```
        #### Word2vec-based augmentation
        **Basic example**
        ```python
        >>> from textaugment import Word2vec
        >>> t = Word2vec(model='path/to/gensim/model')  # or pass an already loaded gensim model object
        >>> t.augment('The stories are good')
        The films are good
        ```
        **Advanced example**
        
        ```python
        >>> runs = 1 # Number of augmentation passes. Default is 1.
        >>> v = False # Verbose mode replaces all eligible words; when enabled, runs has no effect. Used in this paper: https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf
        >>> p = 0.5 # Probability of success of an individual trial (0.1 < p < 1.0), default 0.5. Used by the geometric distribution to select words from a sentence.
        
        >>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # or pass a loaded gensim model
        >>> t.augment('The stories are good')
        The movies are excellent
        ```
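        To illustrate the role of `p`, here is a hypothetical, stdlib-only sketch of geometric-distribution word selection. It mirrors the parameter description above but is not textaugment's internal code: the number of words to replace is drawn from a geometric distribution with success probability `p`, then that many positions are sampled from the sentence.
        ```python
        import random

        def pick_words_to_replace(words, p=0.5, rng=random):
            """Pick word positions to replace: draw a count from a geometric
            distribution with success probability p, then sample that many
            distinct positions from the sentence. Illustrative only."""
            # Geometric draw: count failures before the first success.
            n = 0
            while rng.random() > p:
                n += 1
            n = max(1, min(n, len(words)))  # at least one word, at most all
            return sorted(rng.sample(range(len(words)), n))

        random.seed(42)
        positions = pick_words_to_replace("The stories are good".split(), p=0.5)
        print(positions)  # indices of the words chosen for replacement
        ```
        A larger `p` makes short draws more likely, so fewer words get replaced per run; a smaller `p` replaces more words.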
        #### WordNet-based augmentation
        **Basic example**
        ```python
        >>> import nltk
        >>> nltk.download('punkt')
        >>> nltk.download('wordnet')
        >>> from textaugment import Wordnet
        >>> t = Wordnet()
        >>> t.augment('In the afternoon, John is going to town')
        In the afternoon, John is walking to town
        ```
        **Advanced example**
        
        ```python
        >>> v = True # Enable verb augmentation. Default is True.
        >>> n = False # Enable noun augmentation. Default is False.
        >>> runs = 1 # Number of times to augment a sentence. Default is 1.
        >>> p = 0.5 # Probability of success of an individual trial (0.1 < p < 1.0), default 0.5. Used by the geometric distribution to select words from a sentence.
        
        >>> t = Wordnet(v=False, n=True, p=0.5)
        >>> t.augment('In the afternoon, John is going to town')
        In the afternoon, Joseph is going to town.
        ```
        #### RTT-based (round-trip translation) augmentation
        **Example**
        ```python
        >>> src = "en" # source language of the sentence
        >>> to = "fr" # target language
        >>> from textaugment import Translate
        >>> t = Translate(src="en", to="fr")
        >>> t.augment('In the afternoon, John is going to town')
        In the afternoon John goes to town
        ```
        ## Built with ❤ on
        * [Python](http://python.org/)
        
        ## Authors
        * [Joseph Sefara](https://za.linkedin.com/in/josephsefara) (http://www.speechtech.co.za)
        * [Vukosi Marivate](http://www.vima.co.za)
        
        ## Acknowledgements
        Cite the paper *Improving short text classification through global augmentation methods* when using this library.
        
        ## Licence
        MIT licensed. See the bundled [LICENCE](LICENCE) file for more details.
Keywords: text augmentation,python,natural language processing,nlp
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/markdown
