Metadata-Version: 2.1
Name: wildgram
Version: 0.0.2
Summary: wildgram tokenizes and separates tokens into ngrams of varying size based on the natural language breaks in the text.
Home-page: https://gitlab.com/gracekatherineturner/wildgram
Author: Grace Turner
Author-email: gracekatherineturner@gmail.com
License: UNKNOWN
Description: wildgram tokenizes English text into "wild"-grams (tokens of varying word count)
        that closely match the natural pauses of conversation. I originally built
        it as the first step in an abstraction pipeline for medical language: since
        medical concepts tend to be phrases of varying lengths, bag-of-words or bigrams
        don't really cut it.
        
        wildgram works by measuring the size of the noise (stopwords, punctuation, and
        whitespace) and breaking up the text at noise of a certain size
        (the threshold varies slightly depending on the kind of noise).
        Some examples:
        "rats, bats, and vats" -> ["rats","bats", "vats"]
        "I dreamed a dream in time gone by" -> ["i dreamed","dream", "time gone"]
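        
        To make the idea concrete, here is a minimal sketch of noise-based splitting.
        It is not wildgram's actual implementation: it assumes a tiny, hypothetical stop
        word list and breaks on every stop word or punctuation mark, rather than measuring
        the size of the noise. The names `STOPWORDS` and `sketch_wildgram` are made up
        for this illustration.
        
        ```python
        import re
        
        # Hypothetical, heavily abbreviated stop word list (an assumption for this sketch).
        # "of" is left out on purpose so that phrases like "shortness of breath" stay whole.
        STOPWORDS = {"a", "and", "by", "in", "the", "was"}
        
        def sketch_wildgram(text):
            """Split text into phrases, breaking at stop words and punctuation."""
            ranges = []
            start = end = None  # span of the phrase currently being built
            for match in re.finditer(r"(?P<word>\w+)|(?P<punct>[^\w\s])", text):
                is_noise = match.lastgroup == "punct" or match.group().lower() in STOPWORDS
                if is_noise:
                    # Noise closes the current phrase, if there is one.
                    if start is not None:
                        ranges.append((start, end))
                        start = None
                else:
                    # A content word starts a new phrase or extends the current one.
                    if start is None:
                        start = match.start()
                    end = match.end()
            if start is not None:
                ranges.append((start, end))
            tokens = [text[s:e].lower() for s, e in ranges]
            return tokens, ranges
        
        print(sketch_wildgram("rats, bats, and vats"))
        # (['rats', 'bats', 'vats'], [(0, 4), (6, 10), (16, 20)])
        ```
        
        In this sketch a single space between two content words never causes a break,
        which is why "I dreamed" and "time gone" survive as two-word phrases.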
        
        Because this was originally built for medical abstraction, the stop words include
        words like "denied", "describe", and "patient", which tend to signify a change
        in topic in medical notes. Future work will create a set of change-of-topic words so that
        these words show up in the output, but by themselves instead of as part of a
        larger token.
        e.g. currently it does this:
        "patient denies consuming alcohol" -> ["patient", "consuming alcohol"]
        and eventually it will do this:
        "patient denies consuming alcohol" -> ["patient", "denies", "consuming alcohol"]
        For now, buyer beware.
        
        Also note that it does not split each wildgram into its individual words, i.e. it does not return this:
        "I dreamed a dream in time gone by" -> [("i","dreamed"),("dream"), ("time","gone")]
        
        Final note: I do not include "of" in the stop word list, because there are quite a few
        medical concepts that have "of" in the middle (e.g. "shortness of breath").
        
        
        Example code:
        
        ```python
        from wildgram import wildgram
        tokens, ranges = wildgram("and was beautiful")
        # tokens: the wildgram tokens
        # ranges: a list of tuples; the ith tuple has the start and end indexes of the ith wildgram
        print(tokens, ranges)  # ["beautiful"], [(8, 17)]
        ```
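        
        Judging by the `(8, 17)` range for "beautiful" above, the ranges appear to be
        start-inclusive and end-exclusive, so they can be used directly as Python slices.
        The sketch below relies on that assumption; `slice_ranges` is a hypothetical
        helper, not part of wildgram.
        
        ```python
        def slice_ranges(text, ranges):
            """Return the substring of the original text for each (start, end) range."""
            return [text[start:end] for start, end in ranges]
        
        text = "and was beautiful"
        ranges = [(8, 17)]  # copied from the example output above
        print(slice_ranges(text, ranges))  # ['beautiful']
        ```
        
        Slicing the original text this way keeps its original casing, which is handy
        because the returned tokens are lowercased in the examples above.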
        That's all folks!
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
