Metadata-Version: 1.2
Name: pybo
Version: 0.1.3
Summary: Python utils for processing Tibetan
Home-page: https://github.com/Esukhia/pybo
Author: Esukhia development team
Author-email: esukhiadev@gmail.com
License: Apache2
Description: pybo
        ====
        
        Pybo is a python tokenizer for Tibetan built as a tokenizer plugin for
        spaCy and the Tibetan editor. It takes in a string of raw tibetan text
        and spits out a list of Token objects.
        
        Goals of pybo
        -------------
        
        Using pybo, one should be able to:
        
        1. pre-process any Tibetan string for tokenization
        2. tokenize any pre-processed string into a sequence of word/non-word
           tokens
        3. apply matchers over the tokenized text (list of Token objects) to
           modify the segmentation
        4. transform the list of Token objects to and from a spaCy Doc object.
        
        1. Tibetan String Pre-processing
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        Status: Done.
        
        Strategy: - BoString attributes a type to every char in the input
        string. - BoChunk creates chunks of similar chars. (subclass of
        BoString) - PyBoChunk creates meaningful chunks for Tibetan language:
        syl / (other)bo / punct / non-bo. (subclass of BoChunk) - PyBoTextChunks
        provides cleaned content for syllable chunks (no punct no space).
        (subclass of PyBoChunk)
        
        2. Tokenization of pre-processed string
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        Status: Implemented. Cleanup to be done.
        
        Strategy:
        
        -  SylComponents gives morphologic information about a Tibetan syllable.
        -  BoSyl (uses SylComponents):
        
           -  BoSyl.is\_affixable(): tells whether a given syllable can be
              affixed or not
           -  BoSyl.get\_all\_affixed(): for a given syllable, gives all affixed
              variants. for each variant, gives: 1. the final form, 2. the
              particle used, 3. True if its non-affixed version ends with འ,
              False otherwise
        
        -  Trie + Node: Object Oriented Trie implementation
        -  PyBoTrie (subclass of Trie, uses BoSyl):
        
           -  builds a trie from a lexicon
           -  adds affixed particle and POS information in the trie
           -  allows to dynamically add / deactivate entries in a trie
           -  walks an existing trie to find the longest possible match
        
        -  Tokenizer (uses PyBoTrie and Token):
        
           -  input: pre-processed syllables from PyBoTextChunks
           -  parses sequences of clean syllables to find the longest word
              inside the loaded trie
           -  builds a word token from a sequence of syllable chunks.
           -  builds a Token object from individual chunks (non-bo, punct) and
              from a word token.
        
        -  SplitAffixed splits Token objects that end with an affixed particle
           into 1. the token, 2. a token for the affixed particle
        
        Todo:
        
        -  rename Token object into Token
        -  remove tibetaneditor specific attributes and properties + find a way
           of reimplementing them within tibetaneditor
        -  implement SplitAffixed:
        
           -  check in Token.tag if there is an affixed particle
           -  use the given particle type to reconstruct the lemma
           -  use the given particle length to know where Token.content should
              be split
           -  add a final འ to the lemma of the host Token if needed
        
        3. Applying Matchers
        ~~~~~~~~~~~~~~~~~~~~
        
        Status: Implementation ongoing.
        
        Strategy:
        
        -  Matcher finds sequences of Token objects that match a given input CQL
           query
        -  Splitter takes a Token object and splits it in two.
        
           -  input: the index in Token.content where to split
        
        -  Merger takes two consecutive Token objects to create a merged Token
           object. (the metadata attributes from one token are discarded, the
           content attributes are concatenated)
        -  BoMatcher (uses Matcher and either Splitter or Merger):
        
           -  input: a list of Token object, a matcher, a Splitter / a Merger
           -  loops over a list of Token objects (output of Tokenizer)
        
              -  checks whether the sublist[current index: current index +
                 len(matcher)] satisfies Matcher
              -  applies either Splitter or Merger if necessary
        
           -  output: the modified list of Token objects
        
        Todo:
        
        -  replace the basic query parser by ``third-party/cql.py``
        -  move the matching logic from BoMatcher to Matcher
        -  implement Splitter
        -  implement Merger
        -  implement BoMatcher
        
        4. To and From spaCy
        ~~~~~~~~~~~~~~~~~~~~
        
        Status: To do.
        
        Strategy:
        
        -  use the spaCy api to make the conversion
        
        Licence
        -------
        
        The code is Copyright 2018 Esukhia, and is provided under `Apache
        Licence 2.0 <LICENSE>`__
        
Keywords: nlp computational_linguistics search ngrams language_models linguistics toolkit tibetan
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: Tibetan
Requires-Python: >=3
