Metadata-Version: 1.2
Name: pybo
Version: 0.2.0
Summary: Python utils for processing Tibetan
Home-page: https://github.com/Esukhia/pybo
Author: Esukhia development team
Author-email: esukhiadev@gmail.com
License: Apache2
Description: <img src=https://raw.githubusercontent.com/mikkokotila/pybo/master/pybo_logo.png width=200>
        
        [![Build Status](https://travis-ci.org/Esukhia/pybo.svg?branch=master)](https://travis-ci.org/Esukhia/pybo)  [![Coverage Status](https://coveralls.io/repos/github/Esukhia/pybo/badge.svg?branch=master)](https://coveralls.io/github/Esukhia/pybo?branch=master)
        
        ## Overview
        
        pybo is a word tokenizer for the Tibetan language, written entirely in Python. It takes in chunks of text and returns lists of words. It provides an easy-to-use, high-performance tokenization pipeline that can be used either as a stand-alone solution or as a complement to other tools.
        
        ## Getting Started 
        
            pip install pybo
            
        Or, to install the latest version from the master branch:
        
            pip install git+https://github.com/Esukhia/pybo.git
        
        ## Use 
        
        #### To initialize the tokenizer with part-of-speech tagging:
        
            # import the package under the alias used below
            import pybo as bo
            
            # initialize the tokenizer
            tok = bo.BoTokenizer('POS')
            
            # read in some Tibetan text
            input_str = '༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །'
            
            # run the tokenizer
            tokens = tok.tokenize(input_str)
            
        #### Now 'tokens' is an iterable in which each token carries several pieces of metadata:
        
            # access the first token in the iterable
            tokens[0]
        
        This will yield:
        
            content: "༄༅། "
            char types: |punct|punct|punct|space|
            type: punct
            start in input: 0
            length: 4
            syl chars in content: None
            tag: punct
            POS: punct
            skr: "False"
            freq: 
            
        #### In case you want to access all words in a list: 
        
            # iterate through the tokens object to get all the words in a list
            [t.content for t in tokens]
        
        #### Or to get only the nouns used in the text:
        
            # extract nouns from the tokens
            [t.content for t in tokens if t.tag == 'NOUNᛃᛃᛃ']
            
        These examples highlight the basic principle of accessing attributes within each token object. 
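        
        The same attribute-access pattern extends to other simple analyses, such as counting tokens per tag or dropping punctuation. The sketch below does not call pybo: `Token` is a hypothetical stand-in with only the `content` and `tag` attributes shown above, and the sample tagging is made up for illustration, so that the snippet runs on its own.
        
        ```python
        from collections import Counter, namedtuple
        
        # Hypothetical stand-in for a pybo token; real tokens come from tok.tokenize().
        Token = namedtuple("Token", ["content", "tag"])
        
        # Illustrative sample data, not actual tokenizer output.
        tokens = [
            Token("༄༅། ", "punct"),
            Token("བོད་སྐད་", "NOUNᛃᛃᛃ"),
            Token("སངས་རྒྱས་", "NOUNᛃᛃᛃ"),
            Token("། ", "punct"),
        ]
        
        # count how many tokens carry each tag
        tag_counts = Counter(t.tag for t in tokens)
        
        # keep every token that is not punctuation
        words = [t.content for t in tokens if t.tag != "punct"]
        ```
        
        With real tokenizer output, the list of stand-in tokens would simply be replaced by the result of `tok.tokenize(input_str)`.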
        
        
        
        
Keywords: nlp computational_linguistics search ngrams language_models linguistics toolkit tibetan
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: Tibetan
Requires-Python: >=3
