Metadata-Version: 1.1
Name: jamotools
Version: 0.1.8
Summary: A library for splitting Korean text into Jamo and vectorizing it.
Home-page: https://github.com/HaebinShin/jamotools
Author: Haebin Shin
Author-email: sunsal0704@gmail.com
License: GPL
Description: Jamotools
        =========
        
        |Build Status| |GitHub Tag| |PyPI version| |Python version| |License|
        
        A library for splitting Korean text into Jamo and vectorizing it.
        
        Install
        -------
        
        .. code:: sh
        
            pip install jamotools
        
        Unicode of Korean
        -----------------
        
        According to version 9.0.0 of the Unicode Consortium's database,
        *Hangul* (Korean) characters are assigned to the following code point
        ranges.
        
        -  Hangul Jamo: 1100 ~ 11FF
        -  WON SIGN in Currency Symbols: 20A9
        -  HANGUL DOT TONE MARK in CJK Symbols and Punctuation: 302E ~ 302F
        -  Hangul Compatibility Jamo : 3130 ~ 318F
        -  Hangul in Enclosed CJK Letters and Months: 3200 ~ 321E, 3260 ~ 327F
        -  Hangul Jamo Extended-A : A960 ~ A97F
        -  Hangul Syllables : AC00 ~ D7AF
        -  Hangul Jamo Extended-B : D7B0 ~ D7FF
        -  Halfwidth Hangul variants in Halfwidth and Fullwidth Forms: FFA0 ~
           FFDC
        -  FULLWIDTH WON SIGN in Halfwidth and Fullwidth Forms: FFE6
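        
        The ranges above can be turned into a simple lookup. The following is a
        minimal sketch written for Python 3; ``HANGUL_RANGES`` and
        ``hangul_block`` are hypothetical names used only for illustration and
        are not part of the Jamotools API.
        
        .. code:: py
        
            # Inclusive code point ranges copied from the list above.
            HANGUL_RANGES = [
                ("Hangul Jamo", 0x1100, 0x11FF),
                ("WON SIGN", 0x20A9, 0x20A9),
                ("HANGUL DOT TONE MARK", 0x302E, 0x302F),
                ("Hangul Compatibility Jamo", 0x3130, 0x318F),
                ("Enclosed CJK Letters and Months", 0x3200, 0x321E),
                ("Enclosed CJK Letters and Months", 0x3260, 0x327F),
                ("Hangul Jamo Extended-A", 0xA960, 0xA97F),
                ("Hangul Syllables", 0xAC00, 0xD7AF),
                ("Hangul Jamo Extended-B", 0xD7B0, 0xD7FF),
                ("Halfwidth Hangul variants", 0xFFA0, 0xFFDC),
                ("FULLWIDTH WON SIGN", 0xFFE6, 0xFFE6),
            ]
        
            def hangul_block(char):
                """Return the name of the listed range containing char, or None."""
                code_point = ord(char)
                for name, first, last in HANGUL_RANGES:
                    if first <= code_point <= last:
                        return name
                return None
        
            print(hangul_block(u"한"))  # Hangul Syllables
            print(hangul_block(u"ㅎ"))  # Hangul Compatibility Jamo
            print(hangul_block(u"a"))   # None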
        
        Jamo
        ~~~~
        
        Hangul is made of basic letters called *Jamo*. Unicode defines several
        kinds of Jamo, some of which are old Hangul letters that are no longer
        in use. Jamotools supports only the modern Hangul Jamo areas listed
        below.
        
        -  `Hangul Jamo <http://unicode.org/charts/PDF/U1100.pdf>`__: Consists
           of Choseong (initial consonants), Jungseong (vowels), and Jongseong
           (final consonants). The block contains both modern Hangul and old
           Hangul that is no longer in use. Jamotools supports the modern
           Hangul Jamo area.
        
           -  1100 ~ 1112 (Choseong)
           -  1161 ~ 1175 (Jungseong)
           -  11A8 ~ 11C2 (Jongseong)
        
        -  `Hangul Compatibility
           Jamo <http://unicode.org/charts/PDF/U3130.pdf>`__: The Jamo area
           that is compatible with the Korean character standard (KS X 1001).
           It is not divided into Choseong, Jungseong, and Jongseong.
        
           -  3131 ~ 3163 (modern Hangul Jamo area)
        
        -  `Halfwidth Hangul
           variants <http://unicode.org/charts/PDF/UFF00.pdf>`__: The
           half-width Korean Jamo area. It contains only modern Korean Jamo.
           Korean text is normally written with full-width characters.
        
           -  FFA1 ~ FFDC
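        
        As a rough illustration of the modern Jamo areas listed above, the
        sketch below classifies a character by range and counts the letters in
        each conjoining range. ``MODERN_JAMO_RANGES`` and ``modern_jamo_kind``
        are hypothetical names, not Jamotools functions. The check is by range
        only, so it is an approximation and does not confirm that every code
        point in a range is an assigned character.
        
        .. code:: py
        
            # Inclusive modern Jamo ranges quoted in the list above.
            MODERN_JAMO_RANGES = {
                "choseong": (0x1100, 0x1112),
                "jungseong": (0x1161, 0x1175),
                "jongseong": (0x11A8, 0x11C2),
                "compat": (0x3131, 0x3163),
                "halfwidth": (0xFFA1, 0xFFDC),
            }
        
            def modern_jamo_kind(char):
                """Return the key of the range containing char, or None."""
                code_point = ord(char)
                for kind, (first, last) in MODERN_JAMO_RANGES.items():
                    if first <= code_point <= last:
                        return kind
                return None
        
            # Number of code points in each conjoining range.
            sizes = [last - first + 1
                     for first, last in (MODERN_JAMO_RANGES[k]
                                         for k in ("choseong", "jungseong", "jongseong"))]
            print(sizes)                         # [19, 21, 27]
            print(modern_jamo_kind(u"\u1100"))   # choseong
            print(modern_jamo_kind(u"ㄱ"))        # compat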
        
        Manipulating Korean Jamo
        ------------------------
        
        The API for splitting syllables into jamos and joining jamos into
        syllables is based on
        `hangul-utils <https://github.com/kaniblu/hangul-utils/blob/master/README.md#manipulating-korean-characters>`__.
        
        -  ``split_syllables``: Converts a string of syllables to a string of
           jamos. The Unicode area of the output jamos can be selected with
           ``jamo_type``.
        -  ``join_jamos``: Converts a string of jamos to a string of syllables.
        -  ``normalize_to_compat_jamo``: Normalizes a string of jamos to a string
           of *Hangul Compatibility Jamo*.
        
        .. code:: py
        
            >>> import jamotools
            >>> print(jamotools.split_syllable_char(u"안"))
            ('ㅇ', 'ㅏ', 'ㄴ')
        
            >>> print(jamotools.split_syllables(u"안녕하세요"))
            ㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛ
        
            >>> sentence = (u"앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 "
            ...             u"깨죽은 검은 깨 깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라.")
            >>> s = jamotools.split_syllables(sentence)
            >>> print(s)
            ㅇㅏㅍ ㅈㅣㅂ ㅍㅏㅌㅈㅜㄱㅇㅡㄴ ㅂㅜㄺㅇㅡㄴ ㅍㅏㅌ ㅍㅜㅅㅍㅏㅌㅈㅜㄱㅇㅣㄱㅗ,
            ㄷㅟㅅㅈㅣㅂ ㅋㅗㅇㅈㅜㄱㅇㅡㄴ ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ.ㅇㅜㄹㅣ
            ㅈㅣㅂ ㄲㅐㅈㅜㄱㅇㅡㄴ ㄱㅓㅁㅇㅡㄴ ㄲㅐ ㄲㅐㅈㅜㄱㅇㅣㄴㄷㅔ ㅅㅏㄹㅏㅁㄷㅡㄹㅇㅡㄴ
            ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ ㄲㅐㅈㅜㄱ ㅈㅜㄱㅁㅓㄱㄱㅣㄹㅡㄹ
            ㅅㅣㅀㅇㅓㅎㅏㄷㅓㄹㅏ.
        
            >>> sentence2 = jamotools.join_jamos(s)
            >>> print(sentence2)
            앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 깨죽은 검은 깨
            깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라.
        
            >>> print(sentence == sentence2)
            True
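        
        For readers curious how a precomposed syllable relates to its jamos, the
        sketch below uses the standard Unicode arithmetic for the Hangul
        Syllables block. It is not necessarily how Jamotools implements
        ``split_syllables`` or ``join_jamos``. The names ``decompose`` and
        ``compose`` are hypothetical, the code assumes Python 3, and it produces
        conjoining *Hangul Jamo* (1100 block) rather than the compatibility
        jamos printed above.
        
        .. code:: py
        
            S_BASE = 0xAC00   # first precomposed syllable
            L_BASE = 0x1100   # first Choseong
            V_BASE = 0x1161   # first Jungseong
            T_BASE = 0x11A7   # one before the first Jongseong (index 0 = no final)
            V_COUNT, T_COUNT = 21, 28
            N_COUNT = V_COUNT * T_COUNT   # 588 syllables per Choseong
            S_COUNT = 19 * N_COUNT        # 11172 modern syllables
        
            def decompose(syllable):
                """Split one precomposed syllable into a tuple of conjoining jamos."""
                index = ord(syllable) - S_BASE
                if not 0 <= index < S_COUNT:
                    return (syllable,)  # not a Hangul syllable: return unchanged
                lead = chr(L_BASE + index // N_COUNT)
                vowel = chr(V_BASE + (index % N_COUNT) // T_COUNT)
                tail_index = index % T_COUNT
                if tail_index == 0:
                    return (lead, vowel)
                return (lead, vowel, chr(T_BASE + tail_index))
        
            def compose(lead, vowel, tail=None):
                """Join conjoining jamos back into one precomposed syllable."""
                l_index = ord(lead) - L_BASE
                v_index = ord(vowel) - V_BASE
                t_index = ord(tail) - T_BASE if tail else 0
                return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)
        
            parts = decompose(u"안")
            print([hex(ord(c)) for c in parts])   # ['0x110b', '0x1161', '0x11ab']
            print(compose(*parts) == u"안")        # True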
        
        The Jamotools API can work with several Unicode areas of Hangul Jamo.
        It also provides additional functions for manipulating Korean jamo.
        
        .. code:: py
        
            >>> sentence = u"자모"
        
            >>> jamos1 = jamotools.split_syllables(sentence, jamo_type="JAMO")
            >>> print([hex(ord(c)) for c in jamos1])
            ['0x110c', '0x1161', '0x1106', '0x1169']
            >>> sentence1 = jamotools.join_jamos(jamos1)
            >>> print(sentence1)
            자모
        
            >>> jamos2 = jamotools.split_syllables(sentence, jamo_type="COMPAT")
            >>> print([hex(ord(c)) for c in jamos2])
            ['0x3148', '0x314f', '0x3141', '0x3157']
            >>> sentence2 = jamotools.join_jamos(jamos2)
            >>> print(sentence2)
            자모
        
            >>> jamos3 = jamotools.split_syllables(sentence, jamo_type="HALFWIDTH")
            >>> print([hex(ord(c)) for c in jamos3])
            ['0xffb8', '0xffc2', '0xffb1', '0xffcc']
            >>> sentence3 = jamotools.join_jamos(jamos3)
            >>> print(sentence3)
            자모
        
            >>> print(sentence == sentence1 == sentence2 == sentence3)
            True
        
            >>> normalize1 = jamotools.normalize_to_compat_jamo(jamos1)
            >>> normalize2 = jamotools.normalize_to_compat_jamo(jamos2)
            >>> normalize3 = jamotools.normalize_to_compat_jamo(jamos3)
            >>> print(jamos1 == jamos2 == jamos3)
            False
            >>> print(normalize1 == normalize2 == normalize3)
            True
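        
        The idea behind normalization can be sketched with the standard
        ``unicodedata`` module: conjoining, half-width, and compatibility jamos
        share the final word of their Unicode character names, so a character
        can be mapped to its ``HANGUL LETTER ...`` counterpart by name. This is
        an illustrative approach only and is not taken from the Jamotools
        source; ``to_compat_jamo`` is a hypothetical name.
        
        .. code:: py
        
            import unicodedata
        
            # Name prefixes of the jamo areas that have compatibility counterparts.
            JAMO_PREFIXES = (
                "HANGUL CHOSEONG ",
                "HANGUL JUNGSEONG ",
                "HANGUL JONGSEONG ",
                "HALFWIDTH HANGUL LETTER ",
            )
        
            def to_compat_jamo(text):
                """Map conjoining and half-width jamos to compatibility jamos."""
                result = []
                for char in text:
                    name = unicodedata.name(char, "")
                    if name.startswith(JAMO_PREFIXES):
                        letter = name.rsplit(" ", 1)[-1]  # e.g. "CIEUC"
                        try:
                            char = unicodedata.lookup("HANGUL LETTER " + letter)
                        except KeyError:
                            pass  # no compatibility counterpart: keep as-is
                    result.append(char)
                return "".join(result)
        
            conjoining = u"\u110c\u1161\u1106\u1169"   # JAMO form of "자모"
            halfwidth = u"\uffb8\uffc2\uffb1\uffcc"    # HALFWIDTH form of "자모"
            print(to_compat_jamo(conjoining) == to_compat_jamo(halfwidth))   # True
            print([hex(ord(c)) for c in to_compat_jamo(conjoining)])
            # ['0x3148', '0x314f', '0x3141', '0x3157']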
        
        Vectorize Korean Jamo
        ---------------------
        
        Jamotools provides a vectorize function that follows a RULE. Each RULE
        defines how a sentence is split into Jamo and which types of symbols
        are converted. It can be used for character-level Korean text
        processing.
        
        -  ``Vectorizationer``: Class that vectorizes text according to a RULE
           and pads the result.
        
        .. code:: py
        
            >>> v = jamotools.Vectorizationer(rule=jamotools.rules.RULE_1, \
                                              max_length=None, \
                                              prefix_padding_size=0)
            >>> print(v.vectorize(u"안녕"))
            [13, 21, 45,  4, 27, 62]
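        
        To show the general idea of jamo-level vectorization with padding, here
        is a minimal sketch that does not use Jamotools. The vocabulary below is
        hypothetical (the 51 modern compatibility jamos numbered from 1) and is
        **not** the mapping defined by ``RULE_1``, so the indices differ from
        the output above. ``VOCAB``, ``PAD``, ``UNKNOWN`` and this ``vectorize``
        function are illustrative names only, and the code assumes Python 3.
        
        .. code:: py
        
            # Hypothetical vocabulary, NOT the RULE_1 mapping:
            # compatibility jamos 3131 ~ 3163 are numbered 1..51.
            VOCAB = {chr(cp): i for i, cp in enumerate(range(0x3131, 0x3164), start=1)}
            PAD = 0                    # index reserved for padding
            UNKNOWN = len(VOCAB) + 1   # index for symbols outside the vocabulary
        
            def vectorize(jamos, max_length=None):
                """Convert an already-split jamo string to a list of indices."""
                ids = [VOCAB.get(jamo, UNKNOWN) for jamo in jamos]
                if max_length is not None:
                    # Truncate, then pad on the right up to max_length.
                    ids = ids[:max_length]
                    ids = ids + [PAD] * (max_length - len(ids))
                return ids
        
            print(vectorize(u"ㅇㅏㄴ"))                 # [23, 31, 4]
            print(vectorize(u"ㅇㅏㄴ", max_length=5))   # [23, 31, 4, 0, 0]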
        
        Custom RULE
        ~~~~~~~~~~~
        
        A custom RULE class can be added to Jamotools with the following steps.
        
        1. Create a custom RULE class (e.g. Rule2) that inherits from RuleBase in
           `rules.py <https://github.com/HaebinShin/jamotools/blob/master/jamotools/vector/rules.py>`__,
           following the example of Rule1.
        2. Add a constant for the custom RULE, like
           `RULE\_1 <https://github.com/HaebinShin/jamotools/blob/master/jamotools/vector/rules.py#L39-L41>`__.
        3. Modify the
           `get\_rule <https://github.com/HaebinShin/jamotools/blob/master/jamotools/vector/rules.py#L53-L54>`__
           function to return the custom RULE class.
        
        The custom RULE can then be used in the same way as RULE\_1.
        
        .. code:: py
        
            >>> v = jamotools.Vectorizationer(rule=jamotools.rules.RULE_2, \
                                              max_length=None, \
                                              prefix_padding_size=0)
        
        .. |Build Status| image:: https://travis-ci.org/HaebinShin/jamotools.svg?branch=master
           :target: https://travis-ci.org/HaebinShin/jamotools
        .. |GitHub Tag| image:: https://img.shields.io/github/tag/HaebinShin/jamotools.svg?label=github+tag
           :target: https://github.com/HaebinShin/jamotools/tags
        .. |PyPI version| image:: https://img.shields.io/pypi/v/jamotools.svg
           :target: https://pypi.python.org/pypi/jamotools/
        .. |Python version| image:: https://img.shields.io/pypi/pyversions/jamotools.svg
           :target: https://pypi.python.org/pypi/jamotools/
        .. |License| image:: https://img.shields.io/pypi/l/jamotools.svg
           :target: https://github.com/HaebinShin/jamotools/blob/master/LICENSE
        
Platform: UNKNOWN
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Natural Language :: Korean
