Metadata-Version: 2.0
Name: ipatok
Version: 0.1.1
Summary: IPA tokeniser
Home-page: https://github.com/pavelsof/ipatok
Author: Pavel Sofroniev
Author-email: pavelsof@gmail.com
License: MIT
Keywords: IPA tokeniser tokenizer
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Text Processing :: Linguistic

======
ipatok
======

A simple IPA tokeniser. Using it is as simple as:

>>> from ipatok import tokenise
>>> tokenise('ˈtiːt͡ʃə')
['t', 'iː', 't͡ʃ', 'ə']
>>> tokenise('ʃːjeq͡χːʼjer')
['ʃː', 'j', 'e', 'q͡χːʼ', 'j', 'e', 'r']


api
===

``tokenise(string, strict=False, replace=False, diphtongs=False, merge=None)``
takes an IPA string and returns a list of tokens. A token usually consists of a
single letter together with its accompanying diacritics. If two letters are
connected by a tie bar, they are also considered a single token. Except for
length markers, suprasegmentals are excluded from the output. Whitespace is
also ignored. The function accepts the following keyword arguments:

- ``strict``: if set to ``True``, the function checks that ``string`` complies
  with the IPA spec (`the 2015 revision`_); a ``ValueError`` is raised if it does
  not. If set to ``False`` (the default), the role of non-IPA characters is
  guessed based on their Unicode category.
- ``replace``: if set to ``True``, the function replaces some common
  substitutes with their IPA-compliant counterparts, e.g. ``g → ɡ``, ``ɫ → l̴``,
  ``ʦ → t͡s``. Refer to ``ipatok/data/replacements.tsv`` for a full list. If
  both ``strict`` and ``replace`` are set to ``True``, replacing is done before
  checking for spec compliance.
- ``diphtongs``: if set to ``True``, the function groups together non-syllabic
  vowels with their syllabic neighbours (e.g. ``aɪ̯`` would form a single
  token). If set to ``False`` (the default), vowels are not tokenised together
  unless there is a connecting tie bar (e.g. ``a͡ɪ``).
- ``merge``: expects a ``str, str → bool`` function to be applied to each
  pair of consecutive tokens; those for which the output is ``True`` are merged
  together. You can use this to, e.g., plug in your own diphthong detection
  algorithm:

  >>> tokenise(string, diphtongs=False, merge=custom_func)
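
As a minimal sketch of such a function (not part of ipatok itself), the
predicate below treats two neighbouring tokens as a diphthong when both start
with a vowel letter and exactly one of them carries the non-syllabic diacritic.
The names ``is_diphthong``, ``base_letter``, ``VOWELS`` and ``merge_pairs`` are
hypothetical; ``merge_pairs`` only imitates the merging on a hand-written token
list so that the example runs without the package, and it may differ from what
``tokenise`` does internally::

    import unicodedata

    # Assumption: a plain set of IPA vowel letters is enough for this sketch.
    VOWELS = set('aeiouyɑɐɒæɔɘəɛɜɞɤɨɪɯɵøœɶʉʊʌʏ')
    NON_SYLLABIC = '\u032f'  # combining inverted breve below


    def base_letter(token):
        """Return the first non-combining character of a token, or ''."""
        for char in unicodedata.normalize('NFD', token):
            if not unicodedata.combining(char):
                return char
        return ''


    def is_diphthong(left, right):
        """A str, str -> bool predicate of the shape ``merge`` expects."""
        both_vowels = base_letter(left) in VOWELS and base_letter(right) in VOWELS
        one_non_syllabic = (NON_SYLLABIC in left) != (NON_SYLLABIC in right)
        return both_vowels and one_non_syllabic


    def merge_pairs(tokens, merge):
        """Illustration only: join neighbours for which merge() is True."""
        merged = []
        for token in tokens:
            if merged and merge(merged[-1], token):
                merged[-1] += token
            else:
                merged.append(token)
        return merged


    tokens = ['h', 'a', 'ɪ\u032f', 'z']
    result = merge_pairs(tokens, is_diphthong)
    print(result)  # ['h', 'aɪ̯', 'z']

With the package installed, the predicate would be passed as
``tokenise(string, merge=is_diphthong)``.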

``tokenize`` is an alias for ``tokenise``.
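
The idea behind the ``strict=False`` fallback, guessing the role of a character
from its Unicode category, can be illustrated with the standard library alone.
This is a rough sketch and not ipatok's actual code: the function name
``guess_role`` and the exact grouping of categories are assumptions made for
the example::

    import unicodedata


    def guess_role(char):
        """Guess how a single character might be treated, by Unicode category."""
        category = unicodedata.category(char)
        # Assumption: marks, modifier letters and modifier symbols
        # behave like diacritics.
        if category.startswith('M') or category in ('Lm', 'Sk'):
            return 'diacritic'
        # Assumption: any other kind of letter is a base letter.
        if category.startswith('L'):
            return 'letter'
        # Separators such as the ordinary space.
        if category.startswith('Z'):
            return 'whitespace'
        return 'other'


    roles = [guess_role(char) for char in ['a', 'ʃ', '\u0303', 'ʰ', ' ', '1']]
    print(roles)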


installation
============

This is a standard Python 3 package without dependencies. It is offered at the
`Cheese Shop`_, so you can install it with pip::

    pip install ipatok

Alternatively, you can clone this repo (safe to delete afterwards) and run::

    python setup.py test
    python setup.py install

Either method also works inside a virtualenv/venv.


other IPA packages
==================

- lingpy_ is a historical linguistics suite that includes an ipa2tokens_
  function.
- ipapy_ is a package for working with IPA strings.
- ipalint_ provides a command-line tool for checking IPA datasets for errors
  and inconsistencies.


licence
=======

MIT. Do as you please and praise the snake gods.

.. _`the 2015 revision`: https://www.internationalphoneticassociation.org/sites/default/files/phonsymbol.pdf
.. _`Cheese Shop`: https://pypi.python.org/pypi/ipatok
.. _`lingpy`: https://pypi.python.org/pypi/lingpy
.. _`ipa2tokens`: http://lingpy.org/reference/lingpy.sequence.html#lingpy.sequence.sound_classes.ipa2tokens
.. _`ipapy`: https://pypi.python.org/pypi/ipapy
.. _`ipalint`: https://pypi.python.org/pypi/ipalint


