Metadata-Version: 2.1
Name: dictok
Version: 0.0.3
Summary: A dictionary-based tokenizer.
Home-page: https://github.com/pypa/dictok
Author: Samuel Frontull
Author-email: samuelfrontull@gmail.com
Project-URL: Bug Tracker, https://github.com/pypa/dictok/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# DicTok

A dictionary-based tokenizer.
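It splits a text into tokens by matching it against a list of known tokens, so a compound such as `notebook` is broken into `note` and `book`.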

## Installation

```
pip install dictok
```

## Usage

1. Create a dictionary file, e.g. `tokens.dic`, that lists one known token per line:

```
super
man
note
book
store
...
```

2. Import `dictok` and pass the dictionary file to `DicTok` as the main parameter:

```
>>> import dictok
>>> dt = dictok.DicTok('tokens.dic')
```

3. Call `tokenize` on the text you want to split:

```
>>> sent = "Superman bought a notebook in the bookstore."
>>> dt.tokenize(sent)
['Super', 'man', 'bought', 'a', 'note', 'book', 'in', 'the', 'book', 'store', '.']
```
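
The idea behind dictionary-based tokenization can be illustrated with a minimal sketch. It is not DicTok's actual implementation: the function name `sketch_tokenize`, the greedy longest-match strategy, and the case-insensitive lookup are all assumptions made for illustration.

```python
import re


def sketch_tokenize(text, known_tokens):
    """Hypothetical helper: greedy longest-match split into known tokens."""
    vocab = {token.lower() for token in known_tokens}
    max_len = max((len(token) for token in vocab), default=0)
    result = []
    # Break the text into runs of word characters and single punctuation marks.
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        pos = 0
        unknown_start = None  # start of a pending run that matched nothing
        while pos < len(chunk):
            match_end = None
            # Try the longest candidate first, then shorter ones.
            for end in range(min(len(chunk), pos + max_len), pos, -1):
                if chunk[pos:end].lower() in vocab:
                    match_end = end
                    break
            if match_end is None:
                # No known token starts here: extend the unknown run.
                if unknown_start is None:
                    unknown_start = pos
                pos += 1
            else:
                # Flush any pending unknown run before the known token.
                if unknown_start is not None:
                    result.append(chunk[unknown_start:pos])
                    unknown_start = None
                result.append(chunk[pos:match_end])
                pos = match_end
        if unknown_start is not None:
            result.append(chunk[unknown_start:])
    return result


known = ["super", "man", "note", "book", "store"]
print(sketch_tokenize("Superman bought a notebook in the bookstore.", known))
```

On this sentence the sketch prints the same list as the `dt.tokenize(sent)` call above. Other inputs may differ, since the library's matching rules are not described here.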

## Options

To drop single characters or unknown tokens from the result, set `include_single_chars` or `include_unknown` to `False`:

```
>>> dt.tokenize(sent, include_unknown=False, include_single_chars=False)
['Super', 'man', 'note', 'book', 'book', 'store']
```
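
One way to picture these two options is as a filter over the full token list. The sketch below is an assumption about their semantics, not DicTok's code: `filter_tokens` is a hypothetical helper, and it treats a token as unknown when its lowercase form is missing from the dictionary.

```python
def filter_tokens(tokens, known_tokens, include_unknown=True, include_single_chars=True):
    """Hypothetical helper: drop unknown and/or single-character tokens."""
    vocab = {token.lower() for token in known_tokens}
    kept = []
    for token in tokens:
        if not include_single_chars and len(token) == 1:
            continue  # skip one-character tokens such as 'a' or '.'
        if not include_unknown and token.lower() not in vocab:
            continue  # skip tokens that are not in the dictionary
        kept.append(token)
    return kept


known = ["super", "man", "note", "book", "store"]
full = ['Super', 'man', 'bought', 'a', 'note', 'book', 'in', 'the', 'book', 'store', '.']
print(filter_tokens(full, known, include_unknown=False, include_single_chars=False))
```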

To recognize and correct misspelled words, list each misspelling together with its correction
as a comma-separated pair in the dictionary:

```
super
man
note
book
buok,book
store
stohre,store
...
```

```
>>> dt = dictok.DicTok('tokens.dic')
>>> sent = "Superman bought a notebuok in the bookstohre."
>>> dt.tokenize(sent, include_unknown=False, include_single_chars=False)
['Super', 'man', 'note', 'book', 'book', 'store']
```
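
The pair format can be read as a mapping from the form found in the text to the token that is returned. The following sketch shows one possible way to parse such lines. `load_entries` is a hypothetical helper and does not reflect how DicTok reads its dictionary file internally.

```python
def load_entries(lines):
    """Hypothetical helper: map each listed form to the token to emit for it."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # "buok,book" -> surface "buok", target "book"; "note" -> no target.
        surface, _, target = line.partition(",")
        surface = surface.strip()
        target = target.strip()
        # A plain entry, or a pair with an empty right side, maps to itself.
        mapping[surface] = target if target else surface
    return mapping


entries = load_entries(["super", "man", "note", "book", "buok,book", "store", "stohre,store"])
print(entries["buok"])    # book
print(entries["stohre"])  # store
print(entries["note"])    # note
```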
