Metadata-Version: 2.1
Name: soylemma
Version: 0.1.1
Summary: Trained Korean Lemmatizer
Home-page: https://github.com/lovit/korean_lemmatizer
Author: lovit
Author-email: soy.lovit@gmail.com
License: UNKNOWN
Keywords: korean-nlp,nlp,lemmatizer
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.6
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown

# 한국어 용언 분석기 (Korean Lemmatizer)

한국어의 동사와 형용사의 활용형 (surfacial form) 을 분석합니다. 한국어 용언 분석기는 다음의 기능을 제공합니다.

1. 입력된 단어를 어간 (stem) 과 어미 (eomi) 으로 분리
1. 입력된 단어를 원형으로 복원

이 패키지의 구현 원리는 [github.io 블로그][io]에 정리하였습니다.

[io]: https://lovit.github.io/nlp/2019/01/22/trained_kor_lemmatizer/

## Usage

### analyze, lemmatize, conjugate

`analyze` function returns morphemes of the given predicator word

```python
from soylemma import Lemmatizer

lemmatizer = Lemmatizer()
lemmatizer.analyze('차가우니까')
```

The return value forms list of tuples because there can be more than one morpheme combination.

```
[(('차갑', 'Adjective'), ('우니까', 'Eomi'))]
```

`lemmatize` function returns lemma of the given predicator word.

```python
lemmatizer.lemmatize('차가우니까')
```

```
[('차갑다', 'Adjective')]
```

If the input word is not predicator such as Noun, it return empty list.

```python
lemmatizer.lemmatize('한국어') # []
```

`conjugate` function returns surfacial form. You should put stem and eomi as arguments. It returns all possible surfacial forms for the given stem and eomi.

```python
lemmatizer.conjugate(stem='차갑', eomi='우니까')
lemmatizer.conjugate('예쁘', '었던')
```

```
['차가우니까', '차갑우니까']
['예뻤던', '예쁘었던']
```

### update dictionaries and rules

For demonstration, we use dictioanry `demo`.

`어여뻤어` cannot be analyzed because the adjective `어여쁘` does not enrolled in dictionary.

```python
from soylemma import Lemmatizer

lemmatizer = Lemmatizer(dictionary_name='demo')
print(lemmatizer.analyze('어여뻤어')) # []
```

So, we add the word with tag using `add_words` function. Do it again. Then you can see the word `어여뻤어` is analyzed.

```python
lemmatizer.add_words('어여쁘', 'Adjective')
lemmatizer.analyze('어여뻤어')
```

```
[(('어여쁘', 'Adjective'), ('었어', 'Eomi'))]
```

However, the word `파랬다` is still not able to be analyzed because the lemmatization rule for surfacial form `랬` does not exist.

```python
lemmatizer.analyze('파랬다') # []
```

So, in this time, we update additional lemmatization rules using `add_lemma_rules` function.

```python
supplements = {
    '랬': {('랗', '았')}
}

lemmatizer.add_lemma_rules(supplements)
```

After that, we can see the word `파랬다` is analyzed, and also conjugation of `파랗 + 았다` is available.

```python
lemmatizer.analyze('파랬다')
lemmatizer.conjugate('파랗', '았다')
```

```
[(('파랗', 'Adjective'), ('았다', 'Eomi'))]
['파랬다', '파랗았다']
```

### debug on

If you wonder which subwords came up as candidates of (stem, eomi), use `debug`.

```python
lemmatizer.analyze('파랬다', debug=True)
```

```
[DEBUG] word: 파랬다 = 파랗 + 았다, conjugation: 랬 = 랗 + 았
[(('파랗', 'Adjective'), ('았다', 'Eomi'))]
```

### lemmatization rule extractor

You can extract lemmatization rule using `extract_rule` function.

```python
from soylemma import extract_rule

eojeol = '로드무비였다'
lw = '로드무비이'
lt = 'Adjective'
rw = '었다'
rt = 'Eomi'

extract_rule(eojeol, lw, lt, rw, rt)
```

```
('였다', ('이', '었다'))
```


