Metadata-Version: 2.1
Name: dostoevsky
Version: 0.2.1
Summary: Sentiment analysis library for the Russian language
Home-page: https://github.com/bureaucratic-labs/dostoevsky
Author: Bureaucratic Labs
Author-email: hello@b-labs.pro
License: MIT
Keywords: natural language processing,sentiment analysis
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/markdown
Requires-Dist: Keras (==2.2.4)
Requires-Dist: b-labs-models (==2017.8.22)
Requires-Dist: gensim (==3.7.3)
Requires-Dist: pymorphy2 (==0.8)
Requires-Dist: pytest (==5.0.0)
Requires-Dist: razdel (==0.4.0)
Requires-Dist: russian-tagsets (==0.6)
Requires-Dist: scikit-learn (==0.21.2)
Requires-Dist: tensorflow (==1.14.0)

# Dostoevsky [![Build Status](https://travis-ci.org/bureaucratic-labs/dostoevsky.svg?branch=master)](https://travis-ci.org/bureaucratic-labs/dostoevsky) [![FOSSA Status](https://app.fossa.io/api/projects/git%2Bgithub.com%2Fbureaucratic-labs%2Fdostoevsky.svg?type=shield)](https://app.fossa.io/projects/git%2Bgithub.com%2Fbureaucratic-labs%2Fdostoevsky?ref=badge_shield)

<img align="right" src="https://i.imgur.com/uLMWPuL.png">

Sentiment analysis library for the Russian language.

## Install

Note that `Dostoevsky` supports only Python 3.6 and later.

```bash
$ pip install dostoevsky
```

## Social network comment model

This model was trained on the [RuSentiment dataset](https://github.com/text-machine-lab/rusentiment) and achieves an F1 score of up to ~0.70.  
![](https://i.imgur.com/bGAEWvg.png)
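
For readers unfamiliar with the metric, the sketch below shows how a per-class and macro-averaged F1 score can be computed from gold and predicted labels. It is a generic definition, not the evaluation code used for this model. The README does not state which averaging the ~0.70 figure refers to, and the function names (`per_class_f1`, `macro_f1`) and sample labels are made up for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from parallel lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # the predicted label was wrong
            fn[gold] += 1  # the gold label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical labels, for illustration only (not RuSentiment data).
y_true = ['negative', 'positive', 'neutral', 'negative']
y_pred = ['negative', 'positive', 'negative', 'neutral']

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro averaging treats every class equally regardless of how often it occurs. A weighted average would instead scale each class score by its frequency in `y_true`.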

### Usage

First, download the pretrained word embeddings and the model:

```bash
$ dostoevsky download vk-embeddings cnn-social-network-model
```

Then, we can build our pipeline: `text -> tokenizer -> word embeddings -> CNN`

```python
from dostoevsky.tokenization import UDBaselineTokenizer, RegexTokenizer
from dostoevsky.embeddings import SocialNetworkEmbeddings
from dostoevsky.models import SocialNetworkModel

# Pick one tokenizer; RegexTokenizer() can be used in place of UDBaselineTokenizer().
tokenizer = UDBaselineTokenizer()
tokens = tokenizer.split('всё очень плохо')  # "everything is very bad" -> [('всё', 'ADJ'), ('очень', 'ADV'), ('плохо', 'ADV')]

embeddings_container = SocialNetworkEmbeddings()

vectors = embeddings_container.get_word_vectors(tokens)
vectors.shape  # (3, 300) - three words/vectors with dim=300

model = SocialNetworkModel(
    tokenizer=tokenizer,
    embeddings_container=embeddings_container,
    lemmatize=False,
)

messages = [
    'наступили на ногу',  # "stepped on my foot"
    'всё суперски',  # "everything is super"
]

results = model.predict(messages)

for message, sentiment in zip(messages, results):
    print(message, '->', sentiment)  # наступили на ногу -> negative
```
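
Once predictions are available, a common next step is to tally them. The sketch below assumes that `model.predict` returns one label string per message, as the `-> negative` comment in the example above suggests. It does not call the library: the `results` list is a hard-coded, hypothetical stand-in, and `summarize` is a helper invented for this illustration.

```python
from collections import Counter


def summarize(messages, labels):
    """Count predicted labels and group messages by label."""
    if len(messages) != len(labels):
        raise ValueError('messages and labels must have the same length')
    counts = Counter(labels)
    grouped = {}
    for message, label in zip(messages, labels):
        grouped.setdefault(label, []).append(message)
    return counts, grouped


messages = [
    'наступили на ногу',  # "stepped on my foot"
    'всё суперски',  # "everything is super"
]

# Hypothetical predictions standing in for model.predict(messages).
results = ['negative', 'positive']

counts, grouped = summarize(messages, results)
for label, count in counts.items():
    print(label, count, grouped[label])
```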


## License
[![FOSSA Status](https://app.fossa.io/api/projects/git%2Bgithub.com%2Fbureaucratic-labs%2Fdostoevsky.svg?type=large)](https://app.fossa.io/projects/git%2Bgithub.com%2Fbureaucratic-labs%2Fdostoevsky?ref=badge_large)


