Metadata-Version: 2.1
Name: extra-model
Version: 0.2.0rc1
Summary: Code to run the Extra algorithm for unsupervised topic extraction.
Home-page: https://github.com/wayfair-incubator/extra-model
Author: Misha Balyasin
Author-email: mbalyasin@wayfair.com
License: MIT License
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/markdown
Requires-Dist: numpy (==1.15.2)
Requires-Dist: nltk (==3.4.5)
Requires-Dist: scikit-learn (==0.20.0)
Requires-Dist: vaderSentiment (==3.3.2)
Requires-Dist: pandas (==0.23.3)
Requires-Dist: pycld2 (==0.31)
Requires-Dist: networkx (==2.2)
Requires-Dist: gensim (==3.5.0)
Requires-Dist: cytoolz (==0.9.0)
Requires-Dist: spacy (==2.0.18)
Requires-Dist: requests (==2.20.1)

[![codecov](https://codecov.io/gh/wayfair-incubator/extra-model/branch/main/graph/badge.svg?token=HXSGN5IUzu)](https://codecov.io/gh/wayfair-incubator/extra-model)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)

TODO: update badges

# extra-model

Code to run the Extra [algorithm](https://aclanthology.info/papers/D18-1384/d18-1384) for unsupervised topic extraction.

## Quick start

**IMPORTANT**: 
1. When running Extra inside docker-container, make sure that Docker process has enough resources. 
For example, on Mac/Windows it should have at least 8 Gb of RAM available to it.
1. GitHub repo does **not** come with Glove Embeddings. See the next section for how to download the reuired embeddings.

### Downloading Embeddings

**This package does not come with the required Glove embeddings and they must be downloaded before use.**

To download the required embeddings, run the following command:

```bash
docker-compose run --rm setup
```

The embeddings will be downloaded, unzipped and formatted into a space-efficient format. Files will be saved in the `embeddings/` directory in the root of the project directory. If the process fails, it can be safely restarted. If you want to restart the process with new files, delete all files except `README.md` in the `embeddings/` directory.

### Using docker-compose

First, build the image:

```bash
docker-compose build
```

Then running `extra-model` is as simple as:

```bash
docker-compose run extra-model /package/tests/resources/100_comments.csv
```

NOTE: when using this approach, input file should be mounted inside the container.
By default, everything from `extra-model` folder will be mounted to `/package/` folder. 
This can be changed in `docker-compose.yaml`

This will produce a `result.csv` file in `/io/` (default setting) folder. 

Location of the output can be changed by supplying second path, e.g.:

```bash
docker-compose run extra-model /package/tests/resources/100_comments.csv /io/another_folder
```

### Using command line

TODO: add this section.

## `extra-model` input

Input of an extra is a `.csv` file with 2 columns: `CommentId` and `Comments`. 
Both must be present and named exactly in that way.

## `extra-model` output

After `extra-model` finishes calculations, it'll produce a `.csv` file with following structure:

```text
AdCluster,Aspect,AspectCount,CommentId,Descriptor,Position,SentimentBinary,SentimentCompound,Topic,TopicCount,TopicImportance,TopicSentimentBinary,TopicSentimentCompound,WordnetNode
only,downside,1,321,only,9,0.0,0.0,downside.n.01,1,0.005572645018795278,0.0,0.0,downside.n.01
more,nothing,1,74,more,54,0.0,0.0,nothing.n.01,1,0.005572645018795278,0.0,0.0,nothing.n.01
clean,bathrooms,1,146,clean,4,1.0,0.4019,toilet.n.01,1,0.005572645018795278,1.0,0.4019,toilet.n.01
decorated,place,5,146,decorated,32,0.0,0.0,home.n.01,6,0.03343587011277168,0.0,-0.01131666666666666,home.n.01
```

Columns have following meaning:

|Column                      | Description |
|:----------------------|:----------------------|
|AdCluster              |Adjectives are clustered together and this indicates the "center" of a cluster (e.g., "awesome", "fantastic", "great" descriptors might produce "great" as `AdCluster`)|
|Aspect                 |Identified aspect - this is an actual word that person wrote in a text|
|AspectCount            |How often this aspect has been found in all of the input|
|CommentId              |`ID` of an input. Since one input may produce multiple aspects, `ID` column must always be present|
|Descriptor             |Identified adjective (not clustered) - this is an actual word that person wrote in a text|
|Position               |Character number where aspect was found (e.g., "nice shirt" will have aspect "shirt" and `Position` 6|
|SentimentBinary        |Binary sentiment for aspect|
|SentimentCompound      |Compound sentiment for aspect|
|Topic                  |Collection of aspects.|
|TopicCount             |How often topic has been found in input|
|TopicImportance        |Importance of a topic|
|TopicSentimentBinary   |Similar to aspect, but on a topic level|
|TopicSentimentCompound |Similar to aspect, but on a topic level|
|WordnetNode            |Mapping to `wordnet` node. Identifiers in the form `.n.01` mean first meaning of the noun in `wordnet`| 


## Extra workflow

The workflow follows the algorithm suggested in the paper and has following stages:

### Filtering (`_filter.py`)
Get rid of cruft in the input data:
*  empty text fields
*  requires at least 20 characters of text
*  remove unprintable unicode characters
*  filter for english language using Googles `cld2` tool

### Generate aspects (`_aspects.py`)
Extracts promising phrases (i.e., nouns described by adjectives) using `spacy`.

### Aggregate aspects into topics (`_topics.py`)
Takes the output of the phrase extraction, maps them to `wordnet` (via `_disambiguate.py`) and produces the list of clustered aspects
important dependencies:
* `sklearn` for clustering
* `nltk` for the `wordnet`
* `networkx` for the semantic tree
* pretrained word-vectors (via `_vectorizer.py`)
* `vaderSentiment` for sentiment analysis

### Analyze descriptors (`_adjectives.py`)
Cluster the associated adjectives using constant radius clustering. 

### Link information (`_summarize.py`)
To make the output more useful, we want to link the topics back to the original texts and vice versa.

The whole code produces one csv file.

## CI

TODO: update this section

This project comes with a GitHub Actions pipeline definition. 

## Develop

First, please [install docker](https://docs.docker.com/install/) on your computer.
Docker must be running correctly for these commands to work. 

* If you are using windows, please make sure your editor writes files with the linefeed (`\n`) line endings.*

Next, clone the repo:

TODO: update this

```bash
git clone 
cd extra-model
```

Then run the test suite to see if docker is set up correctly:

```bash
docker-compose run test
```

## Testing

You'll be unable to merge code unless the linting and tests pass. You can run these in your container via `docker-compose run test`.

The tests, linting, and code coverage are run automatically via CI, and you'll see the output on your pull requests.

Generally we should endeavor to write tests for every feature. 
Every new feature branch should increase the test coverage rather than decreasing it.

We use [pytest](https://docs.pytest.org/en/latest/) as our testing framework.

To test/lint your project, you can run `docker-compose run test`.

#### Stages

TODO: update this 

To customize / override a specific testing stage, please read the documentation specific to that tool:

1. PyTest: [https://docs.pytest.org/en/latest/contents.html](https://docs.pytest.org/en/latest/contents.html)
1. Black: [https://black.readthedocs.io/en/stable/](https://black.readthedocs.io/en/stable/)
1. Flake8: [http://flake8.pycqa.org/en/latest/](http://flake8.pycqa.org/en/latest/)
1. Bandit: [https://bandit.readthedocs.io/en/latest/](https://bandit.readthedocs.io/en/latest/)
1. iSort: [https://pycqa.github.io/isort/](https://pycqa.github.io/isort/)
1. pydocstyle: [http://www.pydocstyle.org/en/stable/](http://www.pydocstyle.org/en/stable/)


## Documentation

TODO: change to sphinx site:

Check out the [project documentation]().

# Authors

`extra-model` was written by `mbalyasin@wayfair.com`, `mmozer@wayfair.com`.


