Metadata-Version: 2.0
Name: bigcode-embeddings
Version: 0.1.2
Summary: Tool to generate and visualize embeddings from bigcode
Home-page: https://github.com/tuvistavie/bigcode-tools/tree/master/bigcode-embeddings
Author: Daniel Perez
Author-email: tuvistavie@gmail.com
License: UNKNOWN
Download-URL: https://github.com/tuvistavie/bigcode-tools/archive/master.zip
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: matplotlib
Requires-Dist: plotly

# bigcode-embeddings

NOTE: data must be generated with [`bigcode-ast-tools`][1] before this tool
can be used.

`bigcode-embeddings` generates and visualizes embeddings for
AST nodes.

## Install

This project should be used with Python 3.

To install the package, either run

```
pip install bigcode-embeddings
```

or clone the repository and run

```
cd bigcode-embeddings
pip install -r requirements.txt
python setup.py install
```

NOTE: tensorflow needs to be installed separately.

## Usage

### Training embeddings

Training data can be generated using [`bigcode-ast-tools`][1].

Given a `data.txt.gz` generated from a vocabulary of size 30000,
100-dimensional embeddings can be trained using

```
./bin/bigcode-embeddings train -o embeddings/ --vocab-size 30000 --emb-size 100 --l2-value 0.05 --learning-rate 0.01 data.txt.gz
```

[TensorBoard][2] can be used to visualize the training progress:

```
tensorboard --logdir embeddings/
```

After the first epoch, the embeddings visualization becomes available in
TensorBoard. The vocabulary TSV file generated by `bigcode-ast-tools` can
be loaded to label the embeddings.

### Visualizing the embeddings

Trained embeddings can be visualized using the `visualize` subcommand.
If the generated vocabulary file is `vocab.tsv`, the embeddings trained
above can be visualized with the following command:

```
./bin/bigcode-embeddings visualize clusters -m embeddings/embeddings.bin-STEP -l vocab.tsv
```

where `STEP` should be the largest value found in the `embeddings/` directory.
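
Finding the largest `STEP` by hand can be tedious after a long training run. The following is a minimal sketch of how it could be looked up in Python. It is not part of `bigcode-embeddings`: the helper name `latest_step` is hypothetical, and the sketch assumes that the files in the output directory are named `embeddings.bin-STEP` followed by an optional suffix (for example `embeddings.bin-1000.index`).

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: saved files are named "embeddings.bin-STEP", optionally
# followed by a suffix such as ".index" or ".meta".
STEP_PATTERN = re.compile(r"^embeddings\.bin-(\d+)")


def latest_step(directory: str) -> Optional[int]:
    """Return the largest STEP among embeddings.bin-STEP* files, or None."""
    steps = []
    for path in Path(directory).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


if __name__ == "__main__":
    step = latest_step("embeddings")
    if step is None:
        print("no embeddings.bin-STEP files found")
    else:
        # This path can be passed to the -m option of the visualize command.
        print(f"embeddings/embeddings.bin-{step}")
```

The steps are compared as integers rather than strings, so that `embeddings.bin-900` is not ranked above `embeddings.bin-2500`.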

The `-i` flag can be passed to generate an interactive plot.

[1]: ../bigcode-ast-tools/README.md
[2]: https://github.com/tensorflow/tensorboard


