Metadata-Version: 2.1
Name: scikit-bloom
Version: 0.2.0
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Dist: scikit-learn >=1.0.0
Requires-Dist: scikit-partial >=0.1.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

<img src="icon.svg" width="125" height="125" align="right" />

# Scikit-Bloom 

> An excuse to play with Rust, but also a neat trick for sklearn!

This package contains some bloom tricks for text pipelines in scikit-learn. To learn more about this trick, check out [this blogpost](https://explosion.ai/blog/bloom-embeddings).

You can install it via:

```
python -m pip install scikit-bloom
```

And you can import the components via: 

```python
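from skbloom import BloomVectorizer, BloomishVectorizer, SlowBloomVectorizer

# X is any iterable of raw text documents, for example:
X = ["hello world", "bloom tricks for text pipelines"]

BloomVectorizer().fit(X).transform(X)
BloomishVectorizer().fit(X).transform(X)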
```

The `BloomVectorizer` uses Rust under the hood for the hashing that constructs the bloom representation. The `BloomishVectorizer` just runs the [HashingVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) from scikit-learn multiple times in sequence. The `SlowBloomVectorizer` is pretty much the same as the `BloomVectorizer` in terms of features, but is implemented in Python.
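
To make the idea concrete, here is a minimal pure-Python sketch of the bloom trick: each token is hashed several times, and every hash increments a bucket in one shared feature vector. This is not the package's implementation. The function names `bloom_indices` and `bloom_vectorize`, the whitespace tokenizer, and the seed-salted MD5 hashing are all assumptions made for illustration.

```python
import hashlib


def bloom_indices(token, n_hash=3, n_features=10_000):
    """Return one bucket index per hash function for a single token."""
    indices = []
    for seed in range(n_hash):
        # Assumption: salting the token with a seed stands in for
        # n_hash independent hash functions.
        digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
        indices.append(int.from_bytes(digest[:8], "little") % n_features)
    return indices


def bloom_vectorize(text, n_hash=3, n_features=10_000):
    """Map a text to a sparse {bucket index: count} dictionary."""
    row = {}
    # Assumption: lowercase + whitespace split as a stand-in tokenizer.
    for token in text.lower().split():
        for idx in bloom_indices(token, n_hash, n_features):
            row[idx] = row.get(idx, 0) + 1
    return row


row = bloom_vectorize("the quick brown fox", n_hash=3, n_features=10_000)
print(sorted(row.items()))
```

With `n_hash=1` the sketch hashes every token just once, which is in the spirit of what the `HashingVectorizer` does.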

## Benchmarks 

I ran a quick benchmark, which seems to suggest the approach is pretty speedy. 

<details>
    <summary>Show me the code</summary>

```python
import time
from datasets import load_dataset
from skbloom import BloomVectorizer, BloomishVectorizer, SlowBloomVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

dataset = load_dataset("clinc_oos", "plus")
texts = dataset['train']['text'] * 10

trials = [BloomVectorizer(n_features=10_000), 
          BloomishVectorizer(n_features=10_000), 
          SlowBloomVectorizer(n_features=10_000), 
          HashingVectorizer(n_features=10_000)]

for trial in trials:
    tic = time.time()
    trial.fit_transform(texts)
    toc = time.time()
    print(f"{trial.__class__.__name__}: {toc - tic}")
```
</details>

In this benchmark each vectorizer builds a representation with 10,000 features.

| Approach            | Time taken (s) | Description |
| ------------------- | -------------- | ----------- |
| BloomVectorizer     | 1.562          | The speedy Rust implementation |
| BloomishVectorizer  | 2.111          | Using sklearn's implementation sequentially |
| SlowBloomVectorizer | 5.259          | A pure Python implementation |
| HashingVectorizer   | 0.695          | Using sklearn's hashing vectorizer to only hash once |

You can also run the `BloomVectorizer` with a single hash (`n_hash=1`), in which case it seems to be competitive with the `HashingVectorizer`.

<details>
    <summary>Show me the code</summary>

```python
import time
from datasets import load_dataset
from skbloom import BloomVectorizer, BloomishVectorizer, SlowBloomVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

dataset = load_dataset("clinc_oos", "plus")
texts = dataset['train']['text'] * 10

for feats in [3000, 5000, 10000, 20000, 100_000]:
    trials = [BloomVectorizer(n_hash=1, n_features=feats), HashingVectorizer(n_features=feats)]
    for trial in trials:
        tic = time.time()
        trial.fit_transform(texts)
        toc = time.time()
        print(f"{feats}: {trial.__class__.__name__}: {toc - tic}")

```
</details>

| Number of features | `BloomVectorizer` (s) | `HashingVectorizer` (s) |
| ------------------ | --------------------- | ----------------------- |
| 3000               | 0.6071                | 0.6864                  |
| 5000               | 0.6092                | 0.6947                  |
| 10000              | 0.6123                | 0.6911                  |
| 20000              | 0.6124                | 0.6918                  |
| 100000             | 0.6108                | 0.6938                  |


I want to be careful about suggesting that the `BloomVectorizer` is always faster, because the `HashingVectorizer` comes with many more features. It can build n-gram representations, to mention one example, which the `BloomVectorizer` does not do. But it does seem to be at least competitive, which is neat.

## Important 

In fairness, while this trick is interesting, you _might_ be fine just using the `HashingVectorizer` that comes with sklearn. This project works, but it was also an excuse for me to try out Rust.

It's a nice motivating example for me to learn Rust, partly because it's a tangible example from a field that I am familiar with. It has also been a relatively low investment to rewrite an expensive bit of code in Rust.

## Development

These are mainly some notes for myself. 

To install all of this locally:

```
python -m pip install maturin 
maturin develop
python -m pip install -e .
```

