Metadata-Version: 2.3
Name: polars_sim
Version: 0.2.3
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Dist: polars[pyarrow] >=1.8.2
License-File: LICENSE.md
Summary: Fast similarity join for polars DataFrames.
Author-email: "Dr. André Schemaitat" <a.schemaitat@gmail.com>
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# polars_sim

## Description

Implements an **approximate join** of two polars dataframes based on string columns.


Currently, the strings are vectorized on the fly with a fixed vectorization scheme.
The resulting vectors are then fed into a sparse matrix multiplication combined with a
top-n selection, which yields the cosine similarities of the individual string pairs.
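
As a rough illustration of the pipeline described above, the following pure-Python sketch vectorizes strings, L2-normalizes the vectors, takes dot products (which are then cosine similarities), and keeps the top-n matches per left string. The vectorization shown here (lowercased character trigrams per whitespace-separated token) and the helper names `trigram_vector`, `cosine`, and `top_n_matches` are assumptions made for illustration only; they are not taken from the library, which works on sparse matrices rather than Python dictionaries.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> dict[str, float]:
    """L2-normalized character trigram counts (assumed vectorization scheme)."""
    counts = Counter(
        token[i : i + 3]
        for token in text.lower().split()
        for i in range(len(token) - 2)
    )
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Dot product of two sparse, already normalized vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


def top_n_matches(left: list[str], right: list[str], top_n: int):
    """For each left string, keep the top_n right strings with non-zero similarity."""
    right_vectors = [(r, trigram_vector(r)) for r in right]
    result = {}
    for l in left:
        left_vector = trigram_vector(l)
        scored = [(r, cosine(left_vector, rv)) for r, rv in right_vectors]
        scored = [(r, s) for r, s in scored if s > 0.0]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[l] = scored[:top_n]
    return result


matches = top_n_matches(["Alice", "Bob"], ["Ali", "Bobby", "Tom"], top_n=2)
print(matches)
```

Under these assumptions, "Alice" shares one of its three trigrams with "Ali", so the pair scores 1/sqrt(3), roughly 0.577, and "Tom" is dropped because it shares no trigram with either left string.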

The `join_sim` function is similar to `join_asof` but for strings instead of timestamps.

## Installation

```bash
pip install polars_sim
```

## Development

We use [uv](https://docs.astral.sh/uv/) for Python package management. You also need Rust to be installed, see [install rust](https://www.rust-lang.org/tools/install). You won't need to activate an environment yourself at any point; uv handles this. To get started, run:
```bash
# create a virtual environment
uv venv --seed -p 3.11
# install dependencies
uv pip install -e .
# install dev dependencies
uv pip install -r requirements.txt
# compile rust code
make install
# run tests
make test
```

## Usage

```python
import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

print(df)
# shape: (3, 3)
# ┌───────┬──────────┬─────────────────────┐
# │ name  ┆ sim      ┆ name_right          │
# │ ---   ┆ ---      ┆ ---                 │
# │ str   ┆ f64      ┆ str                 │
# ╞═══════╪══════════╪═════════════════════╡
# │ Alice ┆ 0.57735  ┆ Ali                 │
# │ Alice ┆ 0.522233 ┆ Alice in Wonderland │
# │ Bob   ┆ 0.57735  ┆ Bobby               │
# └───────┴──────────┴─────────────────────┘
```

## Performance

A benchmark can be executed with `make run-bench`.
In general, performance depends heavily on the lengths of the dataframes.
By default, the computation is parallelized over one of the two dataframes, chosen according to their sizes.
If the left dataframe is comparatively small, the computation is parallelized over the right dataframe, and vice versa. To override this choice, set the `threading_dimension` parameter.
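
To make the idea of a threading dimension more concrete, here is a minimal sketch that splits the work along the longer of two inputs and scores the chunks in a thread pool. The helper names (`choose_dimension`, `chunks`, `parallel_pairs`) and the simple "pick the longer side" rule are hypothetical; the library's actual heuristic and implementation are not described here, and Python threads only illustrate the splitting, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_dimension(n_left: int, n_right: int) -> str:
    """Hypothetical rule: parallelize over the longer input."""
    return "left" if n_left >= n_right else "right"


def chunks(items: list, n_chunks: int) -> list[list]:
    """Split items into at most n_chunks contiguous parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i : i + size] for i in range(0, len(items), size)]


def parallel_pairs(left: list, right: list, score, n_workers: int = 4):
    """Score every (left, right) pair, splitting the work along one input."""
    dim = choose_dimension(len(left), len(right))
    if dim == "left":
        parts = chunks(left, n_workers)

        def work(part):
            return [(l, r, score(l, r)) for l in part for r in right]
    else:
        parts = chunks(right, n_workers)

        def work(part):
            return [(l, r, score(l, r)) for r in part for l in left]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves the order of the submitted chunks
        rows = [row for rows_part in pool.map(work, parts) for row in rows_part]
    return dim, rows


dim, rows = parallel_pairs(["a"], ["a", "b", "c"], lambda l, r: float(l == r))
print(dim, rows)
```

Here the left input has one element and the right has three, so the sketch splits the right side into chunks and each worker compares its chunk against the full left side.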

## References

The implementation is based on an algorithm used in [sparse_dot_topn](https://github.com/ing-bank/sparse_dot_topn), which itself improves on SciPy's sparse matrix multiplication.

