Metadata-Version: 2.1
Name: removedup
Version: 1.0.1
Summary: Remove duplicates from parallel corpora
Home-page: https://github.com/LibreTranslate/RemoveDUP
Author: Piero Toffanin
Author-email: pt@masseranolabs.com
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: test
Requires-Dist: pytest >=6.0 ; extra == 'test'

# RemoveDup

A fast, memory-efficient Python module to remove duplicates from parallel text corpora.

It's useful for cleaning up language model training datasets that contain duplicate entries.

## Installation

```bash
pip install removedup
```

## Usage

```python
from removedup import rdup

src, tgt, removed = rdup("source.txt", "target.txt")
print(src, tgt, removed)
# source.txt.dedup
# target.txt.dedup
# <num lines removed>
```

## Notes

Source and target must have the same number of lines. No validation checks are made.
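
Since no validation is performed, you may want to compare line counts yourself before deduplicating. The following is a minimal sketch using only the standard library; `count_lines` and `check_parallel` are hypothetical helper names and are not part of removedup.

```python
def count_lines(path):
    """Count lines by iterating over the file in binary mode."""
    count = 0
    with open(path, "rb") as f:
        for _ in f:
            count += 1
    return count


def check_parallel(src_path, tgt_path):
    """Raise ValueError if the two files have different line counts."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"line count mismatch: {src_path} has {n_src}, {tgt_path} has {n_tgt}"
        )
    return n_src
```

Calling `check_parallel("source.txt", "target.txt")` before `rdup` would then fail early on misaligned files.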

Duplication checks are only made on the source content. If you want to check for duplicates on the target, simply switch the order of the parameters.
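
To illustrate what source-side deduplication means, here is a small pure-Python sketch. It is not removedup's implementation: the function name `dedup_on_source` is hypothetical, and keeping the first occurrence of each source line is an assumption made for the example.

```python
def dedup_on_source(pairs):
    """Drop pairs whose source text was already seen; targets are not compared."""
    seen = set()
    kept = []
    removed = 0
    for src, tgt in pairs:
        if src in seen:
            # Same source line as an earlier pair: drop the whole pair.
            removed += 1
            continue
        seen.add(src)
        kept.append((src, tgt))
    return kept, removed


pairs = [("hello", "ciao"), ("hello", "salve"), ("bye", "ciao")]
kept, removed = dedup_on_source(pairs)
print(kept)     # [('hello', 'ciao'), ('bye', 'ciao')]
print(removed)  # 1
```

Note that the repeated target `ciao` survives, because only the source side is checked. Swapping the two sides of each pair would deduplicate on the target instead.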

## Build

```bash
git clone https://github.com/LibreTranslate/RemoveDup
cd RemoveDup
python setup.py build
```

## Standalone Binary

You can also use removedup as a standalone Windows, macOS or Linux application. We don't currently provide binaries, so you need to build it from source.

```bash
mkdir build
cd build && cmake .. && make -j4
./rdup source.txt target.txt
```

## Contributing

We welcome pull requests!

## License

AGPLv3
