Metadata-Version: 2.1
Name: datatrove
Version: 0.0.1.dev0
Summary: HuggingFace library to process and filter large amounts of webdata
Home-page: https://github.com/huggingface/datatrove
Author: HuggingFace Inc.
Author-email: guilherme@huggingface.co
License: Apache 2.0
Keywords: data machine learning processing
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.7.0
Description-Content-Type: text/markdown
Requires-Dist: boto3==1.28.78
Requires-Dist: cchardet==2.1.7
Requires-Dist: inscriptis==2.3.2
Requires-Dist: loguru==0.7.0
Requires-Dist: multiprocess==0.70.14
Requires-Dist: nltk==3.8.1
Requires-Dist: numpy==1.25.0
Requires-Dist: python-magic==0.4.27
Requires-Dist: trafilatura==1.6.1
Requires-Dist: warcio==1.7.4
Requires-Dist: zstandard==0.21.0
Requires-Dist: pyarrow==12.0.1
Requires-Dist: tokenizers==0.13.3
Requires-Dist: tldextract==3.4.4
Requires-Dist: pandas==2.0.3
Requires-Dist: backoff==2.2.1
Requires-Dist: fsspec==2023.9.2
Requires-Dist: humanize==4.8.0
Requires-Dist: rich==13.7.0
Provides-Extra: dev
Requires-Dist: black~=23.1; extra == "dev"
Requires-Dist: pre-commit>=3.3.3; extra == "dev"
Requires-Dist: pytest>=7.2.0; extra == "dev"
Requires-Dist: pytest-timeout; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: ruff<=0.0.259,>=0.0.241; extra == "dev"

# datatrove

## Installation

Install the package in editable mode with the `dev` extras, from the repository root:
```bash
pip install -e ".[dev]"
```
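
To confirm that the editable install registered the package, you can query the installed distribution metadata. The snippet below is a minimal sketch rather than part of datatrove: `installed_version` is a hypothetical helper name, and it relies on the standard-library `importlib.metadata` module, which assumes Python 3.8 or newer.

```python
from importlib import metadata  # standard library, available from Python 3.8
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    # Prints the version string from the package metadata, or None if the
    # install step above has not been run in this environment.
    print(installed_version("datatrove"))
```

A result of `None` suggests that the install ran in a different environment than the one currently active.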

Install the pre-commit code-style hooks:
```bash
pre-commit install
```

Run the tests:
```bash
pytest -n 4 --max-worker-restart=0 --dist=loadfile tests
```
