Metadata-Version: 2.1
Name: dcbench
Version: 0.0.1
Summary: This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.
Home-page: UNKNOWN
Author: https://github.com/data-centric-ai/dcbench
Author-email: sabri@eyuboglu.us
License: Apache 2.0
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: click (>=8.0.0)
Requires-Dist: pyyaml (>=5.4)
Requires-Dist: pre-commit
Requires-Dist: pandas
Requires-Dist: numpy (>=1.18.0)
Requires-Dist: ujson
Requires-Dist: jsonlines (>=1.2.0)
Requires-Dist: tqdm (>=4.49.0)
Requires-Dist: scikit-learn
Provides-Extra: dev
Requires-Dist: black (>=21.5b0) ; extra == 'dev'
Requires-Dist: isort (>=5.7.0) ; extra == 'dev'
Requires-Dist: flake8 (>=3.8.4) ; extra == 'dev'
Requires-Dist: docformatter (>=1.4) ; extra == 'dev'
Requires-Dist: pytest-cov (>=2.10.1) ; extra == 'dev'
Requires-Dist: sphinx-rtd-theme (>=0.5.1) ; extra == 'dev'
Requires-Dist: nbsphinx (>=0.8.0) ; extra == 'dev'
Requires-Dist: recommonmark (>=0.7.1) ; extra == 'dev'
Requires-Dist: parameterized ; extra == 'dev'
Requires-Dist: pre-commit (>=2.9.3) ; extra == 'dev'
Requires-Dist: sphinx-autobuild ; extra == 'dev'


# Data-Centric AI Benchmark

This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.

It features a growing list of *tasks*:

* Minimal data cleaning (`miniclean`)
* Task-specific Label Correction (`labelfix`)
* Discovery of validation Error Modalities (`errmod`)
* Minimal training dataset selection (`minitrain`)

Each task features a collection of *scenarios* which are defined by datasets and ML pipeline elements (e.g. a model, feature pre-processors, etc.)

## Basic Usage

The very first step is to install the PyPI package:

```bash
pip install dcai
```

Then, we advise using Jupyter notebooks or some other interactive environment. You start off by importing the library and listing all the available artefacts:

```python
from dcai import scenarios

scenarios.list()
```

You can then load a specific scenario and view its *artefacts*:

```python
scenario = scenarios.get("miniclean/bank")
scenario.artefacts
```

In the above example we are loading the `bank` scenario of the `miniclean` task. We can then load all the artefacts into a dictionary:

```python
a = scenario.artefacts.load()
```

This automatically downloads all the available artefacts, saves a local copy and loads it into memory. Artefacts can be accessed directly from the dictionary. We can then go ahead and write the code that will provide us with a scenario-specific solution:

```python
model.fit(a["X_train_dirty"], a["y_train"])

X_train_selection = ...
```

Once we have an object (e.g. `X_train_selection`) containing the scenario-specific solution, we can package it into a solution object:

```python
solution = scenario.solve(X_train_selection=X_train_selection)
```

We can then perform an evaluation on that solution that will give us the result:

```python
solution.evaluate()
solution.result
```

After you're happy with the obtained result, you can bundle your solution artefacts and see their location.

```python
solution.save()
solution.location
```

After obtaining the `/path/to/your/artefacts` you can upload it as a bundle to [CodaLab](https://codalab.org/):

```bash
cl upload /path/to/your/artefacts
```

This command will display the URL of your uploaded bundle. It assumes that you have a user account on CodaLab ([click here](https://codalab-worksheets.readthedocs.io/en/latest/features/bundles/uploading/) for more info).

After that, you simply go to our [FORM LINK](#), fill it in with all required details and paste the bundle link so we can run a full evaluation on it.

Congratulations! Your solution is now uploaded to our system and after evaluation it will show up on the [leaderboard](#).

## Adding a Submitted solution to the Repo

This step is performed manually by us (although it could be possible to automate). It looks like this:

```bash
dcai add-solution \
    --scenario miniclean/bank \
    --name MySolution \
    --paper https://arxiv.org/abs/... \
    --code https://github.com/... \
    --artefacts-url https://worksheets.codalab.org/rest/bundles/...
```

## Performing the Full Evaluation

This step is performed by GitHub Actions and is triggered after each commit.

```bash
dcai evaluate --leaderboard-output /path/to/leaderboard/dir
```


