Metadata-Version: 2.1
Name: datacleanbot
Version: 0.8
Summary: automated data cleaning tool
Home-page: https://github.com/Ji-Zhang/datacleanbot
Author: Ji Zhang
Author-email: 
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Description-Content-Type: text/markdown
Requires-Dist: numpy (>=1.14.2)
Requires-Dist: pandas (<=0.23.1,>=0.19.0)
Requires-Dist: scikit-learn (>=0.20.0)
Requires-Dist: scipy (>=1.2.1)
Requires-Dist: seaborn (>=0.8)
Requires-Dist: matplotlib (>=2.2.2)
Requires-Dist: missingno (==0.4.1)
Requires-Dist: openml (>=0.9.0)
Requires-Dist: fancyimpute (==0.4.0)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# datacleanbot
An automated data cleaning tool.
The main goal is to develop a Python tool, ``datacleanbot``, such that:
    Given an arbitrary parsed raw dataset representing a supervised learning problem, the tool automatically identifies potential issues and reports the results and recommendations to the end user in an effective way.

## Install

```sh
$ pip install datacleanbot
```

## QuickStart

### Install OpenML (version 0.9.0 or later):

OpenML is used to easily import datasets and share models and experiments.

```sh
$ pip install openml
```

On Windows, a C++ compiler must be installed.

### Acquire data from OpenML:

    >>> import numpy as np
    >>> import openml as oml
    >>> data = oml.datasets.get_dataset(id)  # id: OpenML dataset ID
    >>> X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
    >>> # Append the target y as the last column of X
    >>> Xy = np.concatenate((X, y.reshape((y.shape[0], 1))), axis=1)
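
To make the concatenation step concrete without downloading anything, here is a minimal sketch that uses small synthetic arrays in place of the OpenML data. The values and shapes are assumptions chosen for illustration; only the `np.concatenate` call mirrors the snippet above.

```python
import numpy as np

# Synthetic stand-ins for the arrays returned by data.get_data (assumed shapes).
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 samples, 2 features
y = np.array([0.0, 1.0, 0.0])                        # 3 target values

# Reshape y into a column vector and append it as the last column of X.
Xy = np.concatenate((X, y.reshape((y.shape[0], 1))), axis=1)

print(Xy.shape)  # (3, 3)
```

The target ends up in the last column of `Xy`, which is the layout passed on to the cleaning step.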

### Autoclean data with datacleanbot:

    >>> import datacleanbot.dataclean as dc
    >>> Xy = dc.autoclean(Xy, data.name, features)


## Description

``datacleanbot`` is equipped with the following capabilities:
* Present an overview report of the given dataset
    * The most important features
    * Statistical information (e.g., mean, max, min)
    * Data types of features
* Clean common data problems in the raw dataset
    * Duplicated records
    * Inconsistent column names
    * **Missing values**
    * **Outliers**

The two problems for which ``datacleanbot`` provides substantive automation, missing values and outliers, are marked in bold.
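
To illustrate what these data problems look like in practice, the following is a rough NumPy sketch that counts duplicated records, missing values, and outliers in a small array. It is not how ``datacleanbot`` works internally: the helper `summarize_issues` is hypothetical, and the 1.5 × IQR outlier rule is a generic textbook heuristic chosen here only for illustration.

```python
import numpy as np


def summarize_issues(Xy):
    """Count duplicated rows, missing values, and IQR outliers (hypothetical helper)."""
    Xy = np.asarray(Xy, dtype=float)

    # Duplicated records: every repeat of a row that was already seen.
    seen = set()
    n_duplicates = 0
    for row in Xy.tolist():
        key = tuple(row)
        if key in seen:
            n_duplicates += 1
        else:
            seen.add(key)

    # Missing values: NaN entries anywhere in the array.
    n_missing = int(np.isnan(Xy).sum())

    # Outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column,
    # with quartiles computed while ignoring NaN entries.
    q1, q3 = np.nanpercentile(Xy, [25, 75], axis=0)
    iqr = q3 - q1
    outlier_mask = (Xy < q1 - 1.5 * iqr) | (Xy > q3 + 1.5 * iqr)
    n_outliers = int(outlier_mask.sum())

    return {"duplicates": n_duplicates, "missing": n_missing, "outliers": n_outliers}


# Toy data: one repeated row, one NaN, and one extreme value in the first column.
toy = np.array([
    [1.0, 10.0],
    [2.0, 11.0],
    [2.0, 11.0],
    [3.0, np.nan],
    [4.0, 12.0],
    [100.0, 13.0],
])

report = summarize_issues(toy)
print(report)  # {'duplicates': 1, 'missing': 1, 'outliers': 1}
```

Counting issues like this is only a first diagnostic step; choosing how to impute the missing entries or treat the flagged outliers is the harder part.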

## User's Guide

The user's guide can be found at [datacleanbot](https://datacleanbot.readthedocs.io/en/latest/).


