Metadata-Version: 2.1
Name: datavalid
Version: 0.0.1
Summary: Data validation library
Home-page: https://github.com/pckhoi/datavalid
Author: Khoi Pham
Author-email: pckhoi@gmail.com
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/pckhoi/datavalid/issues
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy (>=1.18)
Requires-Dist: pandas (>=1.2)
Requires-Dist: pyyaml (>=5.4.1)
Requires-Dist: termcolor (>=1.1.0)

# Datavalid

This library lets you declare validation tasks that check CSV files. It helps ensure data correctness in ETL pipelines that update frequently.

## Installation

```bash
pip install datavalid
```

## Usage

Create a `datavalid.yml` file in your data folder:

```yaml
files:
  fuse/complaint.csv:
    - name: '`complaint_uid` should be unique per `allegation` x `uid`'
      unique:
        - complaint_uid
        - uid
        - allegation
    - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
      empty:
        and:
          - column: allegation_finding
            op: equal
            value: sustained
          - column: disposition
            op: not_equal
            value: sustained
  fuse/event.csv:
    - name: no officer with consecutive left date
      where:
        column: kind
        op: equal
        value: officer_left
      group_by: uid
      no_consecutive_date:
        date_from:
          year: year
          month: month
          day: day
```
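To make the intent of the `unique` and `empty` tasks more concrete, here is a minimal sketch of what those two checks on `fuse/complaint.csv` appear to assert. It is not datavalid's implementation: it uses only the Python standard library, and the sample rows and the helper names `duplicated_keys` and `matching_rows` are hypothetical, introduced purely for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for fuse/complaint.csv.
SAMPLE = """complaint_uid,uid,allegation,allegation_finding,disposition
c1,u1,rudeness,sustained,sustained
c1,u1,rudeness,sustained,exonerated
c2,u2,neglect,unfounded,unfounded
"""


def duplicated_keys(rows, columns):
    """Return the value combinations of `columns` that occur more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def matching_rows(rows, conditions):
    """Return rows that satisfy every (column, op, value) condition."""
    ops = {
        "equal": lambda a, b: a == b,
        "not_equal": lambda a, b: a != b,
    }
    return [
        row
        for row in rows
        if all(ops[op](row[col], val) for col, op, val in conditions)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Assumed reading of `unique`: no combination of these columns repeats.
dupes = duplicated_keys(rows, ["complaint_uid", "uid", "allegation"])

# Assumed reading of `empty` with `and`: no row matches all conditions.
violations = matching_rows(
    rows,
    [
        ("allegation_finding", "equal", "sustained"),
        ("disposition", "not_equal", "sustained"),
    ],
)

print("duplicated keys:", dupes)
print("rows violating the sustained rule:", violations)
```

Under this reading, a task passes when its list comes back empty. With the sample data above, both checks would fail: one key combination is duplicated, and one row has a sustained finding with a different disposition.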

Then run the `datavalid` command in that folder:

```bash
python -m datavalid
```

You can also specify a data folder that isn't the current working directory:

```bash
python -m datavalid --dir my_data_folder
```


