Metadata-Version: 2.1
Name: datatracer
Version: 0.0.5.dev0
Summary: Data Lineage Tracing Library
Home-page: https://github.com/HDI-Project/DataTracer
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Keywords: datatracer data-tracer Data Tracer
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.5,<3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas (<0.25,>=0.23.4)
Requires-Dist: scikit-learn (<0.21,>=0.20.0)
Requires-Dist: numpy (<1.17,>=1.15.2)
Requires-Dist: mlblocks (==0.3.4)
Requires-Dist: metad (==0.0.1)
Requires-Dist: falcon (<3,>=2.0.0)
Requires-Dist: hug (<3,>=2.6.1)
Requires-Dist: pyyaml (<6,>=5.3.1)
Requires-Dist: tqdm (<5,>=4.46.1)
Provides-Extra: dev
Requires-Dist: bumpversion (<0.6,>=0.5.3) ; extra == 'dev'
Requires-Dist: pip (>=9.0.1) ; extra == 'dev'
Requires-Dist: watchdog (<0.11,>=0.8.3) ; extra == 'dev'
Requires-Dist: m2r (<0.3,>=0.2.0) ; extra == 'dev'
Requires-Dist: nbsphinx (<0.7,>=0.5.0) ; extra == 'dev'
Requires-Dist: Sphinx (<3,>=1.7.1) ; extra == 'dev'
Requires-Dist: sphinx-rtd-theme (<0.5,>=0.2.4) ; extra == 'dev'
Requires-Dist: autodocsumm (>=0.1.10) ; extra == 'dev'
Requires-Dist: flake8 (<4,>=3.7.7) ; extra == 'dev'
Requires-Dist: isort (<5,>=4.3.4) ; extra == 'dev'
Requires-Dist: autoflake (<2,>=1.1) ; extra == 'dev'
Requires-Dist: autopep8 (<2,>=1.4.3) ; extra == 'dev'
Requires-Dist: twine (<4,>=1.10.0) ; extra == 'dev'
Requires-Dist: wheel (>=0.30.0) ; extra == 'dev'
Requires-Dist: coverage (<6,>=4.5.1) ; extra == 'dev'
Requires-Dist: tox (<4,>=2.9.1) ; extra == 'dev'
Requires-Dist: pytest (>=3.4.2) ; extra == 'dev'
Requires-Dist: pytest-cov (>=2.6.0) ; extra == 'dev'
Requires-Dist: jupyter (<2,>=1.0.0) ; extra == 'dev'
Requires-Dist: rundoc (<0.5,>=0.4.3) ; extra == 'dev'
Provides-Extra: test
Requires-Dist: pytest (>=3.4.2) ; extra == 'test'
Requires-Dist: pytest-cov (>=2.6.0) ; extra == 'test'
Requires-Dist: jupyter (<2,>=1.0.0) ; extra == 'test'
Requires-Dist: rundoc (<0.5,>=0.4.3) ; extra == 'test'

<p align="left">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
<i>An open source project from Data to AI Lab at MIT.</i>
</p>

[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
[![PyPI Shield](https://img.shields.io/pypi/v/datatracer.svg)](https://pypi.python.org/pypi/datatracer)
[![Downloads](https://pepy.tech/badge/datatracer)](https://pepy.tech/project/datatracer)
[![Run Tests](https://github.com/data-dev/DataTracer/workflows/Run%20Tests/badge.svg)](https://github.com/data-dev/DataTracer/actions)

# DataTracer

Data Lineage Tracing Library

* License: [MIT](https://github.com/data-dev/DataTracer/blob/master/LICENSE)
* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
* Homepage: https://github.com/data-dev/DataTracer

## Overview

DataTracer is a Python library for solving Data Lineage problems using statistical
methods, machine learning techniques, and hand-crafted heuristics.

Currently, the DataTracer library implements discovery of the following properties:

* **Primary Key**: Identify which column is the primary key in each table.
* **Foreign Key**: Find which relationships exist between the tables.
* **Column Mapping**: Given a field in a table, deduce which other fields, from the same table
  or from other tables, are most closely related to it or contributed the most to generating it.
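
To make the first of these properties concrete, here is a minimal sketch of a hand-crafted
primary key heuristic: a column is a candidate if it has no missing values and no repeated
values. This is an illustration only and is not the algorithm used by the DataTracer solvers;
the `primary_key_candidates` function and the toy `customers` rows are hypothetical.

```python
def primary_key_candidates(rows):
    """Return the columns whose values are all present and all distinct.

    `rows` is a list of dicts, one per table row. This is a naive,
    illustrative heuristic, not DataTracer's implementation.
    """
    if not rows:
        return []

    candidates = []
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        # A primary key cannot contain missing or duplicated values.
        if None not in values and len(set(values)) == len(values):
            candidates.append(column)

    return candidates


# Toy table invented for this example.
customers = [
    {"customerNumber": 103, "country": "France"},
    {"customerNumber": 112, "country": "USA"},
    {"customerNumber": 114, "country": "France"},
]

print(primary_key_candidates(customers))
```

Uniqueness alone is a weak signal, since any column can be unique by chance in a small table,
which is one reason to combine heuristics with statistical and machine learning methods.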

### REST API

The DataTracer library also incorporates a REST API that enables interaction with the DataTracer
Solvers via HTTP communication. You can find its documentation [here](rest).

# Install

## Requirements

**DataTracer** has been developed and tested on [Python 3.5, 3.6 and 3.7](https://www.python.org/downloads/).

Also, although it is not strictly required, the usage of a [virtualenv](
https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
interfering with other software installed in the system where **DataTracer** is run.

## Install with pip

The easiest and recommended way to install **DataTracer** is using [pip](
https://pip.pypa.io/en/stable/):

```bash
pip install datatracer
```

This will pull and install the latest stable release from [PyPI](https://pypi.org/).

If you want to install from source or contribute to the project please read the
[Contributing Guide](https://hdi-project.github.io/DataTracer/contributing.html#get-started).


# Data Format: Datasets and Metadata

The DataTracer library works with datasets, each of which is a collection of tables
loaded as `pandas.DataFrame` objects plus a MetaData JSON that provides information about the
dataset structure.

You can find more information about the MetaData format in the [MetaData repository](
https://github.com/signals-dev/MetaData).

DataTracer also includes a few [demo datasets](datatracer/datasets), which you can easily
download to your computer using the `datatracer.get_demo_data` function:

```python3
from datatracer import get_demo_data

get_demo_data()
```

This will create a folder called `datatracer_demo` in your working directory with a few
datasets ready to use inside it.

# Quickstart

In this short tutorial we will guide you through a series of steps that will help you
get started with **DataTracer**.

## Load data

The first step will be to load the data in the format expected by DataTracer.

For this, we can use the `datatracer.load_dataset` function, passing the path to
the dataset folder.

For example, if we want to use the `classicmodels` dataset included in the demo folder
that we just created we can load it using:

```python3
from datatracer import load_dataset

metadata, tables = load_dataset('datatracer_demo/classicmodels')
```

This will return a tuple which contains:

* A `MetaData` instance with details about the dataset.
* A `dict` with all the tables of the dataset loaded as a `pandas.DataFrame`.

## Select a Solver

In the DataTracer project, the different Data Lineage problems are solved using what we
call _solvers_.

We can see the list of available solvers using the `get_solvers` function:

```python3
from datatracer import get_solvers

get_solvers()
```

which will return a list with their names:

```
['datatracer.column_map',
 'datatracer.foreign_key.basic',
 'datatracer.foreign_key.standard',
 'datatracer.primary_key.basic']
```

## Use a DataTracer instance to find table relationships

In order to use the selected solver you will need to load it using the `DataTracer` class.

In this example, we will try to figure out the relationships between the tables in our dataset
by using the solver `datatracer.foreign_key.standard`.

```python3
from datatracer import DataTracer

# Load the Solver
solver = DataTracer.load('datatracer.foreign_key.standard')

# Solve the Data Lineage problem
foreign_keys = solver.solve(tables)
```

The result will be a list of dictionaries, one for each foreign key candidate:

```
[{'table': 'products',
  'field': 'productLine',
  'ref_table': 'productlines',
  'ref_field': 'productLine'},
 {'table': 'payments',
  'field': 'customerNumber',
  'ref_table': 'customers',
  'ref_field': 'customerNumber'},
 {'table': 'orders',
  'field': 'customerNumber',
  'ref_table': 'customers',
  'ref_field': 'customerNumber'},
 {'table': 'orderdetails',
  'field': 'productCode',
  'ref_table': 'products',
  'ref_field': 'productCode'},
 {'table': 'orderdetails',
  'field': 'orderNumber',
  'ref_table': 'orders',
  'ref_field': 'orderNumber'},
 {'table': 'employees',
  'field': 'officeCode',
  'ref_table': 'offices',
  'ref_field': 'officeCode'}]
```
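
Because the result is plain Python data, it can be post-processed without any extra tooling.
As a minimal sketch, the snippet below groups the candidates by the table they reference. The
`group_by_ref_table` helper is hypothetical, and the `foreign_keys` list is copied from the
output above so the example is self-contained.

```python
from collections import defaultdict

# Candidates copied from the example output above.
foreign_keys = [
    {'table': 'products', 'field': 'productLine',
     'ref_table': 'productlines', 'ref_field': 'productLine'},
    {'table': 'payments', 'field': 'customerNumber',
     'ref_table': 'customers', 'ref_field': 'customerNumber'},
    {'table': 'orders', 'field': 'customerNumber',
     'ref_table': 'customers', 'ref_field': 'customerNumber'},
    {'table': 'orderdetails', 'field': 'productCode',
     'ref_table': 'products', 'ref_field': 'productCode'},
    {'table': 'orderdetails', 'field': 'orderNumber',
     'ref_table': 'orders', 'ref_field': 'orderNumber'},
    {'table': 'employees', 'field': 'officeCode',
     'ref_table': 'offices', 'ref_field': 'officeCode'},
]


def group_by_ref_table(candidates):
    """Map each referenced table to the (table, field) pairs pointing at it."""
    grouped = defaultdict(list)
    for fk in candidates:
        grouped[fk['ref_table']].append((fk['table'], fk['field']))
    return dict(grouped)


references = group_by_ref_table(foreign_keys)

# Print one line per referenced table, in alphabetical order.
for ref_table, sources in sorted(references.items()):
    print(ref_table, '<-', sources)
```

A grouping like this makes it easy to spot which tables act as hubs in the schema; here,
`customers` is referenced by both `payments` and `orders`.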

# What's next?

You can learn more about the DataTracer features in the [notebook tutorials](tutorials).

Also don't forget to have a look at the DataTracer [REST API](rest).


# History

## 0.0.4 - 2020-06-05

* Add initial version of pretrained solvers
* Reorganize ColumnMapSolver code tree
* Add REST API to access DataTracer solvers via HTTP

## 0.0.3 - 2020-05-28

* Finish Column Mapping and add tutorial
* Minor refactoring and adding docstrings
* Fix testing config

## 0.0.2 - 2020-05-26

* Curate configuration and dependencies

## 0.0.1 - 2020-05-22

First release.

Features:

* Primary Key Detection
* Foreign Key Detection
* Column Mapping


