Metadata-Version: 2.1
Name: drift_shield
Version: 0.5
Summary: A package to monitor and track data drift for ML models
Author: Shanmukh Dara
Author-email: shanmukhdara@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: xgboost
Requires-Dist: scikit-learn
Requires-Dist: shap

# DriftShield

**DriftShield** is a Python package for detecting and handling data drift in machine learning pipelines. It compares the distributions of numeric and non-numeric data between training and scoring datasets, flags drift, and can replace problematic values with predefined defaults. Its built-in outlier handling and statistical tests help keep scoring data consistent with training data and reduce the risk of performance degradation caused by unseen data changes.

## Features

- Detects data drift in non-numeric, numeric, and boolean columns.
- Handles outliers when calculating means for numeric data.
- Compares 25th, 50th, and 75th percentiles for numeric columns.
- Tracks changes in proportions for boolean columns.
- Provides mechanisms to replace drifted values with default values.
- Customizable exclusion of columns from drift detection.
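
As a rough illustration of the numeric and boolean checks listed above, the sketch below computes the 25th, 50th, and 75th percentiles, a mean that ignores outliers, and the share of true values in a boolean column. It is not DriftShield's implementation: the function names, the 1.5 × IQR outlier rule, and the use of plain lists in place of DataFrame columns are all assumptions made for the example.

```python
import statistics


def numeric_summary(values, iqr_factor=1.5):
    """Quartiles plus a mean computed after dropping IQR outliers (assumed rule)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    # Keep only values inside the IQR fences before averaging.
    kept = [v for v in values if low <= v <= high]
    return {"p25": q1, "p50": q2, "p75": q3, "mean": statistics.fmean(kept)}


def bool_proportion(values):
    """Share of truthy entries in a boolean column."""
    return sum(1 for v in values if v) / len(values)


summary = numeric_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
share_true = bool_proportion([True, False, True, True])
```

Comparing such summaries between a training and a scoring dataset is one simple way to spot a shifted numeric or boolean column.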

## Installation

To install **DriftShield**, you can clone the repository and install it using `pip`:

```bash
git clone <>
cd driftshield
pip install .
```

Alternatively, once the package has been published to PyPI, you can install it directly:

```bash
pip install drift_shield
```

## Usage

**DriftShield** can be used to monitor and handle drift between training and scoring datasets. Here's a quick guide on how to use it:

### 1. Import the package

```python
from drift_shield import data_drift, handle_data_drift
```

### 2. Detect Data Drift

In **training mode**, the function stores distinct values and summary statistics for numeric and boolean columns in the buffer directory.

```python
data_drift('my_dataset', 'training', training_df, './buffer_dir', exclusions=['column_to_exclude'])
```

In **scoring mode**, it compares the scoring data against the statistics stored in the buffer to detect drift.

```python
data_drift('my_dataset', 'scoring', scoring_df, './buffer_dir', exclusions=['column_to_exclude'])
```
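
The training/scoring pattern can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the package's code: the function name `distinct_value_drift`, the JSON file name, and the dict-of-lists input (standing in for a DataFrame) are hypothetical.

```python
import json
import os
import tempfile


def distinct_value_drift(name, mode, column_values, buffer_dir):
    """Store distinct values in training mode; report unseen ones in scoring mode.

    `column_values` maps a column name to a list of its values.
    The file name and layout are illustrative assumptions.
    """
    path = os.path.join(buffer_dir, f"{name}_distinct.json")
    if mode == "training":
        snapshot = {col: sorted(set(vals)) for col, vals in column_values.items()}
        os.makedirs(buffer_dir, exist_ok=True)
        with open(path, "w") as fh:
            json.dump(snapshot, fh)
        return {}
    # Scoring mode: load the training snapshot and look for new values.
    with open(path) as fh:
        snapshot = json.load(fh)
    unseen = {}
    for col, vals in column_values.items():
        new_values = set(vals) - set(snapshot.get(col, []))
        if new_values:
            unseen[col] = sorted(new_values)
    return unseen


buffer_dir = tempfile.mkdtemp()
distinct_value_drift("my_dataset", "training", {"city": ["Paris", "Rome"]}, buffer_dir)
drift = distinct_value_drift("my_dataset", "scoring", {"city": ["Paris", "Oslo"]}, buffer_dir)
```

Here `drift` lists, per column, the values that appear in the scoring data but were never seen during training.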

### 3. Handle Drift

If drift is detected, you can replace drifted values with values from a default DataFrame.

```python
updated_df = handle_data_drift('my_dataset', scoring_df, './buffer_dir', default_replacements_df, exclusions=['column_to_exclude'])
```
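
Conceptually, handling drift means swapping any value that was not seen during training for a safe default. The sketch below shows that idea on a list of row dictionaries; `replace_unseen`, `known_values`, and `defaults` are hypothetical names, and the real `handle_data_drift` may work differently.

```python
def replace_unseen(rows, known_values, defaults):
    """Return copies of `rows` with unseen categorical values replaced by defaults."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left untouched
        for col, allowed in known_values.items():
            if col in fixed and fixed[col] not in allowed:
                fixed[col] = defaults[col]
        cleaned.append(fixed)
    return cleaned


scoring_rows = [{"city": "Paris", "plan": "gold"}, {"city": "Oslo", "plan": "basic"}]
known = {"city": {"Paris", "Rome"}, "plan": {"basic", "gold"}}
cleaned_rows = replace_unseen(scoring_rows, known, {"city": "Rome", "plan": "basic"})
```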

### 4. Delete Drift Dump

To remove a stored drift file if you need to reset or rerun:

```python
from drift_shield import delete_drift_dump

delete_drift_dump('my_dataset', './buffer_dir', type='data_drift')
```

### 5. Feature Importance Drift

This function tracks feature importance between the training and scoring phases and detects changes in it.

```python
from drift_shield import feature_importance_drift
```

In **training mode**, the function stores the feature importances and column names in the JSON dump.

```python
feature_importance_drift('my_dataset', 'training', model, df_training, './buffer_dir', target_column='target')
```

In **scoring mode**, this function compares the feature importance stored during training with the feature importance computed on the scoring data to detect significant drift. It uses SHAP values and supports models such as RandomForest, XGBoost, and LinearRegression.

```python
feature_importance_drift('my_dataset', 'scoring', model, df_scoring, './buffer_dir')
```
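
One way to think about the comparison step: normalize each set of importances so they sum to 1, then flag features whose share moved by more than some tolerance. The sketch below assumes the importances are already available as dictionaries (for example, mean absolute SHAP values per feature). The `importance_drift` name and the `tolerance` value are hypothetical and do not reflect the package's actual criterion.

```python
def importance_drift(train_importance, score_importance, tolerance=0.1):
    """Flag features whose normalized importance share changed by more than `tolerance`."""

    def normalize(importance):
        total = sum(abs(v) for v in importance.values())
        if total == 0:
            raise ValueError("importances must not all be zero")
        return {k: abs(v) / total for k, v in importance.items()}

    train = normalize(train_importance)
    score = normalize(score_importance)
    drifted = {}
    for feature in sorted(set(train) | set(score)):
        # A feature missing on one side counts as having zero importance there.
        delta = score.get(feature, 0.0) - train.get(feature, 0.0)
        if abs(delta) > tolerance:
            drifted[feature] = round(delta, 4)
    return drifted


drifted = importance_drift(
    {"age": 5, "income": 3, "tenure": 2},
    {"age": 2, "income": 6, "tenure": 2},
)
```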

### 6. Monitor Data Volume Over Time

This function tracks the data volume over successive scoring phases and logs any significant changes based on a specified threshold (default is 20%).

```python
from drift_shield import monitor_data_volume_over_time
monitor_data_volume_over_time('my_dataset', df_scoring, './buffer_dir', threshold=0.2)
```
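
The threshold check itself amounts to a relative-change comparison between consecutive row counts. A minimal sketch, assuming a strict "greater than" comparison and a hypothetical helper name `volume_changed` (the package may treat the boundary case differently):

```python
def volume_changed(previous_rows, current_rows, threshold=0.2):
    """Return True when the row count changed by more than `threshold` (relative)."""
    if previous_rows == 0:
        # No baseline to divide by: treat any new data as a change.
        return current_rows > 0
    return abs(current_rows - previous_rows) / previous_rows > threshold


big_jump = volume_changed(1000, 1250)    # 25% more rows
small_jump = volume_changed(1000, 1100)  # 10% more rows
```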

## Example Workflow

1. **Training Phase:**
   - Store distinct values and statistics:
   ```python
   data_drift('my_training_data', 'training', training_df, './buffer')
   ```

2. **Scoring Phase:**
   - Compare scoring data to the training statistics:
   ```python
   data_drift('my_training_data', 'scoring', scoring_df, './buffer')
   ```

3. **Handling Drift:**
   - Replace drifted values with defaults:
   ```python
   updated_df = handle_data_drift('my_training_data', scoring_df, './buffer', default_replacements_df)
   ```

## To make changes to this package:
1. Clone the repository, make your changes, update the requirements and `setup.py`, then test and validate.
2. Increment the version number in `setup.py`.
3. From the root folder of the package, run `pip install .`.
4. Run `pip install twine`.
5. Run `python setup.py sdist bdist_wheel`.
6. Run `twine upload 'dist/*'` and provide your PyPI credentials.
7. Push the changes to the git repo.
