Metadata-Version: 2.1
Name: AutoPrep
Version: 2.0.0
Summary: AutoPrep is an automated preprocessing pipeline with univariate anomaly marking
Author-email: Jörg Adelhelm <adeljoe@gmx.de>
License: MIT
Project-URL: Homepage, https://github.com/JAdelhelm/AutoPrep
Project-URL: Issues, https://github.com/JAdelhelm/AutoPrep/issues
Keywords: anomaly-detection,preprocessing,automated,automated-preprocessing,cleaning
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: scikit-learn
Requires-Dist: category_encoders
Requires-Dist: bitstring
Requires-Dist: pandas
Requires-Dist: numpy

# AutoPrep - Automated Preprocessing Pipeline with Univariate Anomaly Marking
![PyPIv](https://img.shields.io/pypi/v/AutoPrep)
![PyPI status](https://img.shields.io/pypi/status/AutoPrep)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/AutoPrep) ![PyPI - License](https://img.shields.io/pypi/l/AutoPrep)
<!-- [![Downloads](https://static.pepy.tech/badge/AutoPrep)](https://pepy.tech/project/AutoPrep) -->



This pipeline focuses on data preprocessing, standardization, and cleaning, with additional features to identify univariate anomalies.
<a href="https://html-preview.github.io/?url=https://raw.githubusercontent.com/JAdelhelm/AutoPrep/main/visualization/PipelineStructure.html" target="_blank">Structure of Preprocessing Pipeline</a>

```bash
pip install AutoPrep
```
#### Dependencies
- scikit-learn
- category_encoders
- bitstring
- pandas
- numpy
- ydata_profiling (optional, for profiling reports)


## Basic Usage
To use this pipeline, import the required libraries and initialize the AutoPrep pipeline. Here is a basic example:

````python
############## dummy data #############
import pandas as pd
data = {
    'ID': [1, 2, 3, 4],                 
    'Name': ['Alice', 'Bob', 'Charlie', 'R2D2'],  
    'Age': [25, 30, 35, 90],                 
    'Salary': [50000.00, 60000.50, 75000.75, 80000.00], 
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']), 
    'Is Manager': [False, True, False, True]  
}
data = pd.DataFrame(data)
########################################


from AutoPrep import AutoPrep

pipeline = AutoPrep(
    nominal_columns=["ID", "Name", "Is Manager"],
    datetime_columns=["Hire Date"],
    pattern_recognition_columns=["Name"]
)
X_output = pipeline.preprocess(df=data)

# pipeline.get_profiling(X=data)
# pipeline.visualize_pipeline_structure_html()
````

The resulting DataFrame can then be inspected:

````python
X_output

> Output:
    col_1  col_2  ...   col_n
1   data   ...    ...   data   
2   data   ...    ...   data  
... ...    ...    ...   ...   
````

## Highlights ⭐


#### 📌 Implementation of univariate methods / *Detection of univariate anomalies*
Both methods (the modified Z-score and the Tukey method) are resilient to outliers, because their underlying location estimate (the median) is not biased by extreme values. They also support multivariate anomaly detection algorithms by pre-flagging univariate anomalies.
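As a generic sketch of these two checks (not AutoPrep's internal implementation), both can be written in a few lines of NumPy; the cutoffs 3.5 and 1.5 are the conventional defaults and are assumptions here:

```python
import numpy as np

def modified_z_scores(x, threshold=3.5):
    """Flag values whose modified Z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), both of
    which are robust to outliers, instead of the mean and std."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    scores = 0.6745 * (x - median) / mad  # 0.6745 ~ 75th percentile of N(0,1)
    return np.abs(scores) > threshold

def tukey_fences(x, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

ages = np.array([25, 30, 35, 90])
print(modified_z_scores(ages))  # only the extreme value 90 is flagged
print(tukey_fences(ages))
```

Because both statistics are computed from the median and quartiles, a single extreme value (here, 90) cannot drag the threshold toward itself.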

#### 📌 BinaryEncoder instead of OneHotEncoder for nominal columns / *Big Data and Performance*
   Recent research shows comparable model performance when encoding nominal columns this way, with significantly fewer output dimensions.
   - (John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." In: Journal of Big Data 7.1 (2020), pp. 1–41.), Tables 2, 4
   - (Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." In: World Conference on Information Systems and Technologies. 2021, pp. 146–155.), P. 151

#### 📌 Transformation of time series data and standardization of data with RobustScaler / *Normalization for better prediction results*
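A minimal sketch of what robust standardization does, using scikit-learn's `RobustScaler` directly (the data are illustrative, not AutoPrep's output):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One ordinary cluster of values plus one extreme outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler centers on the median and scales by the IQR, so the
# outlier does not distort the scaling of the ordinary values.
scaled = RobustScaler().fit_transform(values)
print(scaled.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

With a standard (mean/std) scaler, the outlier would shift the mean and inflate the standard deviation, compressing the ordinary values toward each other.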

#### 📌 Labeling of NaN values in an extra column instead of removing them / *No loss of information*
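The idea can be sketched in plain pandas (a generic illustration; AutoPrep's actual column naming and imputation strategy may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, np.nan, 35, 90]})

# Instead of dropping rows with missing values, mark them in an extra
# boolean column, then impute (here: with the median) so no row is lost.
df["Age_missing"] = df["Age"].isna()
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df)
```

The indicator column preserves the fact that a value was missing, which can itself be a useful signal for anomaly detection.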



---





## Pipeline - Built-in Logic
<!-- ![Logic of Pipeline](./images/decision_rules.png) -->
![Logic of Pipeline](https://raw.githubusercontent.com/JAdelhelm/AutoPrep/main/AutoPrep/img/decision_rules.png) 





<!-- ## Abstract View (Code Structure) -->
<!-- ![Abstract view of the project](./images/project.png) -->
<!-- ![Abstract view of the project](https://raw.githubusercontent.com/JAdelhelm/AutoPrep/main/images/project.png) -->




---

---

## Feel free to contribute 🙂

### Reference
- https://www.researchgate.net/publication/379640146_Detektion_von_Anomalien_in_der_Datenqualitatskontrolle_mittels_unuberwachter_Ansatze (German Thesis)

### Further Information

- I used sklearn's Pipeline and Transformer concepts to build this preprocessing pipeline
    - Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
    - Transformer: https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html
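A minimal custom transformer in that style (a toy sketch, not part of AutoPrep's code base) shows how the two concepts fit together:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class NaNMarker(BaseEstimator, TransformerMixin):
    """Toy transformer: adds a boolean *_missing column per input column."""

    def fit(self, X, y=None):
        # Nothing to learn for this transformer.
        return self

    def transform(self, X):
        X = X.copy()
        for col in X.columns:
            X[f"{col}_missing"] = X[col].isna()
        return X

# Transformers compose into a Pipeline, which runs them in order.
pipe = Pipeline([("mark_nans", NaNMarker())])
out = pipe.fit_transform(pd.DataFrame({"Age": [25, None, 35]}))
print(out.columns.tolist())  # ['Age', 'Age_missing']
```

Any object implementing `fit` and `transform` can be dropped into such a pipeline, which is what makes the preprocessing steps composable and reorderable.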

