Metadata-Version: 2.1
Name: upgini
Version: 0.10.0a106
Summary: Features search library for supervised machine learning on tabular data
Home-page: https://upgini.com/
Author: Upgini Developers
Author-email: madewithlove@upgini.com
License: BSD 3-Clause License
Project-URL: Bug Reports, https://github.com/upgini/upgini/issues
Project-URL: Source, https://github.com/upgini/upgini
Keywords: data science,machine learning,data mining,automl,data search
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Customer Service
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.7,<4
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-dateutil (>=2.8.0)
Requires-Dist: requests (>=2.8.0)
Requires-Dist: pandas (>=1.1.0)
Requires-Dist: numpy (>=1.19.0)
Requires-Dist: imbalanced-learn (>=0.9.0)
Requires-Dist: pydantic (>=1.8.2)
Requires-Dist: fastparquet (>=0.7.1)
Requires-Dist: yaspin (>=2.1.0)
Requires-Dist: python-json-logger (>=2.0.2)
Requires-Dist: catboost (>=1.0.3)
Requires-Dist: lightgbm (>=3.0.0)

<h2 align="center"> <a href="https://upgini.com/">Upgini</a> : low-code feature search and enrichment library for machine learning</h2>
<p align="center"> <b>Automatically searches through thousands of ready-to-use features from public and community data sources and enriches your dataset with new external features in minutes</b> </p>
<p align="center">
	<br />
    <a href="https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb"><strong>Quick Start in Colab »</strong></a> |
    <a href="https://upgini.com/">Upgini.com</a> |
    <a href="https://profile.upgini.com">Sign In</a> |
    <a href="https://upgini.slack.com/messages/C02MW49ADSN">Slack Community</a> 
 </p>

[![license](https://img.shields.io/badge/license-BSD--3%20Clause-green)](/LICENSE)
[![Python version](https://img.shields.io/badge/python_version-3.8-red?logo=python&logoColor=white)](https://www.python.org/downloads/release/python-380/)
[![PyPI Latest Release](https://img.shields.io/badge/pypi-v0.10.0-blue?logo=pypi&logoColor=white)](https://pypi.org/project/upgini/)
[![stability-release-candidate](https://img.shields.io/badge/stability-pre--release-br?logo=circleci&logoColor=white)](https://pypi.org/project/upgini/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?logo=python&logoColor=white)](https://github.com/psf/black)
[![Slack upgini](https://img.shields.io/badge/slack-@upgini-orange.svg?logo=slack)](https://upgini.slack.com/messages/C02MW49ADSN)
[![Downloads](https://pepy.tech/badge/upgini)](https://pepy.tech/project/upgini)
## ❔ Overview

**Upgini** is a simple feature search & enrichment library in Python. With Upgini, you spend less time on external data search and feature engineering, because both are done for you automatically. Just use your labeled dataset to initiate a search through thousands of features and data sources, including public datasets and scraped data shared by the data science community. Only features that improve the prediction power of your ML model are returned.  
**Motivation:** for most supervised ML models, external data & features boost accuracy significantly more than any hyperparameter tuning. But the lack of automated and time-efficient search tools for external data blocks mass adoption of external features in ML pipelines.  
We want to radically simplify feature search and delivery for ML pipelines to make external data a standard approach, as hyperparameter tuning is for machine learning nowadays.  
**Mission:** Democratize access to data sources for the data science community

## 🚀 Awesome features
⭐️ Find only features that *give accuracy improvement* according to an accuracy metric (ROC AUC, RMSE, MAE, Accuracy, etc.), not features that are merely correlated with the target variable, which in 9 out of 10 cases gives zero accuracy improvement for production ML cases  
⭐️ Calculate the *accuracy metrics and uplifts* you get when you enrich your existing ML model with external features   
⭐️ Check the stability of the accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate the risks of unstable external data dependencies in the ML pipeline   
⭐️ Scikit-learn compatible interface for quick data integration with your existing ML pipelines  
⭐️ Curated and updated data sources, including public datasets and community shared data  
⭐️ Support for several search key types (including <i>**date/datetime, country, postal/ZIP code, SHA256 hashed email, IPv4, phone**</i>), more to come...  
⭐️ Supported supervised ML tasks:  
  - ☑️ [binary classification](https://en.wikipedia.org/wiki/Binary_classification)  
  - ☑️ [multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification)  
  - ☑️ [regression](https://en.wikipedia.org/wiki/Regression_analysis)  
  - ☑️ [time series prediction](https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting)   
  - 🔜 [recommender system](https://en.wikipedia.org/wiki/Recommender_system)  
## 🏁 Quick start and guides

### 1. Quick start guide

Search for **new features** for the Kaggle [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). The goal is to predict future sales of different goods in stores based on a 5-year history of sales. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).  
Run the [quick start guide notebook](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb) in your browser:

[![Open example in Google Colab](https://img.shields.io/badge/run_example_in-colab-blue?style=for-the-badge&logo=googlecolab)](https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
&nbsp;
[![Open in Binder](https://img.shields.io/badge/run_example_in-mybinder-red.svg?style=for-the-badge&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF6P
nTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7zItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5tHuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC)](https://mybinder.org/v2/gh/upgini/upgini/HEAD?labpath=notebooks%2Fkaggle_example.ipynb)
&nbsp;
<!--
[![Open example in Gitpod](https://img.shields.io/badge/run_example_in-gitpod-orange?style=for-the-badge&logo=gitpod)](https://gitpod.io/#/github.com/upgini/upgini)
-->
The competition dataset was split into train (2013-2016) and test (2017) parts. `FeaturesEnricher` was fitted on the train part, and both datasets were enriched with external features. Finally, an ML model was fitted on both the initial and the enriched datasets to compare accuracy. The model fitted on the enriched dataset achieved a solid improvement of the evaluation metric.

### 2. [Kaggle public kernel for Tabular playground series Jan 2022](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022)

Work in progress...

## Install  

### 🐍 Install from PyPI
```python
%pip install upgini
```
<details>
	<summary>
	🐳 <b>Docker-way</b>
	</summary>
</br>
Clone the repo with <i>git clone https://github.com/upgini/upgini</i> or download it locally, </br>
then follow the steps below to build the Docker container 👇 </br>
Build the Docker image</br>
</br>  
 - ... from cloned git repo:</br>
<i>cd upgini </br>
docker build -t upgini .</i></br>
 - ...or directly from GitHub:</br>
</br>
<i>DOCKER_BUILDKIT=0 docker build -t upgini git@github.com:upgini/upgini.git#main</i></br>
</br>
Run docker image:</br>
<i>
docker run -p 8888:8888 upgini</br>
</i></br>
Open http://localhost:8888?token=&lt;your_token_from_console_output&gt; in your browser  
</details>

## 🌎 Connected data sources and coverage 
We have [two types of data sources](https://upgini.com/#data_sources) with pre-computed features: public data and community shared data:
- **Public data** is available from the public sector, academic institutions, and other sources through open data portals  
- **Community shared data** consists of royalty-free / license-free datasets or features from the data science community (our users). It includes both public and scraped data.
#### 📊 Data coverage and statistics
Total: **239 countries** and **up to 41 years** of history
|Data source|Countries|History, years|
|--|--|--|
|Historical weather & Weather forecast by postal/ZIP code|68|12|
|International holidays & events, workweek calendar|232|22|
|Consumer Confidence index|44|22|
|World economic indicators|191|41|
|Markets data|-|17|
|World demographic data by postal/ZIP code|60|-|
|Public social media profile data for email & phone|104|-|
|World mobile network coverage by postal/ZIP code|167|-|
|Geolocation profile for phone & IPv4 & email|239|-|
|World house prices by postal/ZIP code|44|-|
|🔜 Email/WWW domain profile|-|-|

👉 More details on [datasets and features here](https://upgini.com/#data_sources)

## 💻 How it works?

### 1. 💡 Use your labeled training dataset for search
You can use your labeled training datasets "as is" to initiate the search. Under the hood, we'll search for relevant data using:
- *[search keys](#-search-key-types-we-support-more-is-coming)* from the training dataset to match records from potential data sources with new features
- *labels* from the training dataset to estimate the relevancy of a feature or dataset for your ML task and calculate feature importance metrics  
- *your features* from the training dataset to find only those external datasets and features that give an accuracy improvement over your existing data, and to estimate the accuracy uplift ([optional](#-optional-find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))  

Load the training dataset into a pandas dataframe and separate the feature columns from the label column in a Scikit-learn way:  
```python
import pandas as pd
# labeled training dataset - customer_churn_prediction_train.csv
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]
```
### 2. 🔦 Choose at least one column as a search key
*Search key* columns will be used to match records from all potential external data sources / features 👓. Define at least one search key when initializing the `FeaturesEnricher` class.  
```python
from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(search_keys={"subscription_activation_date": SearchKey.DATE})
```
#### ✨ Search key types we support (more is coming!)
Our team works hard to introduce new search key types. Currently we support:
<table style="table-layout: fixed; text-align: left">
  <tr>
    <th> Search Key<br/>Meaning Type </th>
    <th> Description </th>
    <th> Example </th>
  </tr>
  <tr>
    <td> SearchKey.EMAIL </td>
    <td> e-mail </td>
    <td> <tt>support@upgini.com </tt> </td>
  </tr>
  <tr>
    <td> SearchKey.HEM </td>
    <td>  <tt>sha256(lowercase(email)) </tt> </td>
    <td> <tt>0e2dfefcddc929933dcec9a5c7db7b172482814e63c80b8460b36a791384e955 </tt> </td>
  </tr>
  <tr>
    <td> SearchKey.IP </td>
    <td> IP address (version 4) </td>
    <td> <tt>192.168.0.1 </tt> </td>
  </tr>
  <tr>
    <td> SearchKey.PHONE </td>
    <td> phone number, <a href="https://en.wikipedia.org/wiki/E.164">E.164 standard</a> </td>
    <td> <tt>443451925138 </tt> </td>
  </tr>
  <tr>
    <td> SearchKey.DATE </td>
    <td> date </td>
    <td> 
      <tt>2020-02-12 </tt>&nbsp;(<a href="https://en.wikipedia.org/wiki/ISO_8601">ISO-8601 standard</a>) 
      <br/> <tt>12.02.2020 </tt>&nbsp;(non standard notation) 
    </td>
  </tr>
  <tr>
    <td> SearchKey.DATETIME </td>
    <td> datetime </td>
    <td> <tt>2020-02-12 12:46:18 </tt> <br/> <tt>12:46:18 12.02.2020 </tt> <br/> <tt>unixtimestamp </tt> </td>
  </tr>
  <tr>
    <td> SearchKey.COUNTRY </td>
    <td> <a href="https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes">Country code</a> </td>
    <td> <tt>GB </tt> <br/> <tt>US </tt> <br/> <tt>IN </tt> </td>
  </tr> 
  <tr>
    <td> SearchKey.POSTAL_CODE </td>
    <td> Postal code a.k.a. ZIP code. Can be used only together with SearchKey.COUNTRY  </td>
    <td> <tt>21174 </tt> <br/> <tt>061107 </tt> <br/> <tt>SE-999-99 </tt> </td>
  </tr>
</table>
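
The `SearchKey.HEM` row above describes its value as `sha256(lowercase(email))`. A minimal sketch of producing such a value with the Python standard library is shown here. The helper name `hashed_email` and the stripping of surrounding whitespace are assumptions made for illustration, not part of the Upgini API.

```python
import hashlib


def hashed_email(email: str) -> str:
    """Return the SHA-256 hex digest of the lowercased e-mail address."""
    # Assumption: surrounding whitespace is not meaningful, so it is stripped.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


hem = hashed_email("Support@Upgini.com")
# The digest has 64 hexadecimal characters and does not depend on input case.
print(len(hem), hem == hashed_email("support@upgini.com"))
```

A column of such digests could then be declared as `SearchKey.HEM` instead of exposing raw e-mail addresses.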

#### ⚠️ Requirements for search initialization dataset  
We do dataset verification and cleaning under the hood, but there are still some requirements to follow:  
- Pandas dataframe representation  
- Correct label column types: boolean/integers/strings for binary and multiclass labels, floats for regression  
- At least one column defined as a [search key](#-search-key-types-we-support-more-is-coming)  
- Min size after deduplication by search key column and NaNs removal: *100 records*  
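
To see in advance whether a dataset is likely to meet the minimum size requirement above, a rough local pre-check with pandas might look like the sketch below. The function name `rows_usable_for_search`, the toy data, and the choice to drop rows with a missing key or label are assumptions; the library's own cleaning may differ in detail.

```python
import pandas as pd


def rows_usable_for_search(df: pd.DataFrame, key_columns: list, label_column: str) -> int:
    """Count rows left after dropping NaNs and duplicates by search key."""
    # Assumption: a row is unusable if a search key or the label is missing.
    cleaned = df.dropna(subset=key_columns + [label_column])
    cleaned = cleaned.drop_duplicates(subset=key_columns)
    return len(cleaned)


# Toy data: 150 daily rows, two with a missing date and one repeated date.
dates = pd.date_range("2021-01-01", periods=150, freq="D").strftime("%Y-%m-%d").tolist()
dates[3] = None
dates[7] = None
dates[10] = dates[9]
toy_df = pd.DataFrame({"subscription_activation_date": dates, "churn_flag": [0, 1] * 75})

usable = rows_usable_for_search(toy_df, ["subscription_activation_date"], "churn_flag")
print(usable, usable >= 100)
```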

### 3. 🔍 Start your first feature search!
The main abstraction you interact with is `FeaturesEnricher`. `FeaturesEnricher` is a Scikit-learn compatible estimator, so you can easily add it to your existing ML pipelines. First, create an instance of the `FeaturesEnricher` class. Once it is created, call  
- `fit` to search for relevant datasets & features  
- then `transform` to enrich your dataset with features from the search result  

Let's try it out!
```python
import pandas as pd
from upgini import FeaturesEnricher, SearchKey

# load labeled training dataset to initiate search
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]

# now create an instance of the `FeaturesEnricher` class
enricher = FeaturesEnricher(search_keys={"subscription_activation_date": SearchKey.DATE})

# everything is ready to fit! For 200k records, fitting should take around 10 minutes.
# we send an email notification, just register on upgini.com
enricher.fit(X, y)
```

That's all! We have fitted `FeaturesEnricher`, and any pandas dataframe with exactly the same data schema can be enriched with features from the search results. Use the `transform` method and let the magic do the rest 🪄

```python
# load dataset for enrichment
test_x = pd.read_csv("test.csv")
# enrich it!
enriched_test_features = enricher.transform(test_x)
enriched_test_features.head()
```
### 4. 📈 Evaluate feature importances (SHAP values) from the search result

`FeaturesEnricher` class has two properties for feature importances, which will be filled after fit - `feature_names_` and `feature_importances_`:  
- `feature_names_` -  feature names from the search result, and if parameter `keep_input=True` was used, initial columns from search dataset as well  
- `feature_importances_` - SHAP values for features from the search result, same order as in `feature_names_`  

It also has a method `get_features_info()`, which returns a pandas dataframe with features and full statistics after fit, including SHAP values and match rates:
```python
enricher.get_features_info()
```

You can get more details about `FeaturesEnricher` at runtime using docstrings, for example, via `help(FeaturesEnricher)` or `help(FeaturesEnricher.fit)`.

### 🧹 Search dataset validation
We validate and clean the search initialization dataset under the hood:  
✂️ Check the format of your *search key* columns  
✂️ Check the label column for zero variance  
✂️ Check the dataset for full row duplicates. If we find any, we remove the duplicated rows and make a note of their share  
✂️ Check for inconsistent labels (rows with the same features and keys but different labels). We remove them and make a note of their share  
✂️ Remove columns with zero variance. We treat any non *search key* column in the search dataset as a feature, so columns with zero variance are removed
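
To preview what these checks might flag in your own data before starting a search, the same ideas can be sketched locally with pandas. This is an illustration of the checks described above on made-up toy data, not Upgini's actual implementation; all column names are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02", "2021-01-03"],
        "plan": ["basic", "basic", "pro", "pro", "pro"],
        "constant": [1, 1, 1, 1, 1],
        "label": [0, 0, 1, 0, 1],
    }
)
label_column = "label"
key_columns = ["date"]  # hypothetical search key
feature_columns = ["plan", "constant"]  # every other non-label column

# Share of full-row duplicates (all columns, label included), then drop them.
duplicate_share = toy_df.duplicated().mean()
deduplicated = toy_df.drop_duplicates()

# Inconsistent labels: same keys and features, but more than one label value.
labels_per_group = deduplicated.groupby(key_columns + feature_columns)[label_column].transform("nunique")
consistent = deduplicated[labels_per_group == 1]

# Feature columns with a single distinct value carry no signal.
zero_variance = [c for c in feature_columns if consistent[c].nunique() <= 1]
print(duplicate_share, len(consistent), zero_variance)
```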

### ❔ Supervised ML tasks detection
We detect ML task under the hood based on label column values. Currently we support:  
  - ModelTaskType.BINARY
  - ModelTaskType.MULTICLASS 
  - ModelTaskType.REGRESSION  

In most cases, you don't need to do anything, but for certain search datasets, this detection might fail.  
In this case, you can pass the correct ML task type to `FeaturesEnricher` as a parameter:
```python
from upgini import ModelTaskType
enricher = FeaturesEnricher(
	search_keys={"subscription_activation_date": SearchKey.DATE},
	model_task_type=ModelTaskType.REGRESSION
)
```
#### ⏰ Time Series prediction support  
Time series prediction is supported as `ModelTaskType.REGRESSION` or `ModelTaskType.BINARY` tasks with a time series specific cross-validation split:
* [Scikit-learn time series cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) - `CVType.time_series` parameter
* [Blocked time series cross-validation](https://goldinlocks.github.io/Time-Series-Cross-Validation/#Blocked-and-Time-Series-Split-Cross-Validation) - `CVType.blocked_time_series` parameter

To initiate feature search for *time series prediction*, you can pass cross-validation type parameter to `FeaturesEnricher` with time series specific CV type:
```python
from upgini.metadata import CVType
enricher = FeaturesEnricher(
	search_keys={"sales_date": SearchKey.DATE},
	cv=CVType.time_series
)
```
⚠️ **Pre-process the search dataset** in case of time series prediction:  
Sort the rows in the dataset by observation order, in most cases in ascending order by date/datetime
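
A minimal sketch of this pre-processing step with pandas is shown here, assuming a hypothetical `sales_date` column stored as ISO-8601 strings and a `sales` label. The `key=pd.to_datetime` argument sorts chronologically while leaving the stored column values unchanged.

```python
import pandas as pd

# Toy data in arbitrary order (made-up values).
sales_df = pd.DataFrame(
    {
        "sales_date": ["2021-03-01", "2021-01-01", "2021-02-01"],
        "sales": [30, 10, 20],
    }
)

# Sort by the parsed date so that the order is chronological,
# then rebuild the index to match the new row order.
sales_df = sales_df.sort_values("sales_date", key=pd.to_datetime).reset_index(drop=True)

X = sales_df.drop(columns="sales")
y = sales_df["sales"]
print(X["sales_date"].tolist())
```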

### 🆙 Accuracy and uplift metrics calculations
`FeaturesEnricher` automatically calculates model metrics and the uplift from new relevant features, either via the `calculate_metrics()` method or via the `calculate_metrics=True` parameter of the `fit` or `fit_transform` methods (example below).  
You can use any model estimator with a scikit-learn compatible interface, for example:
* [All Scikit-Learn supervised models](https://scikit-learn.org/stable/supervised_learning.html)
* [Xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)
* [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)
* [CatBoost](https://catboost.ai/en/docs/concepts/python-quickstart)

<details>
	<summary>
		The evaluation metric should be passed to <i>calculate_metrics()</i> via the <i>scoring</i> parameter.<br/>   
		Out of the box, Upgini supports 👉
	</summary>
<table style="table-layout: fixed;">
  <tr>
    <th>Metric</th>
    <th>Description</th>
  </tr>
  <tr>
    <td><tt>explained_variance</tt></td>
    <td>Explained variance regression score function</td>
  </tr>
  <tr>
    <td><tt>r2</tt></td>
    <td>R<sup>2</sup> (coefficient of determination) regression score function</td>
  </tr>
  <tr>
    <td><tt>max_error</tt></td>
    <td>Calculates the maximum residual error (negative - greater is better)</td>
  </tr>
  <tr>
    <td><tt>median_absolute_error</tt></td>
    <td>Median absolute error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_absolute_error</tt></td>
    <td>Mean absolute error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_absolute_percentage_error</tt></td>
    <td>Mean absolute percentage error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_squared_error</tt></td>
    <td>Mean squared error regression loss</td>
  </tr>
  <tr>
	  <td><tt>mean_squared_log_error</tt> (or aliases: <tt>msle</tt>, <tt>MSLE</tt>)</td>
    <td>Mean squared logarithmic error regression loss</td>
  </tr>
  <tr>
    <td><tt>root_mean_squared_log_error</tt> (or aliases: <tt>rmsle</tt>, <tt>RMSLE</tt>)</td>
    <td>Root mean squared logarithmic error regression loss</td>
  </tr>
  <tr>
    <td><tt>root_mean_squared_error</tt></td>
    <td>Root mean squared error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_poisson_deviance</tt></td>
    <td>Mean Poisson deviance regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_gamma_deviance</tt></td>
    <td>Mean Gamma deviance regression loss</td>
  </tr>
  <tr>
    <td><tt>accuracy</tt></td>
    <td>Accuracy classification score</td>
  </tr>
  <tr>
    <td><tt>top_k_accuracy</tt></td>
    <td>Top-k Accuracy classification score</td>
  </tr>
  <tr>
    <td><tt>roc_auc</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    from prediction scores</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovr</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    from prediction scores (multi_class="ovr")</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovo</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    from prediction scores (multi_class="ovo")</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovr_weighted</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    from prediction scores (multi_class="ovr", average="weighted")</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovo_weighted</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    from prediction scores (multi_class="ovo", average="weighted")</td>
  </tr>
  <tr>
    <td><tt>balanced_accuracy</tt></td>
    <td>Compute the balanced accuracy</td>
  </tr>
  <tr>
    <td><tt>average_precision</tt></td>
    <td>Compute average precision (AP) from prediction scores</td>
  </tr>
  <tr>
    <td><tt>log_loss</tt></td>
    <td>Log loss, aka logistic loss or cross-entropy loss</td>
  </tr>
  <tr>
    <td><tt>brier_score</tt></td>
    <td>Compute the Brier score loss</td>
  </tr>
</table>
</details>

In addition to that list, you can define a custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
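
As an illustration, a SMAPE scorer could be sketched as follows. The function name `smape`, the percent scaling, and the convention of treating a term as 0 when both the true and predicted values are 0 are assumptions made here; only `make_scorer` itself comes from scikit-learn.

```python
import numpy as np
from sklearn.metrics import make_scorer


def smape(y_true, y_pred) -> float:
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Assumption: a term counts as 0 where both values are 0,
    # which also avoids division by zero.
    ratio = np.divide(
        np.abs(y_pred - y_true),
        denominator,
        out=np.zeros_like(denominator),
        where=denominator != 0,
    )
    return float(100.0 * np.mean(ratio))


# SMAPE is an error, so lower is better; scikit-learn then negates the score.
smape_scorer = make_scorer(smape, greater_is_better=False)

print(smape([100, 200], [110, 180]))
```

The resulting `smape_scorer` is a callable, so it should be usable wherever a callable is accepted for the `scoring` parameter.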

By default, the `calculate_metrics()` method calculates the evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by the parameter `cv = CVType.<cross-validation-split>`.   
But you can easily define a new split by passing a child of `BaseCrossValidator` to the `cv` parameter of `calculate_metrics()`.

Example with more tips-and-tricks:
```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit
from upgini import FeaturesEnricher, SearchKey

# X, y and eval_set are assumed to be prepared beforehand:
# X, y - labeled training dataset, eval_set - list of (X, y) validation pairs
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})

# Fit with default setup for metrics calculation
# CatBoost will be used
enricher.fit(X, y, eval_set=eval_set, calculate_metrics=True)

# LightGBM estimator for metrics. X and y - same as for fit
custom_estimator = LGBMRegressor()
enricher.calculate_metrics(X, y, eval_set, estimator=custom_estimator)

# Custom metric for the scoring param (callable or name)
custom_scoring = "RMSLE"
enricher.calculate_metrics(X, y, eval_set, scoring=custom_scoring)

# Custom cross validator
custom_cv = TimeSeriesSplit(n_splits=5)
enricher.calculate_metrics(X, y, eval_set, cv=custom_cv)

# All these custom parameters can be combined in fit, fit_transform and calculate_metrics:
enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)
```

### ✅ Optional: find features only give accuracy gain to existing data in the ML model
If you already have features or other external data sources, you can specifically search for new datasets & features that give an accuracy gain "on top" of them.  
Just leave all these existing features in the labeled training dataset. The Upgini library will automatically use them during the feature search process and as a baseline ML model to calculate the accuracy metric uplift, and it won't return any features that might not give an accuracy gain to the existing feature space.  

### ✅ Optional: check robustness of accuracy improvement from external features
You can validate the robustness of external features on an out-of-time dataset using the `eval_set` parameter. Let's do that:
```python
import pandas as pd
from upgini import FeaturesEnricher, SearchKey

# load train dataset
train_df = pd.read_csv("train.csv")
train_ids_and_features = train_df.drop(columns="label")
train_label = train_df["label"]

# load out-of-time validation dataset
eval_df = pd.read_csv("validation.csv")
eval_ids_and_features = eval_df.drop(columns="label")
eval_label = eval_df["label"]
# create FeaturesEnricher
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})

# now we fit WITH eval_set parameter to calculate accuracy metrics on Out-of-time dataset.
# the output will contain quality metrics for both the training data set and
# the eval set (validation OOT data set)
enricher.fit(
  train_ids_and_features,
  train_label,
  eval_set = [(eval_ids_and_features, eval_label)]
)
```
#### ⚠️ Requirements for out-of-time dataset  
- Same data schema as for search initialization dataset  
- Pandas dataframe representation
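
If you do not have a separate validation file, one way to obtain an out-of-time dataset with the same schema is to split a single dataframe by a date cutoff. The sketch below uses pandas only; the column names, the cutoff value, and the toy rows are hypothetical.

```python
import pandas as pd

# Toy data (made-up values); dates are ISO-8601 strings.
df = pd.DataFrame(
    {
        "registration_date": ["2016-11-15", "2016-12-20", "2017-01-05", "2017-02-10"],
        "label": [0, 1, 0, 1],
    }
)

cutoff = "2017-01-01"  # hypothetical cutoff between train and out-of-time parts
# ISO-8601 date strings compare in chronological order.
is_train = df["registration_date"] < cutoff
train_df, eval_df = df[is_train], df[~is_train]

# Both parts come from the same dataframe, so their schemas are identical.
train_ids_and_features = train_df.drop(columns="label")
train_label = train_df["label"]
eval_ids_and_features = eval_df.drop(columns="label")
eval_label = eval_df["label"]

# Same structure as the eval_set argument in the example above.
eval_set = [(eval_ids_and_features, eval_label)]
print(len(train_df), len(eval_df))
```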
  
### ✅ Optional: return initial dataframe enriched with TOP external features by importance
`FeaturesEnricher` can be used with the `fit_transform` method and two parameters:
- `importance_threshold`: float = 0 - only features with *importance >= threshold* will be added to the output dataframe
- `max_features`: int  - only the TOP N features by importance will be returned, where *N = max_features*  

And `keep_input=True` will keep all initial columns from the search dataset X:  
```python
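from upgini import FeaturesEnricher, SearchKey

enricher = FeaturesEnricher(
	search_keys={"subscription_activation_date": SearchKey.DATE}
)
# fit_transform is called on the enricher and returns the enriched dataframe
enriched_dataframe = enricher.fit_transform(X, y, keep_input=True, max_features=2)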
```
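
The selection rule described by `importance_threshold` and `max_features` can be illustrated in plain Python. This is only a sketch of the semantics as stated above, not the library's code; the helper `select_top_features`, the order in which the two filters are applied, and the feature names and values are assumptions.

```python
from typing import Dict, List, Optional


def select_top_features(
    importances: Dict[str, float],
    importance_threshold: float = 0.0,
    max_features: Optional[int] = None,
) -> List[str]:
    """Keep features with importance >= threshold, then the top N by importance."""
    kept = [(name, value) for name, value in importances.items() if value >= importance_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    if max_features is not None:
        kept = kept[:max_features]
    return [name for name, _ in kept]


# Made-up feature names and importances, for illustration only.
example_importances = {"cci_index": 0.02, "weather_temp": 0.31, "holiday_flag": 0.11}
selected = select_top_features(example_importances, importance_threshold=0.05, max_features=2)
print(selected)
```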

### ✅ Optional: reuse completed search for enrichment
`FeaturesEnricher` can be initialized with the search id of a completed search:
- `search_id`: str - id of a completed fit operation (`enricher.get_search_id()`)

Search keys and features in X should be the same as for fit

```python
from upgini import FeaturesEnricher, SearchKey

enricher = FeaturesEnricher(
  search_keys={"date": SearchKey.DATE},
  search_id = "abcdef00-0000-0000-0000-999999999999"
)

enricher.transform(X)
```

### 👩🏻‍💻 How can I share data/features with a community ? 
If you have ANY data which you consider royalty-free / license-free ([Open Data](http://opendatahandbook.org/guide/en/what-is-open-data/)) and potentially valuable for ML applications, you may publish it for **community usage**:   
1. Please sign up [here](https://profile.upgini.com)
2. Copy the *Upgini API key* from your profile and upload your data from the Upgini Python library with this key:
```python
import pandas as pd
from upgini import SearchKey
from upgini.ads import upload_user_ads
import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
# you can define a custom search key that might not be supported yet, just use the SearchKey.CUSTOM_KEY type
sample_df = pd.read_csv("path_to_data_sample_file")
upload_user_ads("test", sample_df, {
    "city": SearchKey.CUSTOM_KEY,
    "stats_date": SearchKey.DATE
})
```
3. After data verification, search results on community data will be available in the usual way

## 🛠 Getting Help & Community
Please note that we are still in a beta stage.
Requests and support, in preferred order  
[![Claim help in slack](https://img.shields.io/badge/slack-@upgini-orange.svg?style=for-the-badge&logo=slack)](https://upgini.slack.com/messages/C02MW49ADSN)
[![Open GitHub issue](https://img.shields.io/badge/open%20issue%20on-github-blue?style=for-the-badge&logo=github)](https://github.com/upgini/upgini/issues)  
Please try to create bug reports that are:
- _Reproducible._ Include steps to reproduce the problem.
- _Specific._ Include as much detail as possible: which Python version, what environment, etc.
- _Unique._ Do not duplicate existing opened issues.
- _Scoped to a Single Bug._ One bug per report.

## 🧩 Contributing
We are a **very** small team and this is a part-time project for us, so most probably we won't be able to:
 - implement smooth integration with the most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc. )
 - implement all possible data verification and normalization capabilities for different types of search keys (we just started with the current 6 types)

And we need some help from the community.
So, we'll be happy about every **pull request** you open and every **issue** you find to make this library **more awesome**. Please note that it might sometimes take us a while to get back to you.
**For major changes**, please open an issue first to discuss what you would like to change
#### Developing
Some convenient ways to start contributing are:  
⚙️ [**Open in Visual Studio Code**](https://open.vscode.dev/upgini/upgini) You can remotely open this repo in VS Code without cloning it, or automatically clone and open it inside a Docker container.  
⚙️ **Gitpod** [![Gitpod Ready-to-Code](https://img.shields.io/badge/Gitpod-Ready--to--Code-blue?logo=gitpod)](https://gitpod.io/#https://github.com/upgini/upgini) You can use Gitpod to launch a fully functional development environment right in your browser.

## 🔗 Useful links
- [Quick start guide](https://upgini.com/#quick-start)
- [Kaggle example notebook](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
- [Project on PyPI](https://pypi.org/project/upgini)
- [Get API Key](https://profile.upgini.com)

<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>


