Metadata-Version: 2.1
Name: ts2ml
Version: 0.0.3
Summary: Tools to Transform a Time Series into Features and Target a.k.a Supervised Learning
Home-page: https://github.com/joaopcnogueira/ts2ml
Author: João Nogueira
Author-email: joaopcnogueira@gmail.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

# ts2ml

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

``` sh
pip install ts2ml
```

## How to use

``` python
import pandas as pd
from ts2ml.core import add_missing_slots
```

``` python
df = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1]
})
df
```

<div>

|     | pickup_hour         | pickup_location_id | rides |
|-----|---------------------|--------------------|-------|
| 0   | 2022-01-01 00:00:00 | 1                  | 2     |
| 1   | 2022-01-01 01:00:00 | 1                  | 3     |
| 2   | 2022-01-01 03:00:00 | 1                  | 1     |
| 3   | 2022-01-01 01:00:00 | 2                  | 1     |
| 4   | 2022-01-01 02:00:00 | 2                  | 2     |
| 5   | 2022-01-01 05:00:00 | 2                  | 1     |

</div>

``` python
add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
```

    100%|██████████| 2/2 [00:00<00:00, 352.17it/s]

<div>

|     | pickup_hour         | pickup_location_id | rides |
|-----|---------------------|--------------------|-------|
| 0   | 2022-01-01 00:00:00 | 1                  | 2     |
| 1   | 2022-01-01 01:00:00 | 1                  | 3     |
| 2   | 2022-01-01 02:00:00 | 1                  | 0     |
| 3   | 2022-01-01 03:00:00 | 1                  | 1     |
| 4   | 2022-01-01 04:00:00 | 1                  | 0     |
| 5   | 2022-01-01 05:00:00 | 1                  | 0     |
| 6   | 2022-01-01 00:00:00 | 2                  | 0     |
| 7   | 2022-01-01 01:00:00 | 2                  | 1     |
| 8   | 2022-01-01 02:00:00 | 2                  | 2     |
| 9   | 2022-01-01 03:00:00 | 2                  | 0     |
| 10  | 2022-01-01 04:00:00 | 2                  | 0     |
| 11  | 2022-01-01 05:00:00 | 2                  | 1     |

</div>
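
Judging from the output above, every entity is extended to the full range between the earliest and latest timestamp in the whole frame, and absent slots get a value of 0. The sketch below reproduces that behaviour with plain pandas. `fill_missing_slots_sketch` is a hypothetical helper written for illustration, not the ts2ml implementation, and it assumes at most one row per (entity, timestamp) pair.

``` python
import pandas as pd

def fill_missing_slots_sketch(df, datetime_col, entity_col, value_col, freq):
    """Hypothetical pandas-only stand-in; not the ts2ml implementation."""
    out = df.copy()
    out[datetime_col] = pd.to_datetime(out[datetime_col])
    # One slot per period between the global min and max timestamps
    full_range = pd.date_range(out[datetime_col].min(), out[datetime_col].max(), freq=freq)
    entities = sorted(out[entity_col].unique())
    # Cartesian product of entities and slots
    full_index = pd.MultiIndex.from_product(
        [entities, full_range], names=[entity_col, datetime_col]
    )
    filled = (
        out.set_index([entity_col, datetime_col])[value_col]
        .reindex(full_index, fill_value=0)  # missing slots become 0
        .reset_index()
    )
    return filled[[datetime_col, entity_col, value_col]]

rides_df = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00',
                    '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1],
})
# Lowercase 'h' is the hourly alias preferred by recent pandas releases
sketch = fill_missing_slots_sketch(rides_df, 'pickup_hour', 'pickup_location_id', 'rides', freq='h')
print(sketch['rides'].tolist())  # [2, 3, 0, 1, 0, 0, 0, 1, 2, 0, 0, 1]
```

The `reindex` call raises an error if the (entity, timestamp) index contains duplicates, so aggregate the data first if that can happen.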

## Another Example

A monthly spaced time series:

``` python
import pandas as pd
import numpy as np

# Generate timestamp index with monthly frequency
date_rng = pd.date_range(start='1/1/2020', end='12/1/2022', freq='MS')

# Create list of city codes
cities = ['FOR', 'SP', 'RJ']

# Create dataframe with random sales data; each city covers a different year (12 consecutive months)
df = pd.DataFrame({
    'timestamp': date_rng,
    'city': np.repeat(cities, len(date_rng)//len(cities)),
    'sales': np.random.randint(1000, 5000, size=len(date_rng))
})
df
```

<div>

|     | timestamp  | city | sales |
|-----|------------|------|-------|
| 0   | 2020-01-01 | FOR  | 4216  |
| 1   | 2020-02-01 | FOR  | 4309  |
| 2   | 2020-03-01 | FOR  | 3639  |
| 3   | 2020-04-01 | FOR  | 3685  |
| 4   | 2020-05-01 | FOR  | 4481  |
| 5   | 2020-06-01 | FOR  | 4133  |
| 6   | 2020-07-01 | FOR  | 3504  |
| 7   | 2020-08-01 | FOR  | 3957  |
| 8   | 2020-09-01 | FOR  | 2781  |
| 9   | 2020-10-01 | FOR  | 2996  |
| 10  | 2020-11-01 | FOR  | 3963  |
| 11  | 2020-12-01 | FOR  | 2381  |
| 12  | 2021-01-01 | SP   | 1489  |
| 13  | 2021-02-01 | SP   | 3863  |
| 14  | 2021-03-01 | SP   | 4005  |
| 15  | 2021-04-01 | SP   | 3612  |
| 16  | 2021-05-01 | SP   | 4823  |
| 17  | 2021-06-01 | SP   | 1687  |
| 18  | 2021-07-01 | SP   | 3688  |
| 19  | 2021-08-01 | SP   | 1729  |
| 20  | 2021-09-01 | SP   | 1496  |
| 21  | 2021-10-01 | SP   | 2460  |
| 22  | 2021-11-01 | SP   | 1448  |
| 23  | 2021-12-01 | SP   | 3174  |
| 24  | 2022-01-01 | RJ   | 1201  |
| 25  | 2022-02-01 | RJ   | 3210  |
| 26  | 2022-03-01 | RJ   | 4580  |
| 27  | 2022-04-01 | RJ   | 1318  |
| 28  | 2022-05-01 | RJ   | 4607  |
| 29  | 2022-06-01 | RJ   | 1565  |
| 30  | 2022-07-01 | RJ   | 2935  |
| 31  | 2022-08-01 | RJ   | 3924  |
| 32  | 2022-09-01 | RJ   | 1577  |
| 33  | 2022-10-01 | RJ   | 4395  |
| 34  | 2022-11-01 | RJ   | 1867  |
| 35  | 2022-12-01 | RJ   | 2739  |

</div>

``` python
df.groupby('city').agg({'timestamp': ['min', 'max']})
```

<div>

|      | timestamp  |            |
|------|------------|------------|
|      | min        | max        |
| city |            |            |
| FOR  | 2020-01-01 | 2020-12-01 |
| RJ   | 2022-01-01 | 2022-12-01 |
| SP   | 2021-01-01 | 2021-12-01 |

</div>

City FOR only has data for 2020, RJ only for 2022, and SP only for
2021. Let’s also simulate more missing slots by dropping random rows.

``` python
# Generate random indices to drop
drop_indices = np.random.choice(df.index, size=int(len(df)*0.2), replace=False)

# Drop selected rows from dataframe
df = df.drop(drop_indices)
df.reset_index(drop=True, inplace=True)
df
```

<div>

|     | timestamp  | city | sales |
|-----|------------|------|-------|
| 0   | 2020-01-01 | FOR  | 4216  |
| 1   | 2020-03-01 | FOR  | 3639  |
| 2   | 2020-05-01 | FOR  | 4481  |
| 3   | 2020-06-01 | FOR  | 4133  |
| 4   | 2020-07-01 | FOR  | 3504  |
| 5   | 2020-08-01 | FOR  | 3957  |
| 6   | 2020-09-01 | FOR  | 2781  |
| 7   | 2020-10-01 | FOR  | 2996  |
| 8   | 2020-11-01 | FOR  | 3963  |
| 9   | 2020-12-01 | FOR  | 2381  |
| 10  | 2021-01-01 | SP   | 1489  |
| 11  | 2021-02-01 | SP   | 3863  |
| 12  | 2021-07-01 | SP   | 3688  |
| 13  | 2021-08-01 | SP   | 1729  |
| 14  | 2021-10-01 | SP   | 2460  |
| 15  | 2022-01-01 | RJ   | 1201  |
| 16  | 2022-03-01 | RJ   | 4580  |
| 17  | 2022-04-01 | RJ   | 1318  |
| 18  | 2022-05-01 | RJ   | 4607  |
| 19  | 2022-07-01 | RJ   | 2935  |
| 20  | 2022-08-01 | RJ   | 3924  |
| 21  | 2022-09-01 | RJ   | 1577  |
| 22  | 2022-11-01 | RJ   | 1867  |
| 23  | 2022-12-01 | RJ   | 2739  |

</div>
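
The row drop above is unseeded, so the rows removed change on every run. A sketch of a reproducible variant follows, using NumPy's `Generator` API. The helper name `drop_random_rows`, the `demo` frame, and the seed values are illustrative choices, not part of ts2ml.

``` python
import numpy as np
import pandas as pd

def drop_random_rows(df, frac, seed):
    """Drop int(len(df) * frac) rows picked by a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_drop = int(len(df) * frac)
    drop_idx = rng.choice(df.index.to_numpy(), size=n_drop, replace=False)
    return df.drop(drop_idx).reset_index(drop=True)

months = pd.date_range(start='2020-01-01', end='2022-12-01', freq='MS')
demo = pd.DataFrame({
    'timestamp': months,
    'city': np.repeat(['FOR', 'SP', 'RJ'], len(months) // 3),
    'sales': np.random.default_rng(0).integers(1000, 5000, size=len(months)),
})

first = drop_random_rows(demo, frac=0.2, seed=42)
second = drop_random_rows(demo, frac=0.2, seed=42)
print(len(demo), len(first))  # 36 29
print(first.equals(second))   # True
```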

Now let’s fill the missing slots. `add_missing_slots` adds a row with a
zero value for every missing (city, month) combination:

``` python
df_full = add_missing_slots(df, datetime_col='timestamp', entity_col='city', value_col='sales', freq='MS')
df_full
```

    100%|██████████| 3/3 [00:00<00:00, 844.15it/s]

<div>

|     | timestamp  | city | sales |
|-----|------------|------|-------|
| 0   | 2020-01-01 | FOR  | 4216  |
| 1   | 2020-02-01 | FOR  | 0     |
| 2   | 2020-03-01 | FOR  | 3639  |
| 3   | 2020-04-01 | FOR  | 0     |
| 4   | 2020-05-01 | FOR  | 4481  |
| ... | ...        | ...  | ...   |
| 103 | 2022-08-01 | RJ   | 3924  |
| 104 | 2022-09-01 | RJ   | 1577  |
| 105 | 2022-10-01 | RJ   | 0     |
| 106 | 2022-11-01 | RJ   | 1867  |
| 107 | 2022-12-01 | RJ   | 2739  |

<p>108 rows × 3 columns</p>
</div>

``` python
df_full.groupby('city').agg({'timestamp': ['min', 'max']})
```

<div>

|      | timestamp  |            |
|------|------------|------------|
|      | min        | max        |
| city |            |            |
| FOR  | 2020-01-01 | 2022-12-01 |
| RJ   | 2020-01-01 | 2022-12-01 |
| SP   | 2020-01-01 | 2022-12-01 |

</div>
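
The 108 rows in `df_full` are what a complete grid predicts: 3 cities × 36 month starts from 2020-01-01 to 2022-12-01. A rough check along those lines could look like the sketch below. `is_complete_grid` is a hypothetical helper, and the toy frame `full` stands in for `df_full` so the snippet runs without ts2ml.

``` python
import pandas as pd

def is_complete_grid(df, datetime_col, entity_col, freq):
    """Rough check that every entity has one row per slot (hypothetical helper)."""
    ts = pd.to_datetime(df[datetime_col])
    expected = pd.date_range(ts.min(), ts.max(), freq=freq)
    # Distinct timestamps seen for each entity
    per_entity = ts.groupby(df[entity_col]).nunique()
    # Total row count must match entities x slots, which rules out duplicates
    no_duplicates = len(df) == len(expected) * df[entity_col].nunique()
    return bool((per_entity == len(expected)).all()) and no_duplicates

months = pd.date_range('2020-01-01', '2022-12-01', freq='MS')
grid = pd.MultiIndex.from_product([['FOR', 'RJ', 'SP'], months], names=['city', 'timestamp'])
full = pd.DataFrame({'sales': 0}, index=grid).reset_index()

print(len(months), len(full))                                            # 36 108
print(is_complete_grid(full, 'timestamp', 'city', 'MS'))                 # True
print(is_complete_grid(full.drop(index=5), 'timestamp', 'city', 'MS'))   # False
```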


