Metadata-Version: 2.1
Name: nbmodular
Version: 0.0.22
Summary: Convert notebooks to modular code
Home-page: https://github.com/JaumeAmoresDS/nbmodular
Author: Jaume Amores
Author-email: jaume.dsdev@gmail.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastcore
Requires-Dist: nbdev
Requires-Dist: pandas
Requires-Dist: ipynbname
Requires-Dist: scikit-learn
Requires-Dist: fastdot
Provides-Extra: dev
Requires-Dist: pytest ; extra == 'dev'

nbmodular
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Convert data science notebooks with poor modularity to fully modular
notebooks that are automatically exported as python modules.

## Motivation

In data science, it is common to develop quickly and experimentally in
notebooks, with little regard for software engineering practices and
modularity. It can be challenging to start working on someone else’s
notebooks when the code is not split into separate functions and much of
it is duplicated across notebooks. This makes it difficult to understand
the logic in terms of semantically separate units, to see the
commonalities and differences between the notebooks, and to extend,
generalize, and configure the current solution.

## Objectives

`nbmodular` is a library that helps convert the cells of a notebook into
separate functions with clear dependencies in terms of inputs and
outputs. This is done through a combination of tools which
semi-automatically understand the data-flow in the code, based on mild
assumptions about its structure. It also helps test the current logic
and compare it against a modularized solution, to make sure that the
refactored code is equivalent to the original one.
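
As a rough illustration of how data flow can be inferred from the code of a cell, the sketch below uses Python's standard `ast` module to collect the names a snippet reads and the names it assigns. The helper names are hypothetical and are not part of nbmodular's API; the library's own analysis may work differently.

```python
import ast
import builtins


def read_and_written_names(code):
    """Return (names read, names assigned) in a snippet of cell code."""
    read, written = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                written.add(node.id)
            elif isinstance(node.ctx, ast.Load) and not hasattr(builtins, node.id):
                read.add(node.id)
    return read, written


def infer_inputs(code, previously_defined):
    """Names the cell reads that were created by earlier cells."""
    read, _ = read_and_written_names(code)
    return sorted(read & set(previously_defined))


read, written = read_and_written_names("a = a + d\nb = b + d\nc = c + d")
print(sorted(read), sorted(written))
```

This ignores the order of reads and writes inside the cell, so it only gives an estimate of the real dependencies.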

## Features

- [x] Convert cells to functions.
- [x] The logic of a single function can be written across multiple
  cells.
- [x] Functions can be either regular functions or unit test functions.
- [x] Functions and tests are exported to separate python modules.
- [ ] TODO: use nbdev to sync the exported python module with the
  notebook code, so that changes to the module are reflected back in the
  notebook.
- [x] Processed cells can continue to operate as cells or be only used
  as functions.
- [x] A pipeline function is automatically created and updated. This
  pipeline provides the data-flow from the first to the last function
  call in the notebook.
- [x] Functions act as nodes in a dependency graph. These nodes can
  optionally hold the values of local variables for inspection outside
  of the function. This is similar to having a single global scope,
  which is the original situation. Since this is memory-consuming,
  storing local variables is optional.
- [x] Local variables are persisted to disk, so that we may decide to
  reuse previous results without running the whole notebook.
- [ ] TODO: Once we are able to construct a graph, we may be able to
  draw it or show it in text, and pass it to DAG processors that can run
  functions sequentially or in parallel.
- [ ] TODO: if we have the dependency graph and persisted inputs /
  outputs, we may decide to only run those cells that are predecessors
  of the current one, i.e., the ones that provide the inputs needed by
  the current cell.
- [ ] TODO: if we associate a hash code to input data, we may only run
  the cells when the input data changes. Similarly, if we associate a
  hash code with AST-converted function code, we may only run those
  cells whose code has been updated.
- [ ] TODO: the output of a test cell can be used for assertions, where
  we require that the current output is the same as the original one.
- [ ] TODO: Compare the result of the pipeline with the result of
  running the original notebook.
- [ ] TODO: Currently, AST processing is used for assessing whether
  variables are modified in the cell or are just read. This just gives
  an estimate. We may want to compare the values of existing variables
  before and after running the code in the cell. We may also use a type
  checker such as mypy to assess whether a variable is immutable in the
  cell (e.g., mark the variable as `Final` and see if mypy complains).
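
The TODO item about hashing AST-converted code could be approached along the following lines. This is only a sketch under the assumption that hashing the `ast.dump` of the code is an acceptable fingerprint; `code_fingerprint` and `needs_rerun` are hypothetical names, not nbmodular functions.

```python
import ast
import hashlib


def code_fingerprint(code):
    """Hash the AST dump, so that formatting and comments do not change the hash."""
    tree = ast.parse(code)
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def needs_rerun(code, previous_fingerprint):
    """True when the logic of the code differs from the stored fingerprint."""
    return code_fingerprint(code) != previous_fingerprint


old = code_fingerprint("c = a+b")
print(needs_rerun("c = a + b  # same logic", old))  # False
print(needs_rerun("c = a - b", old))                # True
```

Because `ast.dump` omits line and column attributes by default, whitespace and comment changes leave the fingerprint untouched.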

## Install

``` sh
pip install nbmodular
```

## Usage

Load the IPython extension.

This allows us to use the following magic commands, among others:

- `function <name_of_function_to_define>`
- `print <name_of_previous_function>`
- `function_info <name_of_previous_function>`
- `print_pipeline`

Let’s go through them one by one.

### function

The magic command `function` allows us to:

- Run the code in the cell normally, and at the same time detect its
  input and output dependencies and define a function with this input
  and output:

``` python
a = 2
b = 3
c = a+b
print (a+b)
```

    5

The code in the previous cell runs as it normally would, but at the
same time it defines a function named `get_initial_values`, which we can
show with the magic command `print`:

``` python
```

    def get_initial_values(test=False):
        a = 2
        b = 3
        c = a+b
        print (a+b)

This function is defined in the notebook space, so we can invoke it:

``` python
```

    def get_initial_values(test=False):
        a = 2
        b = 3
        c = a+b
        print (a+b)

The inputs and outputs of the function change dynamically every time we
add a new function cell. For example, if we add a new function `get_d`:

``` python
d = 10
```

``` python
```

    def get_d():
        d = 10

And then a function `add_all` that depends on the previous two functions:

``` python
a = a + d
b = b + d
c = c + d
```

``` python
f = %function_info add_all
```

``` python
print(f.code)
```

    def add_all(d, b, c, a):
        a = a + d
        b = b + d
        c = c + d

``` python
```


    from sklearn.utils import Bunch
    from pathlib import Path
    import joblib
    import pandas as pd
    import numpy as np

    def test_index_pipeline (test=True, prev_result=None, result_file_name="index_pipeline"):
        result = index_pipeline (test=test, load=True, save=True, result_file_name=result_file_name)
        if prev_result is None:
            prev_result = index_pipeline (test=test, load=True, save=True, result_file_name=f"test_{result_file_name}")
        for k in prev_result:
            assert k in result
            if type(prev_result[k]) is pd.DataFrame:    
                pd.testing.assert_frame_equal (result[k], prev_result[k])
            elif type(prev_result[k]) is np.array:
                np.testing.assert_array_equal (result[k], prev_result[k])
            else:
                assert result[k]==prev_result[k]

``` python
```


    def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):

        # load result
        result_file_name += '.pk'
        path_variables = Path ("index") / result_file_name
        if load and path_variables.exists():
            result = joblib.load (path_variables)
            return result

        b, c, a = get_initial_values (test=test)
        d = get_d ()
        add_all (d, b, c, a)

        # save result
        result = Bunch (b=b,c=c,a=a,d=d)
        if save:    
            path_variables.parent.mkdir (parents=True, exist_ok=True)
            joblib.dump (result, path_variables)
        return result

We can see that the outputs from `get_initial_values` and `get_d` change
as needed. We can look at all the functions defined so far by using
`print all`:

``` python
```

    def get_initial_values(test=False):
        a = 2
        b = 3
        c = a+b
        print (a+b)
        return b,c,a

    def get_d():
        d = 10
        return d

    def add_all(d, b, c, a):
        a = a + d
        b = b + d
        c = c + d

Similarly, the outputs from the last function `add_all` change after we
add other functions that depend on it:

``` python
print (a, b, c, d)
```

    12 13 15 10
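
The conversion described in this section can be pictured as wrapping the source of a cell in a `def` whose signature and `return` statement come from the detected inputs and outputs. The following is a minimal sketch of that idea; `cell_to_function` is a hypothetical helper written for illustration and does not reproduce how nbmodular builds its functions.

```python
import textwrap


def cell_to_function(name, cell_code, arguments=(), outputs=()):
    """Wrap the source of a cell in a function definition (illustrative only)."""
    body = textwrap.indent(cell_code.strip("\n"), " " * 4)
    source = f"def {name}({', '.join(arguments)}):\n{body}\n"
    if outputs:
        source += f"    return {', '.join(outputs)}\n"
    return source


source = cell_to_function("get_d", "d = 10", outputs=["d"])
print(source)

# Define the function from its generated source and call it
namespace = {}
exec(source, namespace)
print(namespace["get_d"]())
```

Re-generating the source whenever a later cell needs one more variable is what would make the signature and return values change over time.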

### print

We can see each of the defined functions with `print my_function`, and
list all of them with `print all`:

``` python
```

    def get_initial_values(test=False):
        a = 2
        b = 3
        c = a+b
        print (a+b)
        return b,c,a

    def get_d():
        d = 10
        return d

    def add_all(d, b, c, a):
        a = a + d
        b = b + d
        c = c + d
        return b,c,a

    def print_all(b, d, a, c):
        print (a, b, c, d)

### print_pipeline

As we add functions to the notebook, a pipeline function is defined. We
can print this pipeline with the magic `print_pipeline`

``` python
```


    def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):

        # load result
        result_file_name += '.pk'
        path_variables = Path ("index") / result_file_name
        if load and path_variables.exists():
            result = joblib.load (path_variables)
            return result

        b, c, a = get_initial_values (test=test)
        d = get_d ()
        b, c, a = add_all (d, b, c, a)
        print_all (b, d, a, c)

        # save result
        result = Bunch (b=b,d=d,c=c,a=a)
        if save:    
            path_variables.parent.mkdir (parents=True, exist_ok=True)
            joblib.dump (result, path_variables)
        return result

This shows the data flow in terms of inputs and outputs.

We can also inspect the list of functions registered so far, and run the
pipeline:

``` python
self = %cell_processor
```

``` python
self.function_list
```

    [FunctionProcessor with name get_initial_values, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
         Arguments: []
         Output: ['b', 'c', 'a']
         Locals: dict_keys(['a', 'b', 'c']),
     FunctionProcessor with name get_d, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
         Arguments: []
         Output: ['d']
         Locals: dict_keys(['d']),
     FunctionProcessor with name add_all, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
         Arguments: ['d', 'b', 'c', 'a']
         Output: ['b', 'c', 'a']
         Locals: dict_keys(['a', 'b', 'c']),
     FunctionProcessor with name print_all, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
         Arguments: ['b', 'd', 'a', 'c']
         Output: []
         Locals: dict_keys([])]
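
The `Arguments` and `Output` lists above are enough to derive a dependency graph between functions. The sketch below assumes that each input is provided by the most recent earlier function that outputs a variable with that name; `build_dependency_graph` is a hypothetical helper, not an nbmodular function.

```python
def build_dependency_graph(functions):
    """Map each function name to the earlier functions that produce its inputs.

    `functions` is an ordered list of (name, arguments, outputs) tuples.
    """
    last_producer = {}  # variable name -> function that last produced it
    graph = {}
    for name, arguments, outputs in functions:
        graph[name] = sorted({last_producer[a] for a in arguments if a in last_producer})
        for variable in outputs:
            last_producer[variable] = name
    return graph


functions = [
    ("get_initial_values", [], ["b", "c", "a"]),
    ("get_d", [], ["d"]),
    ("add_all", ["d", "b", "c", "a"], ["b", "c", "a"]),
    ("print_all", ["b", "d", "a", "c"], []),
]
graph = build_dependency_graph(functions)
for name, predecessors in graph.items():
    print(name, "<-", predecessors)
```

With such a graph, only the predecessors of a given function would need to run in order to provide its inputs.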

``` python
```

    def get_initial_values(test=False):
        a = 2
        b = 3
        c = a+b
        print (a+b)
        return b,c,a

    def get_d():
        d = 10
        return d

    def add_all(d, b, c, a):
        a = a + d
        b = b + d
        c = c + d
        return b,c,a

    def print_all(b, d, a, c):
        print (a, b, c, d)

``` python
index_pipeline()
```

    {'d': 10, 'b': 13, 'a': 12, 'c': 15}
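
The `load` and `save` arguments of the generated pipeline follow a common caching pattern: return the stored result if it exists, otherwise compute it and store it. Here is a self-contained sketch of that pattern. It swaps in the standard `pickle` module and a plain dictionary in place of `joblib` and `Bunch`, and `cached_pipeline` is a hypothetical name.

```python
import pickle
import tempfile
from pathlib import Path


def cached_pipeline(compute, path, load=True, save=True):
    """Return a stored result if present; otherwise compute and optionally store it."""
    path = Path(path)
    if load and path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    if save:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(result, f)
    return result


calls = []


def compute():
    calls.append(1)  # record how many times the computation really runs
    return {"a": 12, "b": 13, "c": 15, "d": 10}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index" / "index_pipeline.pk"
    first = cached_pipeline(compute, path)
    second = cached_pipeline(compute, path)  # served from disk; compute() is not called again

print(first == second, len(calls))
```

Passing `load=False` forces a fresh run, which is useful after the code of any function has changed.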

### function_info

We can get access to many of the details of each of the defined
functions by calling `function_info` on a given function name:

``` python
get_initial_values_info = %function_info get_initial_values
```

This allows us to see:

- The name and value (at the time of running) of the local variables,
  arguments and results from the function:

``` python
get_initial_values_info.arguments
```

    []

``` python
get_initial_values_info.current_values
```

    {'a': 2, 'b': 3, 'c': 5}

``` python
get_initial_values_info.return_values
```

    ['b', 'c', 'a']

We can also inspect the original code written in the cell…

``` python
print (get_initial_values_info.original_code)
```

    a = 2
    b = 3
    c = a+b
    print (a+b)

the code of the defined function:

``` python
print (get_initial_values_info.code)
```

    def get_initial_values(test=False):
        a = 2
        b = 3
        c = a+b
        print (a+b)
        return b,c,a

… and the AST trees:

``` python
print (get_initial_values_info.get_ast (code=get_initial_values_info.original_code))
```

    Module(
      body=[
        Assign(
          targets=[
            Name(id='a', ctx=Store())],
          value=Constant(value=2)),
        Assign(
          targets=[
            Name(id='b', ctx=Store())],
          value=Constant(value=3)),
        Assign(
          targets=[
            Name(id='c', ctx=Store())],
          value=BinOp(
            left=Name(id='a', ctx=Load()),
            op=Add(),
            right=Name(id='b', ctx=Load()))),
        Expr(
          value=Call(
            func=Name(id='print', ctx=Load()),
            args=[
              BinOp(
                left=Name(id='a', ctx=Load()),
                op=Add(),
                right=Name(id='b', ctx=Load()))],
            keywords=[]))],
      type_ignores=[])
    None

``` python
print (get_initial_values_info.get_ast (code=get_initial_values_info.code))
```

    Module(
      body=[
        FunctionDef(
          name='get_initial_values',
          args=arguments(
            posonlyargs=[],
            args=[
              arg(arg='test')],
            kwonlyargs=[],
            kw_defaults=[],
            defaults=[
              Constant(value=False)]),
          body=[
            Assign(
              targets=[
                Name(id='a', ctx=Store())],
              value=Constant(value=2)),
            Assign(
              targets=[
                Name(id='b', ctx=Store())],
              value=Constant(value=3)),
            Assign(
              targets=[
                Name(id='c', ctx=Store())],
              value=BinOp(
                left=Name(id='a', ctx=Load()),
                op=Add(),
                right=Name(id='b', ctx=Load()))),
            Expr(
              value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                  BinOp(
                    left=Name(id='a', ctx=Load()),
                    op=Add(),
                    right=Name(id='b', ctx=Load()))],
                keywords=[])),
            Return(
              value=Tuple(
                elts=[
                  Name(id='b', ctx=Load()),
                  Name(id='c', ctx=Load()),
                  Name(id='a', ctx=Load())],
                ctx=Load()))],
          decorator_list=[])],
      type_ignores=[])
    None
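
Dumps in this style can also be produced directly with the standard library, which may help when experimenting outside nbmodular. The snippet assumes Python 3.9 or later, where `ast.dump` accepts the `indent` argument; whether `get_ast` uses `ast.dump` internally is not stated here.

```python
import ast

code = "c = a+b"
tree = ast.parse(code)

# Pretty-print the tree of a one-line cell
dump = ast.dump(tree, indent=2)
print(dump)
```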

Now, we can define another function in a cell that uses variables from
the previous function.

### cell_processor

This magic gives us access to the `CellProcessor` object managing the
logic for running the above magic commands, which can come in handy:

``` python
cell_processor = %cell_processor
```

## Merging function cells

In order to explore intermediate results, it is convenient to split the
code of a function across different cells. This can be done by passing
the flag `--merge True`:

``` python
x = [1, 2, 3]
y = [100, 200, 300]
z = [u+v for u,v in zip(x,y)]
```

``` python
z
```

    [101, 202, 303]

``` python
```

    def analyze():
        x = [1, 2, 3]
        y = [100, 200, 300]
        z = [u+v for u,v in zip(x,y)]

``` python
product = [u*v for u, v in zip(x,y)]
```

``` python
```

    def analyze():
        x = [1, 2, 3]
        y = [100, 200, 300]
        z = [u+v for u,v in zip(x,y)]
        product = [u*v for u, v in zip(x,y)]
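
One way to picture merging is as a registry that appends the code of each new cell to the body of the function with the same name. The sketch below is an assumption about the general idea, not nbmodular's implementation; `FunctionBuilder` is hypothetical, and it assumes that a cell without the merge flag replaces the previous definition.

```python
import textwrap


class FunctionBuilder:
    """Accumulate the code of several cells under one function name."""

    def __init__(self):
        self.cells = {}  # function name -> list of cell sources

    def add_cell(self, name, code, merge=False):
        if merge and name in self.cells:
            self.cells[name].append(code)
        else:
            self.cells[name] = [code]  # assumption: without merge, the cell replaces the function
        return self.source(name)

    def source(self, name):
        body = textwrap.indent("\n".join(self.cells[name]), " " * 4)
        return f"def {name}():\n{body}\n"


builder = FunctionBuilder()
first = builder.add_cell(
    "analyze", "x = [1, 2, 3]\ny = [100, 200, 300]\nz = [u+v for u,v in zip(x,y)]"
)
merged = builder.add_cell("analyze", "product = [u*v for u, v in zip(x,y)]", merge=True)
print(merged)
```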

## Test functions

By passing the flag `--test` we can indicate that the logic in the cell
is dedicated to testing other functions in the notebook. The test
function is defined with the well-known `pytest` library in mind as the
test engine.

This has the following consequences:

- The analysis of dependencies is not associated with variables found in
  other cells.
- Test functions do not appear in the overall pipeline.
- The data variables used by the test function can be defined in separate
  test data cells, which in turn are converted to functions. These
  functions are called at the beginning of the test cell.

Let’s see an example:

``` python
a = 5
b = 3
c = 6
d = 7
```

``` python
add_all(d, a, b, c)
```

    (12, 10, 13)

``` python
# test function add_all
assert add_all(d, a, b, c)==(12, 10, 13)
```

``` python
```

    def test_add_all():
        b,c,a,d = test_input_add_all()
        # test function add_all
        assert add_all(d, a, b, c)==(12, 10, 13)

``` python
```

    def test_input_add_all(test=False):
        a = 5
        b = 3
        c = 6
        d = 7
        return b,c,a,d

Test functions are written in a separate test module, with the prefix
`test_`:

``` python
!ls ../tests
```

    index.ipynb  test_example.py

## Imports

In order to include libraries in our Python module, we can use the magic
`imports`. Those will be written at the beginning of the module:

``` python
import pandas as pd
```

Imports can be indicated separately for the test module by passing the
flag `--test`:

``` python
import matplotlib.pyplot as plt
```

## Defined functions

Cells can also contain functions that are already defined, with their
own signature and return values. The only caveat is that, if we want the
function to be executed, the variables in the argument list need to be
created outside of the function. Otherwise we need to pass the flag
`--norun` to avoid errors:

``` python
def myfunc (x, y, a=1, b=3):
    print ('hello', a, b)
    c = a+b
    return c
```

Although the internal code of the function is not executed, it is still
parsed into an AST. This allows nbmodular to provide very tentative
*warnings* about names that are not found in the argument list:

``` python
def other_func (x, y):
    print ('hello', a, b)
    c = a+b
    return c
```

    Detected the following previous variables that are not in the argument list: ['b', 'a']
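
A warning of this kind can be approximated with the standard `ast` module by collecting the names a function reads and removing its parameters, the names it assigns, and the builtins. This is a simplified sketch with a hypothetical helper name; it requires Python 3.8 or later (for `posonlyargs`) and does not claim to match nbmodular's own checks.

```python
import ast
import builtins


def names_missing_from_arguments(function_source):
    """Names read in a function that are neither arguments, locals, nor builtins."""
    func = ast.parse(function_source).body[0]
    args = func.args
    params = {arg.arg for arg in args.posonlyargs + args.args + args.kwonlyargs}
    loaded, stored = set(), set()
    for node in ast.walk(func):
        if isinstance(node, ast.Name):
            (stored if isinstance(node.ctx, ast.Store) else loaded).add(node.id)
    return sorted(n for n in loaded - stored - params if not hasattr(builtins, n))


source = '''
def other_func (x, y):
    print ('hello', a, b)
    c = a+b
    return c
'''
missing = names_missing_from_arguments(source)
print(missing)  # ['a', 'b']
```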

Let’s do the same but running the function:

``` python
a=1
b=3
```

``` python
def myfunc (x, y, a=1, b=3):
    print ('hello', a, b)
    c = a+b
    return c
```

    hello 1 3

``` python
myfunc (10, 20)
```

    hello 1 3

    4

``` python
myfunc_info = %function_info myfunc
```

``` python
myfunc_info
```

    FunctionProcessor with name myfunc, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
        Arguments: ['x', 'y', 'a', 'b']
        Output: ['c']
        Locals: dict_keys(['c'])

``` python
myfunc_info.c
```

    4
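
Attribute access such as `myfunc_info.c` can be supported by an object that falls back to a dictionary of stored values when normal attribute lookup fails. The class below is a hypothetical stand-in that illustrates the pattern; it is not nbmodular's `FunctionProcessor`.

```python
class FunctionInfo:
    """Expose stored local variables as attributes (illustrative stand-in)."""

    def __init__(self, name, current_values):
        self.name = name
        self.current_values = current_values

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails
        try:
            return self.__dict__["current_values"][attr]
        except KeyError:
            raise AttributeError(attr) from None


info = FunctionInfo("myfunc", {"c": 4})
print(info.name, info.c)
```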

## Storing local variables in memory

By default, when we run a cell function its local variables are stored
in a dictionary called `current_values`:

``` python
my_new_local = 3
my_other_new_local = 4
```

The stored variables can be accessed by calling the magic
`function_info`:

``` python
my_new_function_info = %function_info my_new_function
```

``` python
my_new_function_info.current_values
```

    {'my_new_local': 3, 'my_other_new_local': 4}
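
Conceptually, such a dictionary can be obtained by running the code of the cell in a namespace and keeping the names that are new or rebound afterwards. The sketch below shows this with `exec`; `run_and_capture` is a hypothetical helper and the way nbmodular captures values may differ.

```python
def run_and_capture(code, previous_values=None):
    """Run cell code and return the variables it created or rebound."""
    namespace = dict(previous_values or {})
    before = dict(namespace)
    exec(code, namespace)  # single namespace acting as the notebook's global scope
    namespace.pop("__builtins__", None)  # added by exec, not a user variable
    return {k: v for k, v in namespace.items() if k not in before or before[k] is not v}


current_values = run_and_capture("my_new_local = 3\nmy_other_new_local = 4")
print(current_values)
```

Keeping references to every local can be memory-consuming for large objects, which is why storing them is optional.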

This default behaviour can be overridden by passing the flag
`--not-store`:

``` python
my_second_variable = 100
my_second_other_variable = 200
```

``` python
my_second_new_function_info = %function_info my_second_new_function
```

``` python
my_second_new_function_info.current_values
```

    {}

## (Un)packing Bunch I/O

``` python
from sklearn.utils import Bunch
```

``` python
x = Bunch (a=1, b=2)
```

``` python
c = 3
a = 4
```

``` python
```

    def bunch_processor(x, day):
        a = x["a"]
        b = x["b"]
        c = 3
        a = 4
        x["a"] = a
        x["c"] = c
        x["day"] = day
        return x

## Function’s info object holding local variables

``` python
df = pd.DataFrame (dict(Year=[1,2,3], Month=[1,2,3], Day=[1,2,3]))
fy = '2023'
```

``` python
def days (df, fy, x=1, /, y=3, *, n=4):
    df_group = df.groupby(['Year','Month']).agg({'Day': lambda x: len (x)})
    df_group = df.reset_index()
    print ('other args: fy', fy, 'x', x, 'y', y)
    return df_group
```

    other args: fy 2023 x 1 y 3
    Stored the following local variables in the days current_values dictionary: ['df_group']
    Detected the following previous variables that are not in the argument list: ['x', 'df', 'fy']

An info object named `<function_name>_info` is created in memory, and
can be used to access the local variables:

``` python
days_info.df_group
```

       index  Year  Month  Day
    0      0     1      1    1
    1      1     2      2    2
    2      2     3      3    3

There is more information in this object: previous variables, code, etc.

``` python
days_info.current_values
```

    {'df_group':    index  Year  Month  Day
     0      0     1      1    1
     1      1     2      2    2
     2      2     3      3    3}

``` python
days_info
```

    FunctionProcessor with name days, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'not_run', 'previous_values', 'current_values', 'returns_dict', 'returns_bunch', 'unpack_bunch', 'include_input', 'exclude_input', 'include_output', 'exclude_output', 'store_locals_in_disk', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx'])
        Arguments: ['df', 'fy', 'x', 'y']
        Output: ['df_group']
        Locals: dict_keys(['df_group'])

The function can also be called directly:

``` python
days (df*100, 100, x=4)
```

    other args: fy 100 x 4 y 3

       index  Year  Month  Day
    0      0   100    100  100
    1      1   200    200  200
    2      2   300    300  300
