Metadata-Version: 2.1
Name: nbmodular
Version: 0.0.5
Summary: Convert notebooks to modular code
Home-page: https://github.com/JaumeAmoresDS/nbmodular
Author: Jaume Amores
Author-email: jaume.dsdev@gmail.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

nbmodular
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Convert data science notebooks with poor modularity to fully modular
notebooks that are automatically exported as python modules.

## Motivation

In data science, it is common to develop quickly and experimentally in
notebooks, with little regard for software engineering practices and
modularity. It can be challenging to start working on someone else’s
notebooks when the code is not split into separate functions and a great
deal of it is duplicated across notebooks. This makes it difficult to
understand the logic as semantically separate units, to see the
commonalities and differences between notebooks, and to extend,
generalize, and configure the current solution.

## Objectives

`nbmodular` is a library that helps convert the cells of a notebook
into separate functions with clear dependencies in terms of inputs and
outputs. This is done through a combination of tools which
semi-automatically infer the data flow in the code, based on mild
assumptions about its structure. It also helps test the current logic
and compare it against the modularized solution, to make sure that the
refactored code is equivalent to the original.
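
As a rough illustration of this kind of data-flow inference, the sketch
below uses Python's standard `ast` module. It is not `nbmodular`'s
implementation: the helper names `cell_variables` and `infer_io` are
hypothetical, and it assumes a simple rule where a variable is an input
if it also appears in an earlier cell, and an output if it also appears
in a later cell.

``` python
import ast
import builtins


def cell_variables(code):
    """Names used in a cell, ignoring builtins such as `print`."""
    names = {n.id for n in ast.walk(ast.parse(code)) if isinstance(n, ast.Name)}
    return names - set(dir(builtins))


def infer_io(cells):
    """Infer inputs/outputs for an ordered list of (function_name, source) pairs."""
    variables = [cell_variables(code) for _, code in cells]
    result = {}
    for i, (name, _) in enumerate(cells):
        before = set().union(*variables[:i])      # names seen in earlier cells
        after = set().union(*variables[i + 1:])   # names seen in later cells
        result[name] = {
            "inputs": sorted(variables[i] & before),
            "outputs": sorted(variables[i] & after),
        }
    return result


# Hypothetical example cells, mirroring the ones used in this README.
cells = [
    ("get_initial_values", "a = 2\nb = 3\nc = a+b\nprint (a+b)"),
    ("get_d", "d = 10"),
    ("add_all", "a = a + d\nb = b + d\nc = c + d"),
    ("print_all", "print (a, b, c, d)"),
]

for name, io in infer_io(cells).items():
    print(name, io)
```

Under this simple rule, `d` is reported as an output of `add_all` even
though that cell never modifies it, because it is used again in a later
cell.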

## Features

- Convert cells to functions.
- The logic of a single function can be written across multiple cells.
- Optional: processed cells can continue to operate as cells or be only
  used as functions from the moment they are converted.
- Create an additional pipeline function that provides the data-flow
  from the first to the last function call in the notebook.
- Write all the notebook functions to a separate python module.
- Compare the result of the pipeline with the result of running the
  original notebook.
- Converted functions act as nodes in a dependency graph. These nodes
  can optionally hold the values of local variables for inspection
  outside of the function. This is similar to having a single global
  scope, which is the original situation. Since this is
  memory-consuming, it is optional and may not be the default.
- Optional: Once we are able to construct a graph, we may be able to
  draw it or show it in text, and pass it to DAG processors that can run
  functions sequentially or in parallel.
- Persist the inputs and outputs of functions, so that we may decide to
  reuse previous results without running the whole notebook.
- Optional: if we have the dependency graph and persisted inputs /
  outputs, we may decide to only run those cells that are predecessors
  of the current one, i.e., the ones that provide the inputs needed by
  the current cell.
- Optional: if we associate a hash code to input data, we may only run
  the cells when the input data changes. Similarly, if we associate a
  hash code with AST-converted function code, we may only run those
  cells whose code has been updated.
- Optional: have a mechanism for indicating test examples that go into
  different test python files.
- Optional: the output of a test cell can be used for assertions, where
  we require that the current output is the same as the original one.
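
The idea of re-running only cells whose code has changed can be
sketched with the standard `ast` and `hashlib` modules. This is an
assumption about one possible approach, not the library's code; the
helpers `code_hash` and `needs_rerun` and the `cache` dictionary are
hypothetical.

``` python
import ast
import hashlib


def code_hash(code):
    """Hash the AST dump, so formatting and comments do not affect the result."""
    tree_dump = ast.dump(ast.parse(code))
    return hashlib.sha256(tree_dump.encode("utf-8")).hexdigest()


def needs_rerun(name, code, cache):
    """Return True if the code stored under `name` changed; update the cache."""
    new_hash = code_hash(code)
    changed = cache.get(name) != new_hash
    cache[name] = new_hash
    return changed


cache = {}
print(needs_rerun("get_d", "d = 10", cache))            # first time seen
print(needs_rerun("get_d", "d=10  # same code", cache))  # only formatting differs
print(needs_rerun("get_d", "d = 11", cache))            # code actually changed
```

Hashing the AST dump rather than the raw text means that whitespace or
comment edits alone would not trigger a re-run.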

## Roadmap

- [ ] Convert cell code into functions:
  - [x] Inputs are those variables detected in current cell and also
    detected in previous cells. This solution requires that created
    variables have unique names across the notebook. However, even if a
    new variable with the same name is defined inside the cell, the
    resulting function is still correct.
  - Outputs are, at this moment, all the variables detected in the
    current cell that are also detected in subsequent cells.
- Filter out outputs:
  - Variables detected in current cell, and also detected in previous
    cells, might not be needed as outputs of the current cell, if the
    current cell doesn’t modify those variables. To detect potential
    modifications:
    - AST:
      - If variable appears only on the right of assign statements or in
        if statements.
      - If it appears only as argument of functions which we know don’t
        modify the variable, such as `print`.
    - Comparing variable values before and after cell:
      - Good for small variables where doing a deep copy is not
        computationally expensive.
    - Using type checker:
      - Making the variable `Final` and using mypy or other type checker
        to see if it is modified in the code.
  - Provide hints:
    - Variables that come from other cells might not be needed as
      output. The remaining are most probably needed.
    - Variables that are modified are clearly needed.

## Install

``` sh
pip install nbmodular
```

## Usage

Load the IPython extension.

This allows us to use the following magic commands, among others:

- `function <name_of_function_to_define>`
- `print <name_of_previous_function>`
- `function_info <name_of_previous_function>`
- `print_pipeline`

Let’s go through them one by one.

### function

The magic command `function` allows us to:

- Run the code in the cell normally, and at the same time detect its
  input and output dependencies and define a function with this input
  and output:

``` python
%%function get_initial_values
a = 2
b = 3
c = a+b
print (a+b)
```

    5

The code in the previous cell runs as it normally would and, at the
same time, defines a function named `get_initial_values`, which we can
show with the magic command `print`:

``` python
%print get_initial_values
```

    def get_initial_values():
        a = 2
        b = 3
        c = a+b
        print (a+b)
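
A minimal sketch of how a cell's source could be wrapped into such a
function is shown below. It is not the library's implementation:
`cell_to_function` is a hypothetical helper, and the inputs and outputs
are passed in explicitly rather than detected.

``` python
import textwrap


def cell_to_function(name, code, inputs=(), outputs=()):
    """Build the source of a function whose body is the cell's code."""
    body = textwrap.indent(code.strip(), "    ")
    source = f"def {name}({', '.join(inputs)}):\n{body}"
    if outputs:
        # Outputs are only known once later cells have been inspected.
        source += f"\n    return {','.join(outputs)}"
    return source


cell = "a = 2\nb = 3\nc = a+b\nprint (a+b)"
print(cell_to_function("get_initial_values", cell))
print(cell_to_function("get_initial_values", cell, outputs=["a", "b", "c"]))
```

The second call shows how the same cell gains a `return` statement once
its outputs are known.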

This function is defined in the notebook space, so we can invoke it:

``` python
get_initial_values ()
```

    5

The inputs and outputs of the function change dynamically every time we
add a new function cell. For example, if we add a new function `get_d`:

``` python
%%function get_d
d = 10
```

``` python
%print get_d
```

    def get_d():
        d = 10

And then a function `add_all` that depends on the previous two functions:

``` python
%%function add_all
a = a + d
b = b + d
c = c + d
```

``` python
%print add_all
```

    def add_all(a, b, c, d):
        a = a + d
        b = b + d
        c = c + d

We can see that the outputs of `get_initial_values` and `get_d` change
as needed. We can look at all the functions defined so far by using
`print all`:

``` python
%print all
```

    def get_initial_values():
        a = 2
        b = 3
        c = a+b
        print (a+b)
        return a,b,c

    def get_d():
        d = 10
        return d

    def add_all(a, b, c, d):
        a = a + d
        b = b + d
        c = c + d

Similarly, the outputs of the last function `add_all` change after we
add another function that depends on it:

``` python
%%function print_all
print (a, b, c, d)
```

    12 13 15 10

### print

We can see each of the defined functions with `print my_function`, and
list all of them with `print all`:

``` python
%print all
```

    def get_initial_values():
        a = 2
        b = 3
        c = a+b
        print (a+b)
        return a,b,c

    def get_d():
        d = 10
        return d

    def add_all(a, b, c, d):
        a = a + d
        b = b + d
        c = c + d
        return a,b,c,d

    def print_all(a, b, c, d):
        print (a, b, c, d)

### print_pipeline

As we add functions to the notebook, a pipeline function is defined. We
can print this pipeline with the magic `print_pipeline`:

``` python
%print_pipeline
```

    def index_pipeline ():
        a, b, c = get_initial_values ()
        d = get_d ()
        a, b, c, d = add_all (a, b, c, d)
        print_all (a, b, c, d)

This shows the data flow in terms of inputs and outputs.
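
To illustrate how a pipeline like this could be assembled from the
inputs and outputs of each function, here is a hedged sketch. The
helper `build_pipeline_source` and the tuple layout are assumptions
made for this example, not part of `nbmodular`'s API.

``` python
def build_pipeline_source(pipeline_name, functions):
    """functions: ordered list of (name, inputs, outputs) tuples."""
    lines = [f"def {pipeline_name} ():"]
    for name, inputs, outputs in functions:
        call = f"{name} ({', '.join(inputs)})"
        if outputs:
            # Bind the returned values to the names used by later calls.
            call = f"{', '.join(outputs)} = {call}"
        lines.append(f"    {call}")
    return "\n".join(lines)


functions = [
    ("get_initial_values", [], ["a", "b", "c"]),
    ("get_d", [], ["d"]),
    ("add_all", ["a", "b", "c", "d"], ["a", "b", "c", "d"]),
    ("print_all", ["a", "b", "c", "d"], []),
]
print(build_pipeline_source("index_pipeline", functions))
```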

We can also run the pipeline:

``` python
index_pipeline()
```

    5

### function_info

We can get access to many of the details of each of the defined
functions by calling `function_info` on a given function name:

``` python
get_initial_values_info = %function_info get_initial_values
```

This allows us to see:

- The name and value (at the time of running) of the local variables,
  arguments and results from the function:

``` python
get_initial_values_info.arguments
```

    []

``` python
get_initial_values_info.values_here
```

    {'a': 2, 'c': 5, 'b': 3}

``` python
get_initial_values_info.return_values
```

    ['a', 'b', 'c']

We can also inspect the original code written in the cell…

``` python
print (get_initial_values_info.original_code)
```

    a = 2
    b = 3
    c = a+b
    print (a+b)

… the code of the defined function:

``` python
print (get_initial_values_info.code)
```

    def get_initial_values():
        a = 2
        b = 3
        c = a+b
        print (a+b)
        return a,b,c

… and the AST trees:

``` python
print (get_initial_values_info.get_ast (code=get_initial_values_info.original_code))
```

    Module(
      body=[
        Assign(
          targets=[
            Name(id='a', ctx=Store())],
          value=Constant(value=2)),
        Assign(
          targets=[
            Name(id='b', ctx=Store())],
          value=Constant(value=3)),
        Assign(
          targets=[
            Name(id='c', ctx=Store())],
          value=BinOp(
            left=Name(id='a', ctx=Load()),
            op=Add(),
            right=Name(id='b', ctx=Load()))),
        Expr(
          value=Call(
            func=Name(id='print', ctx=Load()),
            args=[
              BinOp(
                left=Name(id='a', ctx=Load()),
                op=Add(),
                right=Name(id='b', ctx=Load()))],
            keywords=[]))],
      type_ignores=[])
    None

``` python
print (get_initial_values_info.get_ast (code=get_initial_values_info.code))
```

    Module(
      body=[
        FunctionDef(
          name='get_initial_values',
          args=arguments(
            posonlyargs=[],
            args=[],
            kwonlyargs=[],
            kw_defaults=[],
            defaults=[]),
          body=[
            Assign(
              targets=[
                Name(id='a', ctx=Store())],
              value=Constant(value=2)),
            Assign(
              targets=[
                Name(id='b', ctx=Store())],
              value=Constant(value=3)),
            Assign(
              targets=[
                Name(id='c', ctx=Store())],
              value=BinOp(
                left=Name(id='a', ctx=Load()),
                op=Add(),
                right=Name(id='b', ctx=Load()))),
            Expr(
              value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                  BinOp(
                    left=Name(id='a', ctx=Load()),
                    op=Add(),
                    right=Name(id='b', ctx=Load()))],
                keywords=[])),
            Return(
              value=Tuple(
                elts=[
                  Name(id='a', ctx=Load()),
                  Name(id='b', ctx=Load()),
                  Name(id='c', ctx=Load())],
                ctx=Load()))],
          decorator_list=[])],
      type_ignores=[])
    None
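
Dumps in this style can be produced with Python's standard `ast`
module. The sketch below assumes, without confirmation from the source,
that `get_ast` does something similar; `dump_ast` is a hypothetical
helper, and the `indent` argument of `ast.dump` requires Python 3.9 or
later.

``` python
import ast


def dump_ast(code):
    """Parse source code and return an indented textual dump of its AST."""
    return ast.dump(ast.parse(code), indent=2)


# The exact fields shown can vary between Python versions.
print(dump_ast("a = 2\nb = 3\nc = a+b"))
```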

Now, we can define another function in a cell that uses variables from
the previous function.

### cell_processor

This magic gives us access to the `CellProcessor` class that manages
the logic for running the above magic commands, which can come in handy:

``` python
cell_processor = %cell_processor
```
