Metadata-Version: 2.1
Name: prodmodel
Version: 0.1.2
Summary: Build data science pipelines and models
Home-page: https://github.com/prodmodel/prodmodel
Author: Gergely Svigruha
Author-email: gergely.svigruha@prodmodel.com
License: UNKNOWN
Description: # Prodmodel
        
        Prodmodel is a [build system](https://en.wikipedia.org/wiki/List_of_build_automation_software) for data science pipelines.
        Users, testers, and contributors are welcome! Please don't forget to **star the project** if you like it.
        
        <h3 align="center">
          <a href="#why">Why</a>
          <span> · </span>
          <a href="#concepts">Concepts</a>
          <span> · </span>
          <a href="#installation">Installation</a>
          <span> · </span>
          <a href="#usage">Usage</a>
          <span> · </span>
          <a href="#contributing">Contributing</a>
          <span> · </span>
          <a href="#license">License</a>
        </h3>
        
        ## Why
        
         * Performance. No need to rerun things: everything is cached. This also makes it easy to switch between and compare multiple versions.
         * Easy debugging. Ever lost track of which piece of code or data was used for some part of the pipeline? Prodmodel tracks and version controls
           all dependencies for you.
         * Deploy to production. Models are more than just a file. Prodmodel makes sure that the correct version of models, label encoders,
           feature transformation code and data files are all packaged together.
        
        ## Concepts
        
        A build system is a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of `rules` (transformations), `inputs` and `targets`.
        In Prodmodel `inputs` can be
         * data,
         * Python code,
         * and configuration.
        
        A `rule` transforms any of the above into an output (which can in turn be depended on by other rules). Rules therefore need to be
        re-executed (and their outputs re-created) if any of their dependencies change. Prodmodel keeps track of all of these dependencies.
        
        The outputs of the `rules` are `targets`. Every `target` corresponds to an output (e.g. a model or a dataset). These outputs
        are cached and version controlled.
        
        Prodmodel therefore ensures
         * correctness, by executing all code (e.g. feature transformation, model building, tests) which can potentially be affected by a change, and
         * performance, by executing only the necessary code, saving time compared to rerunning the whole pipeline.
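        
        The caching idea above can be illustrated with a toy sketch. This is not Prodmodel's implementation: `fingerprint`, `run_rule`, and the in-memory `_cache` are hypothetical names, and only input values are hashed here (a real build system would also need to fingerprint the code and data files).
        
        ```python
        import hashlib
        import json
        
        _cache = {}  # fingerprint -> cached output (hypothetical in-memory store)
        calls = []   # records which rules were actually executed
        
        def fingerprint(rule_name, inputs):
            """Hash the rule name together with its inputs."""
            payload = json.dumps([rule_name, inputs], sort_keys=True).encode('utf-8')
            return hashlib.sha256(payload).hexdigest()
        
        def run_rule(rule_name, fn, inputs):
            """Execute fn only if this exact combination of inputs has not been seen."""
            key = fingerprint(rule_name, inputs)
            if key not in _cache:
                calls.append(rule_name)
                _cache[key] = fn(**inputs)
            return _cache[key]
        
        def scale(values, factor):
            return [v * factor for v in values]
        
        first = run_rule('scale', scale, {'values': [1, 2, 3], 'factor': 2})
        second = run_rule('scale', scale, {'values': [1, 2, 3], 'factor': 2})  # unchanged inputs: cache hit
        third = run_rule('scale', scale, {'values': [1, 2, 3], 'factor': 3})   # changed input: re-executed
        ```
        
        In this sketch the second call returns the cached result without running `scale` again, while the third call re-executes because one of its inputs changed.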
        
        ### Rules
        
        Every rule is a statically typed function, where the inputs are targets, data, or configs. The execution of
        a rule outputs some data (e.g. a different feature set or a model), which can be used in other rules.
        
        In order to use Prodmodel your code has to be structured as functions which the rules can call into.
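        
        As a rough illustration of what "structured as functions" could look like, the sketch below shows a plain, type-annotated function that a rule might call into. The name `normalize_feature` and its signature are hypothetical; the arguments Prodmodel actually passes to your functions depend on the rule, so check the API documentation.
        
        ```python
        from typing import Dict, List
        
        def normalize_feature(rows: List[Dict[str, float]], column: str) -> List[Dict[str, float]]:
            """Return copies of the rows with `column` rescaled to the [0, 1] range."""
            values = [row[column] for row in rows]
            low, high = min(values), max(values)
            span = (high - low) or 1.0  # avoid division by zero for constant columns
            return [{**row, column: (row[column] - low) / span} for row in rows]
        
        # Stand-alone usage, independent of any build file.
        rows = [{'age': 20.0}, {'age': 30.0}, {'age': 40.0}]
        normalized = normalize_feature(rows, 'age')
        ```
        
        Keeping such functions free of side effects and hidden global state makes their outputs depend only on their inputs, which is what makes caching them safe.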
        
        ### Targets
        
        Targets are created by rule functions. Targets can be executed to generate output files. `IterableDataTarget` is a special target
        which can be used as an iterable of `dicts` to make iterating over datasets easier. Regular `DataTargets` can represent any
        Python object.
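        
        To show the "iterable of `dicts`" shape without depending on Prodmodel itself, the sketch below uses the standard library's `csv.DictReader` as a stand-in. Code written against this shape only assumes that each item is a dict keyed by column name; how an actual `IterableDataTarget` is obtained is not shown here.
        
        ```python
        import csv
        import io
        
        # Stand-in for an iterable of dicts; the inline CSV content is made up for illustration.
        raw = io.StringIO('x,y\n1,2\n3,4\n')
        rows = csv.DictReader(raw)
        
        total_x = 0
        for row in rows:
            total_x += int(row['x'])  # each row is a dict keyed by column name
        ```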
        
        ## Installation
        
        Prodmodel requires Python 3.6 or later. Use [pip](https://pip.pypa.io/en/stable/) to install prodmodel.
        
        ```bash
        pip install prodmodel --user
        ```
        
        ## Usage
        
        Create a `build.py` file in your data science folder. The build file contains references to your inputs and the build rules you can execute.
        
        ```python
        import rules
        
        csv_data = rules.data_source(file='data.csv', type='csv', dtypes={...})
        
        my_model = rules.transform(objects={'data': csv_data}, file='kmeans.py', fn='compute_kmeans')
        ```
        
        Now you can build your model by running `prodmodel my_model` from the directory of `build.py`,
        or `prodmodel <path_to_my_directory>:my_model` from any directory.
        
        See the complete [example project](https://github.com/prodmodel/prodmodel/tree/master/example) for a fuller build file.
        
        The complete list of build rules can be found [here](https://github.com/prodmodel/prodmodel/blob/master/doc/api_doc.md).
        
        ### Arguments
        
         * `--force_external`: Some data sources are remote (e.g. an SQL server), therefore tracking changes is not always feasible.
           This argument gives the user manual control over when to reload these data sources.
         * `--cache_data`: Cache local data files if changed. This can be useful for debugging / reproducibility by making sure every
           data source used for a specific build is saved.
        
        ## Contributing
        Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
        
        ## License
        [Apache 2.0](https://github.com/prodmodel/prodmodel/blob/master/LICENCE)
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.6
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Description-Content-Type: text/markdown
