Metadata-Version: 1.1
Name: pytreecat
Version: 0.1.5
Summary: A Bayesian latent tree model of multivariate multinomial data
Home-page: https://github.com/posterior/treecat
Author: Fritz Obermeyer
Author-email: fritz.obermeyer@gmail.com
License: Apache License 2.0
Description: .. figure:: doc/cartoon.png
           :alt: A Bayesian latent tree model
        
           A Bayesian latent tree model
        
        TreeCat
        =======
        
        |Build Status| |Latest Version| |DOI|
        
        Intended Use
        ------------
        
        TreeCat is an inference engine intended to power higher-level machine
        learning tools. TreeCat is appropriate for analyzing medium-sized
        tabular data with categorical and ordinal values, possibly with missing
        observations.
        
        +--------------------------+--------------------------+
        |                          | TreeCat supports         |
        +==========================+==========================+
        | Feature Types            | categorical, ordinal     |
        +--------------------------+--------------------------+
        | # Rows (n)               | 1000-100K                |
        +--------------------------+--------------------------+
        | # Features (p)           | 10-1000                  |
        +--------------------------+--------------------------+
        | # Cells (n × p)          | 10K-10M                  |
        +--------------------------+--------------------------+
        | # Categories             | 2-10ish                  |
        +--------------------------+--------------------------+
        | Max Ordinal              | 10ish                    |
        +--------------------------+--------------------------+
        | Missing observations?    | yes                      |
        +--------------------------+--------------------------+
        | Repeated observations?   | yes                      |
        +--------------------------+--------------------------+
        | Sparse data?             | no, use something else   |
        +--------------------------+--------------------------+
        | Unsupervised             | yes                      |
        +--------------------------+--------------------------+
        | Semisupervised           | yes                      |
        +--------------------------+--------------------------+
        | Supervised               | no, use something else   |
        +--------------------------+--------------------------+
        
        Installing
        ----------
        
        First install ``numba`` (conda makes this easy), then run:
        
        .. code:: sh
        
            $ pip install pytreecat
        
        Quick Start
        -----------
        
        1. Format your data as a
           `data.csv <treecat/testdata/tiny_data.csv>`__ file with a header
           row. It's fine to include extra columns that won't be used.
        
           Contents of `data.csv <treecat/testdata/tiny_data.csv>`__:
        
           +-------------+------------+----------+----------+
           | title       | genre      | decade   | rating   |
           +=============+============+==========+==========+
           | vertigo     | thriller   | 1950s    | 5        |
           +-------------+------------+----------+----------+
           | up          | family     | 2000s    | 3        |
           +-------------+------------+----------+----------+
           | desk set    | comedy     | 1950s    | 4        |
           +-------------+------------+----------+----------+
           | santapaws   | family     | 2010s    |          |
           +-------------+------------+----------+----------+
           | ...         | ...        | ...      | ...      |
           +-------------+------------+----------+----------+
        
        2. Generate two schema files,
           `types.csv <treecat/testdata/tiny_types.csv>`__ and
           `values.csv <treecat/testdata/tiny_values.csv>`__, using TreeCat's
           ``guess-schema`` command:
        
           .. code:: sh
        
               $ treecat guess-schema data.csv types.csv values.csv
        
           You can manually fix any incorrectly guessed feature types, or
           add/remove feature values. TreeCat ignores features with an empty type
           field.
        
           Contents of `types.csv <treecat/testdata/tiny_types.csv>`__:
        
           +----------+---------------+---------+----------+--------------+
           | name     | type          | total   | unique   | singletons   |
           +==========+===============+=========+==========+==============+
           | title    |               | 11      | 11       | 11           |
           +----------+---------------+---------+----------+--------------+
           | genre    | categorical   | 11      | 7        | 4            |
           +----------+---------------+---------+----------+--------------+
           | decade   | categorical   | 11      | 6        | 3            |
           +----------+---------------+---------+----------+--------------+
           | rating   | ordinal       | 10      | 5        | 2            |
           +----------+---------------+---------+----------+--------------+
        
           Contents of `values.csv <treecat/testdata/tiny_values.csv>`__:
        
           +---------+-----------+---------+
           | name    | value     | count   |
           +=========+===========+=========+
           | title   | \_OTHER   | 11      |
           +---------+-----------+---------+
           | genre   | \_OTHER   | 11      |
           +---------+-----------+---------+
           | genre   | drama     | 3       |
           +---------+-----------+---------+
           | genre   | family    | 2       |
           +---------+-----------+---------+
           | ...     | ...       | ...     |
           +---------+-----------+---------+
        
        3. Import your CSV files into TreeCat's internal format. We'll call our
           dataset ``dataset.pkz`` (a gzipped pickle file).
        
           .. code:: sh
        
               $ treecat import-data data.csv types.csv values.csv dataset.pkz
        
        4. Train an ensemble model on your dataset. This typically takes
           ~15 minutes for a 1M cell dataset.
        
           .. code:: sh
        
               $ treecat train dataset.pkz ensemble.pkz
        
        5. Load your trained model into a server:
        
           .. code:: python
        
               from treecat.serving import EnsembleServer
        
               server = EnsembleServer('ensemble.pkz')
        
        6. Run queries against the server. For example, we can compute marginals
           (``V`` is the number of features in the model)
        
           .. code:: python
        
               import numpy as np
        
               server.sample(100, np.ones(V)).mean(axis=0)
        
           or compute a latent correlation matrix
        
           .. code:: python
        
               print(server.latent_correlation())
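        
        The ``total``, ``unique`` and ``singletons`` columns shown for
        ``types.csv`` in step 2 read like per-column counts of non-empty cells,
        distinct values, and values seen exactly once. The sketch below computes
        those statistics under that assumed interpretation. It is not TreeCat's
        own code, and ``column_stats`` is a hypothetical helper, but it can be
        useful for sanity-checking a guessed schema by hand.
        
        .. code:: python
        
            import csv
            import io
            from collections import Counter
        
            def column_stats(csv_text):
                """Return {column: (total, unique, singletons)} for CSV text."""
                reader = csv.DictReader(io.StringIO(csv_text))
                names = reader.fieldnames
                counters = {name: Counter() for name in names}
                for row in reader:
                    for name in names:
                        value = row.get(name)
                        if value:  # Empty cells count as missing observations.
                            counters[name][value] += 1
                stats = {}
                for name, counter in counters.items():
                    total = sum(counter.values())
                    unique = len(counter)
                    singletons = sum(1 for c in counter.values() if c == 1)
                    stats[name] = (total, unique, singletons)
                return stats
        
            # The four example rows shown in step 1.
            example = (
                "title,genre,decade,rating\n"
                "vertigo,thriller,1950s,5\n"
                "up,family,2000s,3\n"
                "desk set,comedy,1950s,4\n"
                "santapaws,family,2010s,\n"
            )
            stats = column_stats(example)
        
        A column such as ``title``, where every value is a singleton, carries
        no shared structure, which is consistent with its type field being left
        empty in the example schema.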
        
        The Server Interface
        --------------------
        
        TreeCat's
        `server <https://github.com/fritzo/treecat/blob/master/treecat/serving.py>`__
        interface currently supports the two basic Bayesian operations:
        
        -  ``server.sample(N, counts, data=None)`` draws N samples from the
           joint posterior distribution, optionally conditioned on ``data``.
        
        -  ``server.logprob(data)`` computes the posterior log probability of
           ``data``.
        
        TreeCat's internal data representation is multinomial, and thus supports
        missing and repeated measurements, and even addition of data. For
        example, to compute the conditional log probability of data ``A`` given
        data ``B``, we can simply compute
        
        .. code:: py
        
            cond = server.logprob(A + B) - server.logprob(B)
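        
        To see why subtracting log probabilities of count data gives a
        conditional, here is a minimal sketch on a toy latent-class model. It
        does not use TreeCat: ``toy_logprob``, ``weights`` and ``probs`` are
        hypothetical stand-ins for a trained server, and the data are count
        vectors over a single three-category feature.
        
        .. code:: python
        
            import numpy as np
        
            # Hypothetical toy model (not TreeCat): 2 latent classes, 3 categories.
            weights = np.array([0.5, 0.5])       # P(latent class)
            probs = np.array([[0.8, 0.1, 0.1],   # P(category | class 0)
                              [0.1, 0.1, 0.8]])  # P(category | class 1)
        
            def toy_logprob(counts):
                """Log probability of a sequence of observations with these counts."""
                per_class = weights * np.prod(probs ** counts, axis=1)
                return np.log(per_class.sum())
        
            A = np.array([1, 0, 0])  # one new observation of category 0
            B = np.array([2, 0, 0])  # two earlier observations of category 0
        
            # log P(A | B) = log P(A, B) - log P(B), with counts simply added.
            cond = toy_logprob(A + B) - toy_logprob(B)
        
            # Conditionals over every possible single new observation sum to one.
            total = sum(np.exp(toy_logprob(a + B) - toy_logprob(B)) for a in np.eye(3))
        
        Because ``B`` is evidence about the shared latent class, ``cond`` is
        larger than the unconditional ``toy_logprob(A)`` in this toy example.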
        
        The Model
        ---------
        
        Let ``V`` be a set of vertices (one vertex per feature). Let ``C[v]`` be
        the dimension of the ``v``\ th feature. Let ``M`` be the number of latent
        classes per vertex. Let ``N`` be the number of datapoints. Let ``K[n,v]``
        be the number of observations of feature ``v`` in row ``n`` (e.g. 1 for a
        categorical variable, 0 for missing data, or ``k`` for an ordinal value
        with minimum 0 and maximum ``k``).
        
        TreeCat is the following generative model:
        
        .. code:: python
        
            E ~ UniformSpanningTree(V)    # An undirected tree.
            for v in V:
                Pv[v] ~ Dirichlet(size = [M], alpha = 1/2)
            for (u,v) in E:
                Pe[u,v] ~ Dirichlet(size = [M,M], alpha = 1/(2*M))
                assume(Pv[u] == sum(Pe[u,v], axis = 1))
                assume(Pv[v] == sum(Pe[u,v], axis = 0))
            for v in V:
                for i in 1:M:
                    Q[v,i] ~ Dirichlet(size = [C[v]])
            for n in 1:N:
                for v in V:
                    X[n,v] ~ Categorical(Pv[v])
                for (u,v) in E:
                    (X[n,u],X[n,v]) ~ Categorical(Pe[u,v])
                for v in V:
                    Z[n,v] ~ Multinomial(Q[v,X[n,v]], count = K[n,v])
        
        where we've avoided adding an arbitrary root to the tree, and instead
        presented the model as a manifold with overlapping variables and
        constraints.
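        
        For intuition, the following sketch forward-samples data from a
        simplified version of this model using NumPy. It makes several
        assumptions that are not part of the model above: the tree is a fixed
        chain rather than a uniform random spanning tree, the chain is rooted
        at vertex 0 so that the marginal constraints hold by construction, the
        edge factors are built from Dirichlet transition rows rather than the
        ``1/(2*M)`` prior, and ``Q`` uses a flat Dirichlet.
        
        .. code:: python
        
            import numpy as np
        
            rng = np.random.default_rng(0)
        
            V, M, N = 3, 4, 5          # features, latent classes, rows (toy sizes)
            C = [2, 3, 5]              # C[v]: number of categories of feature v
            K = np.ones((N, V), int)   # K[n, v]: observations per cell
            E = [(0, 1), (1, 2)]       # fixed chain; each edge is (parent, child)
        
            # Vertex marginals Pv and edge joints Pe, built root-first.
            Pv = {0: rng.dirichlet(np.full(M, 0.5))}
            Pe = {}
            for u, v in E:
                transition = rng.dirichlet(np.full(M, 0.5), size=M)  # P(x_v | x_u)
                Pe[u, v] = Pv[u][:, None] * transition
                Pv[v] = Pe[u, v].sum(axis=0)
        
            # Q[v][i]: distribution over the C[v] categories for latent class i.
            Q = [rng.dirichlet(np.ones(C[v]), size=M) for v in range(V)]
        
            X = np.zeros((N, V), int)                         # latent classes
            Z = [np.zeros((N, C[v]), int) for v in range(V)]  # observed counts
            for n in range(N):
                X[n, 0] = rng.choice(M, p=Pv[0])
                for u, v in E:
                    conditional = Pe[u, v][X[n, u]] / Pv[u][X[n, u]]
                    X[n, v] = rng.choice(M, p=conditional)
                for v in range(V):
                    Z[v][n] = rng.multinomial(K[n, v], Q[v][X[n, v]])
        
        Rooting the chain is only a sampling convenience here; the joint
        distribution over ``X`` is the same whichever vertex is chosen as root,
        which is why the model itself can be stated without one.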
        
        The Inference Algorithm
        -----------------------
        
        This package implements fully Bayesian MCMC inference using
        subsample-annealed Gibbs sampling. There are two pieces of latent state
        that are sampled:
        
        -  Latent classes for each row for each vertex. These are sampled by
           single-site Gibbs sampling with a linear subsample-annealing
           schedule.
        
        -  The latent tree structure is sampled by randomly removing an edge and
           replacing it. Since removing an edge splits the graph into two
           connected components, the only replacement locations that are
           feasible are those that re-connect the graph.
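        
        The feasibility condition for a tree move can be sketched in plain
        Python. The helpers ``component_of`` and ``feasible_replacements`` are
        hypothetical names, not part of TreeCat, and the sketch only enumerates
        candidate edges; how the sampler weights the candidates is not shown.
        
        .. code:: python
        
            def component_of(start, edges):
                """Return the set of vertices reachable from start along edges."""
                neighbors = {}
                for u, v in edges:
                    neighbors.setdefault(u, set()).add(v)
                    neighbors.setdefault(v, set()).add(u)
                seen = {start}
                stack = [start]
                while stack:
                    u = stack.pop()
                    for v in neighbors.get(u, ()):
                        if v not in seen:
                            seen.add(v)
                            stack.append(v)
                return seen
        
            def feasible_replacements(num_vertices, tree_edges, removed):
                """List edges that reconnect the tree after removing one edge."""
                remaining = [e for e in tree_edges if e != removed]
                side = component_of(removed[0], remaining)
                other = set(range(num_vertices)) - side
                # Only edges with one endpoint in each component are feasible.
                return sorted((min(u, v), max(u, v)) for u in side for v in other)
        
            tree = [(0, 1), (1, 2), (2, 3)]
            moves = feasible_replacements(4, tree, removed=(1, 2))
            # moves == [(0, 2), (0, 3), (1, 2), (1, 3)]
        
        Note that the removed edge itself is always among the feasible
        replacements, so a move can leave the tree unchanged.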
        
        The single-site Gibbs sampler uses dynamic programming to simultaneously
        sample the complete latent assignment vector for each row. A
        dynamic-programming program is generated each time the tree structure
        changes. This program is then interpreted by different virtual machines
        for different purposes (training the model, sampling from the posterior,
        computing log probability of the posterior). The virtual machine for
        training is JIT-compiled using numba.
        
        License
        -------
        
        Copyright (c) 2017 Fritz Obermeyer. TreeCat is licensed under the
        `Apache 2.0 License </LICENSE>`__.
        
        .. |Build Status| image:: https://travis-ci.org/posterior/treecat.svg?branch=master
           :target: https://travis-ci.org/posterior/treecat
        .. |Latest Version| image:: https://badge.fury.io/py/pytreecat.svg
           :target: https://pypi.python.org/pypi/pytreecat
        .. |DOI| image:: https://zenodo.org/badge/93913649.svg
           :target: https://zenodo.org/badge/latestdoi/93913649
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.4
Classifier: Topic :: Scientific/Engineering :: Information Analysis
