Metadata-Version: 2.1
Name: csr2transmart
Version: 0.0.17
Summary: Script to load CSR data to TranSMART
Home-page: https://github.com/thehyve/python_csr2transmart
Author: Gijs Kant
Author-email: gijs@thehyve.nl
License: MIT
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.6.0
Requires-Dist: click (<8.0,>=7.0)
Requires-Dist: transmart-loader (<1.4.0,>=1.3.1)
Requires-Dist: pydantic (<0.33,>=0.32.1)
Requires-Dist: python-dateutil (==2.8.0)
Requires-Dist: pandas (<0.26.0,>=0.25.1)
Requires-Dist: PyYAML (<5.2,>=5.1)
Provides-Extra: dev
Requires-Dist: prospector[with_pyroma] ; extra == 'dev'
Requires-Dist: yapf ; extra == 'dev'
Requires-Dist: isort ; extra == 'dev'

CSR to TranSMART loader
=======================

|Build status| |codecov| |pypi| |status| |license|

.. |Build status| image:: https://travis-ci.org/thehyve/python_csr2transmart.svg?branch=master
   :alt: Build status
   :target: https://travis-ci.org/thehyve/python_csr2transmart/branches
.. |codecov| image:: https://codecov.io/gh/thehyve/python_csr2transmart/branch/master/graph/badge.svg
   :alt: codecov
   :target: https://codecov.io/gh/thehyve/python_csr2transmart
.. |pypi| image:: https://img.shields.io/pypi/v/csr2transmart.svg
   :alt: PyPI
   :target: https://pypi.org/project/csr2transmart/
.. |status| image:: https://img.shields.io/pypi/status/csr2transmart.svg
   :alt: PyPI - Status
.. |license| image:: https://img.shields.io/pypi/l/csr2transmart.svg
   :alt: MIT license
   :target: LICENSE

This package contains a script that transforms Central Subject Registry data to a format
that can be loaded into the TranSMART_ platform,
an open source data sharing and analytics platform for translational biomedical research.

The output of the transformation is a collection of tab-separated files that can be loaded into
a TranSMART database using the transmart-copy_ tool.

.. _TranSMART: https://github.com/thehyve/transmart-core
.. _transmart-copy: https://github.com/thehyve/transmart-core/tree/dev/transmart-copy

⚠️ Note: this is a very preliminary version, still under development.
Issues can be reported at https://github.com/thehyve/python_csr2transmart/issues.


Installation and usage
**********************

To install csr2transmart, do:

.. code-block:: console

  pip install csr2transmart

or from sources:

.. code-block:: console

  git clone https://github.com/thehyve/python_csr2transmart.git
  cd python_csr2transmart
  pip install .


Data model
----------

The Central Subject Registry (CSR) data model contains individual,
diagnosis, biosource and biomaterial entities. The data model is defined
as a data class in `csr/csr.py`_.

.. _`csr/csr.py`: https://github.com/thehyve/python_csr2transmart/blob/master/csr/csr.py

Usage
------

This repository contains a number of command line tools:

* ``sources2csr``: Reads from source files and produces tab delimited CSR files.
* ``csr2transmart``: Reads CSR files and transforms the data to the TranSMART data model,
  creating files that can be imported into TranSMART using ``transmart-copy``.
* ``csr2cbioportal``: Reads CSR files and transforms the data to patient and sample files
  to be imported into cBioPortal.

``sources2csr``
~~~~~~~~~~~~~~~

.. code-block:: console

  sources2csr <input_dir> <output_dir> <config_dir>

The tool reads input files from ``<input_dir>`` and
writes CSR files in tab delimited format (one file per entity type) to
``<output_dir>``.
The output directory ``<output_dir>`` needs to be either empty or not yet existing.
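
The output directory requirement can be checked before invoking the tool.
The following is a minimal sketch, not part of this package; the helper name
``output_dir_is_usable`` is hypothetical.

.. code-block:: python

  from pathlib import Path


  def output_dir_is_usable(path: str) -> bool:
      """Return True if the path does not exist yet or is an empty directory."""
      target = Path(path)
      if not target.exists():
          return True
      # An existing path is only acceptable if it is a directory without entries.
      return target.is_dir() and not any(target.iterdir())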

The sources configuration will be read from ``<config_dir>/sources_config.json``,
a JSON file that contains the following attributes:

* ``entities``: a map from entity type name to a description of the sources for that entity type. E.g.,

  .. code-block:: json

    {
      "Individual": {
        "attributes": [
          {
            "name": "individual_id",
            "sources": [
              {
                "file": "individual.tsv",
                "column": "individual_id"
              }
            ]
          },
          {
            "name": "birth_date",
            "sources": [
              {
                "file": "individual.tsv",
                "date_format": "%d-%m-%Y"
              }
            ]
          }
        ]
      }
    }

  The entity type names have to match the entity type names in the CSR data model, and
  the attribute names should match the attribute names in the data model as well.
  The ``column`` field is optional; by default, the column name is assumed to be
  the same as the attribute name.
  For date fields, a ``date_format`` can be specified. If not specified, the format is
  assumed to be ``%Y-%m-%d`` or one of the other `date formats supported by Pydantic`_.
  If multiple input files are specified for an attribute, they are consulted in the listed order:
  data for an attribute of a specific entity is read from the next file only if
  the preceding files have no data for that attribute of that entity.

* ``codebooks``: a map from input file name to codebook file name, e.g., ``{"individual.tsv": "codebook.txt"}``.

* ``file_format``: a map from input file name to file format configuration,
  which allows configuring the delimiter character (default: ``\t``).
  E.g., ``{"individual.tsv": {"delimiter": ","}}``.

See `test_data/input_data/config/sources_config.json`_ for an example.
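
The source order for an attribute can be illustrated with a small sketch.
It is not the implementation used by ``sources2csr``: the function name
``resolve_attribute``, the in-memory layout of the parsed input and the file
``extra.tsv`` are hypothetical, and an empty string is assumed to mean "no data".

.. code-block:: python

  from typing import Dict, List, Optional

  # Hypothetical, already parsed input: file name -> entity id -> column -> value.
  SourceData = Dict[str, Dict[str, Dict[str, Optional[str]]]]


  def resolve_attribute(entity_id: str, attribute: str,
                        sources: List[dict], data: SourceData) -> Optional[str]:
      """Return the first non-empty value, visiting sources in configured order."""
      for source in sources:
          # The column name defaults to the attribute name.
          column = source.get("column", attribute)
          value = data.get(source["file"], {}).get(entity_id, {}).get(column)
          if value not in (None, ""):
              return value
      return None


  data = {
      "individual.tsv": {"P1": {"birth_date": ""}, "P2": {"birth_date": "01-02-1990"}},
      "extra.tsv": {"P1": {"dob": "03-04-1985"}},
  }
  sources = [{"file": "individual.tsv"}, {"file": "extra.tsv", "column": "dob"}]

  # P1 has no value in the first file, so the second file is used.
  first = resolve_attribute("P1", "birth_date", sources, data)
  # P2 has a value in the first file, so the second file is never consulted.
  second = resolve_attribute("P2", "birth_date", sources, data)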

Content of the codebook files has to match the following format:

*   First, a header line with a number and the column names that the codes apply to.
    The first field contains a number and the second field a space-separated list of column names, e.g., ``1\tSEX GENDER``.
*   The lines following the header start with an empty field,
    followed by ``code\tvalue`` pairs until the end of the line,
    e.g., ``\t1\tMale\t2\tFemale``.
*   A line whose first field is not empty is a new header,
    which starts the process over again.

See `<test_data/input_data/codebooks/valid_codebook.txt>`_ for a codebook file example.
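
The format described above could be parsed roughly as follows. This is an
illustrative sketch, not the parser used by this package: the function name
``parse_codebook`` and the ``STATUS`` column are hypothetical, and blank lines
are assumed to be ignorable.

.. code-block:: python

  from typing import Dict, List


  def parse_codebook(lines: List[str]) -> Dict[str, Dict[str, str]]:
      """Map each column name to its {code: value} dictionary."""
      codebook: Dict[str, Dict[str, str]] = {}
      current_columns: List[str] = []
      for raw in lines:
          line = raw.rstrip("\r\n")
          if not line.strip():
              continue  # assumption: blank lines carry no information
          fields = line.split("\t")
          if fields[0] != "":
              # Header line: a number, then space-separated column names.
              current_columns = fields[1].split() if len(fields) > 1 else []
              for column in current_columns:
                  codebook.setdefault(column, {})
          else:
              # Code line: empty first field, then alternating code and value.
              pairs = fields[1:]
              for code, value in zip(pairs[0::2], pairs[1::2]):
                  for column in current_columns:
                      codebook[column][code] = value
      return codebook


  example = [
      "1\tSEX GENDER\n",
      "\t1\tMale\t2\tFemale\n",
      "2\tSTATUS\n",
      "\tA\tAlive\n",
      "\tD\tDeceased\n",
  ]
  parsed = parse_codebook(example)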

.. _`date formats supported by Pydantic`: https://pydantic-docs.helpmanual.io/#datetime-types
.. _`test_data/input_data/config/sources_config.json`: https://github.com/thehyve/python_csr2transmart/blob/master/test_data/input_data/config/sources_config.json


``csr2transmart``
~~~~~~~~~~~~~~~~~

.. code-block:: console

  csr2transmart <input_dir> <output_dir> <config_dir>

The tool reads CSR files from ``<input_dir>`` (one file per entity type)
and transforms the CSR data to the TranSMART data model.
In addition, if there is an ``NGS`` folder inside ``<input_dir>``,
the tool will read the NGS files inside it to determine the values of additional CSR biomaterial variables.
The tool writes the output in ``transmart-copy`` format to ``<output_dir>``.
The output directory ``<output_dir>`` needs to be either empty or not yet existing.

The ontology configuration will be read from ``<config_dir>/ontology_config.json``.
See `test_data/input_data/config/ontology_config.json`_ for an example.

.. _`test_data/input_data/config/ontology_config.json`: https://github.com/thehyve/python_csr2transmart/blob/master/test_data/input_data/config/ontology_config.json


``csr2cbioportal``
~~~~~~~~~~~~~~~~~~

.. code-block:: console

  csr2cbioportal <input_dir> <ngs_dir> <output_dir>

The tool reads CSR files from ``<input_dir>`` (one file per entity type),
and NGS data (genomics data) from ``<ngs_dir>``,
transforms the CSR data to the clinical data format for cBioPortal and
writes the following data types to ``<output_dir>``:

* Clinical data 
* Mutation data
* CNA Segment data
* CNA Continuous data
* CNA Discrete data

File structure, case lists and meta files will also be added to the output folder.
See the `cBioPortal file formats`_ documentation for further details.

The output directory ``<output_dir>`` needs to be either empty or not yet existing.

.. _`cBioPortal file formats`: https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats



Python versions
---------------

This package supports Python versions 3.6 and 3.7.


Package management and dependencies
-----------------------------------

This project uses ``pip`` for installing dependencies and package management.

* Dependencies should be added to `requirements.txt`_.

.. _`requirements.txt`: https://github.com/thehyve/python_csr2transmart/blob/master/requirements.txt

Testing and code coverage
-------------------------

* Tests are in the ``tests`` folder.

* The ``tests`` folder contains tests for each of the tools and
  a test that checks whether your code conforms to the Python style guide (PEP 8) (file: ``test_lint.py``)

* The testing framework used is `PyTest <https://pytest.org>`_

* Tests can be run with ``python setup.py test``

Coding style conventions and code quality
-----------------------------------------

* Check your code style with ``prospector``

* You may need to run ``pip install .[dev]`` first, to install the required dependencies


License
*******

Copyright (c) 2019 The Hyve B.V.

The CSR to TranSMART loader is licensed under the MIT License. See the file LICENSE_.

.. _LICENSE: https://github.com/thehyve/python_csr2transmart/blob/master/LICENSE




