Metadata-Version: 2.0
Name: tap-redshift
Version: 1.0.0b1
Summary: Singer.io tap for extracting data from redshift
Home-page: UNKNOWN
Author: data.world
Author-email: help@data.world
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Dist: attrs (==17.2.0)
Requires-Dist: backoff (==1.3.2)
Requires-Dist: pendulum (==1.2.0)
Requires-Dist: psycopg2 (==2.7.3.2)
Requires-Dist: singer-python (==5.0.4)

============
tap-redshift
============


`Singer <https://singer.io>`_ tap that extracts data from a `Redshift <https://aws.amazon.com/documentation/redshift/>`_ database and produces JSON-formatted data following the Singer spec.


Usage
=====
tap-redshift assumes you have connection to redshift.

Create a configuration file
---------------------------
When you install tap-redshift, you need to create a ``config.json`` file for the database connection.

The json file requires the following attributes;

* ``host``
* ``port``
* ``dbname``
* ``user``
* ``password``

And an optional attribute;

* ``schema``

Example:

.. code-block:: json

    {
        "host": "REDSHIFT_HOSTT",
        "port": "REDSHIFT_PORT",
        "dbname": "REDSHIFT_DBNAME",
        "user": "REDSHIFT_USER",
        "password": "REDSHIFT_PASSWORD",
        "schema": "REDSHIFT_SCHEMA"
    }


Discovery mode
==============
The tap can be invoked in discovery mode to get the available tables and columns in the database. It points to the config file created to connect to redshift:

.. code-block:: shell

    $ tap-redshift --config config.json -d

A full catalog tap is writtem to stdout, with a JSON-schema description of each table. A source table directly corresponds to a Singer stream.

Redirect output from the tap's discovery mode to a file so that it can be modified when the tap is to be invoked in sync mode.

.. code-block:: shell

    $ tap-redshift -c config.json -d > properties.json

This runs the tap in discovery mode and copies the output into a ``properties.json`` file.


Tables and property selection
=============================
In sync mode, tap-redshift consumes a modified version of the catalog where tables and fields have been marked as selected.

Edit ``properties.json`` file to make selections by adding key-value of ``"selected": "true"`` to the top level schema and also in the metadata for each properties you want to select.

Example:


.. code-block:: json

    {
        "tap_stream_id": "sample-stream-id",
        "table_name": "sample-name",
        "stream": "sample-stream",
        "is_view": false,
        "database_name": "sample-dbname"
        "schema": {
            "selected": "true",
            "properties": {
                "name": {
                    "maxLength": 255,
                    "inclusion": "available",
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "id": {
                    "minimum": -2147483648,
                    "inclusion": "automatic",
                    "maximum": 2147483647,
                    "type": [
                        "null",
                        "integer"
                    ]
                }
            },
            "type": "object"
        },
        "metadata": [
            {
                "metadata": {
                    "selected": true,
                    "selected-by-default": true,
                    "sql-datatype": "int2"
                },
                "breadcrumb": [
                    "properties",
                    "id"
                ]
            },
            {
                "metadata": {
                    "selected": true,
                    "selected-by-default": true,
                    "sql-datatype": "varchar"
                },
                "breadcrumb": [
                    "properties",
                    "catname"
                ]
            },
        ]
    }

The tap can then be invoked in sync mode with the properties catalog argument:

.. code-block:: shell

    $ tap-redshift -c config.json --properties properties.json


Replication methods and state file
==================================
There are two ways to replicate a given table. FULL_TABLE and INCREMENTAL. FULL_TABLE replication is used by default.

Full Table
----------
Full-table replication extracts all data from the source table each time the tap is invoked without a state file.

Incremental
-----------
Incremental replication works in conjunction with a state file to only extract new records each time the tap is invoked i.e continue from the last synced data.

To use incremental replication, we need to add the ``replication_method`` and ``replication_key`` to the top level of the ``properties.json file``.

.. code-block:: json

    {
        "streams": [
            {
                "replication_method": "INCREMENTAL",
                "replication_key": "id",
                "tap_stream_id": "tap-sample",
                "schema": {
                    "properties": {
                        "name": {
                            "selected": "true",
                            "maxLength": 255,
                            "inclusion": "available",
                            "type": [
                                "null",
                                "string"
                            ]
                        },
                        "id": {
                            "selected": "true",
                            "minimum": -2147483648,
                            "inclusion": "automatic",
                            "maximum": 2147483647,
                            "type": [
                                "null",
                                "integer"
                            ]
                        }
                    }
                    "type": "object"
                }
            }
        ]
    }

We can then invoke the tap again in sync mode. This time the output will have ``STATE`` messages that contains a ``replication_key_value`` and ``bookmark`` for data that were extracted. 

Redirect the output to a ``state.json`` file. Normally, the target will echo the last STATE after it has finished processing data.

Run the code below to pass the state into a ``state.json`` file and then grab the last synced state data.

.. code-block:: shell

    $ tap-redshift -c config.json --properties properties.json > state.json

    $ tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

The ``state.json`` file should look like;

.. code-block:: json

    {
        "currently_syncing": "dbname-tablename",
        "bookmarks": {
            "dev-category": {
                "replication_key": "id",
                "version": 1516304171710,
                "replication_key_value": 3
            }
        }
    }

We can then always invoke the incremental replication with the ``state.json`` file to only sync new data created after the last synced data.

.. code-block:: shell

    $ tap-redshift -c config.json --properties properties.json --state state.json

