Metadata-Version: 2.1
Name: red-panda
Version: 0.1.2
Summary: Pandas and AWS interoperability for data science.
Home-page: https://github.com/yaojiach/red-panda
Author: Jiachen Yao
Maintainer: Jiachen Yao
License: MIT
Project-URL: Code, https://github.com/yaojiach/red-panda
Project-URL: Issue tracker, https://github.com/yaojiach/red-panda/issues
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Requires-Python: >=3.6
Provides-Extra: dev
Requires-Dist: pandas
Requires-Dist: psycopg2-binary
Requires-Dist: boto3
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: tox; extra == 'dev'

Red Panda 🐼😊
================

Data science on AWS without frustration.

Features
--------

- DataFrame/files to and from S3 and Redshift.
- Run queries on Redshift in Python.
- Manage files on S3.


Installation
------------

.. code-block:: console

    $ pip install red-panda


Using red-panda
---------------

Import `red_panda` and create an instance of `RedPanda`. If you create the instance with `debug` enabled (i.e. `rp = RedPanda(redshift_conf, s3_conf, debug=True)`), `red-panda` prints the planned queries instead of executing them.

.. code-block:: python

    from red_panda import RedPanda

    redshift_conf = {
        'user': 'awesome-developer',
        'password': 'strong-password',
        'host': 'awesome-domain.us-east-1.redshift.amazonaws.com',
        'port': 5432,
        'dbname': 'awesome-db',
    }

    s3_conf = {
        'aws_access_key_id': 'your-aws-access-key-id',
        'aws_secret_access_key': 'your-aws-secret-access-key',
        # 'aws_session_token': 'temporary-token-if-you-have-one',
    }

    rp = RedPanda(redshift_conf, s3_conf)

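To keep credentials out of source code, the two configuration dictionaries can be built from environment variables. The following is a minimal sketch, not part of `red-panda`: the helper `load_conf_from_env` and the `REDSHIFT_*` variable names are hypothetical, and the default port simply mirrors the example above. The `AWS_*` names follow the usual AWS convention.

.. code-block:: python

    import os


    def load_conf_from_env(env=None):
        """Build (redshift_conf, s3_conf) from environment variables.

        Hypothetical helper; the REDSHIFT_* variable names are assumptions.
        """
        env = os.environ if env is None else env
        redshift_conf = {
            'user': env['REDSHIFT_USER'],
            'password': env['REDSHIFT_PASSWORD'],
            'host': env['REDSHIFT_HOST'],
            # Assumed default, taken from the example configuration above
            'port': int(env.get('REDSHIFT_PORT', 5432)),
            'dbname': env['REDSHIFT_DBNAME'],
        }
        s3_conf = {
            'aws_access_key_id': env['AWS_ACCESS_KEY_ID'],
            'aws_secret_access_key': env['AWS_SECRET_ACCESS_KEY'],
        }
        # Only include the session token when temporary credentials are in use
        if env.get('AWS_SESSION_TOKEN'):
            s3_conf['aws_session_token'] = env['AWS_SESSION_TOKEN']
        return redshift_conf, s3_conf


    # Usage, assuming the variables are set in the environment:
    #     redshift_conf, s3_conf = load_conf_from_env()
    #     rp = RedPanda(redshift_conf, s3_conf)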

Load your Pandas DataFrame into Redshift as a new table.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

    s3_bucket = 's3-bucket-name'
    s3_path = 'parent-folder/child-folder'  # optional; omit if you don't use sub-folders
    s3_file_name = 'test.csv'  # optional; randomly generated if not provided
    rp.df_to_redshift(df, 'test_table', bucket=s3_bucket, path=s3_path, append=False)

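The comments above describe the S3 path and file name as optional. A minimal sketch of how a key could be assembled from those optional parts follows. `build_s3_key` is a hypothetical helper, not part of `red-panda`, and the random naming scheme shown here is an assumption that may differ from what the library generates.

.. code-block:: python

    import posixpath
    import uuid


    def build_s3_key(path=None, file_name=None):
        """Join an optional folder path and file name into an S3 key.

        Hypothetical helper; red-panda's own naming scheme may differ.
        """
        if file_name is None:
            # Assumed scheme: a random hex name with a .csv extension
            file_name = uuid.uuid4().hex + '.csv'
        if not path:
            # No sub-folders: the key is just the file name
            return file_name
        # S3 keys use forward slashes and no leading slash
        return posixpath.join(path.strip('/'), file_name)


    s3_key = build_s3_key('parent-folder/child-folder', 'test.csv')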

It is also possible to: 

- Upload a DataFrame or flat file to S3
- Delete files from S3
- Load S3 data into Redshift


.. code-block:: python

    # Upload a DataFrame to S3
    s3_key = s3_path + '/' + s3_file_name
    rp.df_to_s3(df, s3_bucket, s3_key)

    # Delete a file from S3
    rp.delete_from_s3(s3_bucket, s3_key)

    # Write the DataFrame to a local flat file, then upload it to S3
    df.to_csv('test_data.csv', index=False)
    rp.file_to_s3('test_data.csv', s3_bucket, s3_key)

    # Load the S3 file into Redshift, declaring each column's data type
    redshift_column_datatype = {
        'col1': 'int',
        'col2': 'int',
    }
    rp.s3_to_redshift(
        s3_bucket, s3_key, 'test_table', column_definition=redshift_column_datatype
    )

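The `column_definition` dictionary above is written by hand. As a rough sketch, it could also be derived from pandas dtype names. The helper `guess_column_definition` and its mapping table are hypothetical and not part of `red-panda`; the fallback type `varchar(256)` is an arbitrary assumption.

.. code-block:: python

    # Hypothetical mapping from pandas dtype names to Redshift column types
    PANDAS_TO_REDSHIFT = {
        'int32': 'int',
        'int64': 'bigint',
        'float32': 'real',
        'float64': 'double precision',
        'bool': 'boolean',
        'datetime64[ns]': 'timestamp',
    }
    # Assumed fallback for strings and any dtype not listed above
    DEFAULT_REDSHIFT_TYPE = 'varchar(256)'


    def guess_column_definition(dtype_names):
        """Return {column: redshift_type} given {column: pandas dtype name}."""
        return {
            column: PANDAS_TO_REDSHIFT.get(dtype_name, DEFAULT_REDSHIFT_TYPE)
            for column, dtype_name in dtype_names.items()
        }


    # With a DataFrame at hand, the dtype names could come from:
    #     dtype_names = df.dtypes.astype(str).to_dict()
    dtype_names = {'col1': 'int64', 'col2': 'int64'}
    redshift_column_datatype = guess_column_definition(dtype_names)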

For API documentation, visit https://red-panda.readthedocs.io/en/latest/.


TODO
----

In no particular order:

- Improve tests and docs.
- Better ways of inferring Redshift data types from a DataFrame.
- Explore using `S3 Transfer Manager`'s `upload_fileobj` for `df_to_s3` to take advantage of automatic multipart upload.
- Add encryption options for files uploaded to S3.
- Add COPY from S3 manifest file, in addition to COPY from S3 source path.
- Support more data formats.
- Build a CLI to manage data outside of Python.

In progress:

- Take advantage of Redshift slices for parallel processing. Split files for COPY.

Done:

- Handle the case where the user has an implicit column that is the index of a DataFrame. Currently, the index is automatically dropped.


