Metadata-Version: 2.1
Name: databricks-dbapi
Version: 0.6.0
Summary: A DBAPI 2.0 interface and SQLAlchemy dialect for Databricks interactive clusters.
Home-page: https://github.com/crflynn/databricks-dbapi
License: MIT
Keywords: databricks,hive,dbapi,sqlalchemy,dialect
Author: Christopher Flynn
Author-email: crf204@gmail.com
Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Database
Classifier: Topic :: Database :: Front-Ends
Provides-Extra: hive
Provides-Extra: odbc
Provides-Extra: sqlalchemy
Requires-Dist: pyhive (>=0.6.1,<0.7.0); extra == "hive"
Requires-Dist: pyodbc (>=4.0.30,<5.0.0); extra == "odbc"
Requires-Dist: sqlalchemy (>=1.3,<2.0); extra == "sqlalchemy"
Requires-Dist: thrift (>=0.15.0,<0.16.0); extra == "hive"
Project-URL: Documentation, https://github.com/crflynn/databricks-dbapi
Project-URL: Repository, https://github.com/crflynn/databricks-dbapi
Description-Content-Type: text/x-rst

databricks-dbapi
================

|pypi| |pyversions|

.. |pypi| image:: https://img.shields.io/pypi/v/databricks-dbapi.svg
    :target: https://pypi.python.org/pypi/databricks-dbapi

.. |pyversions| image:: https://img.shields.io/pypi/pyversions/databricks-dbapi.svg
    :target: https://pypi.python.org/pypi/databricks-dbapi

A thin wrapper around `pyhive <https://github.com/dropbox/PyHive>`__ and `pyodbc <https://github.com/mkleehammer/pyodbc>`__ for creating a `DBAPI <https://www.python.org/dev/peps/pep-0249/>`__ connection to Databricks Workspace and SQL Analytics clusters. SQL Analytics clusters require the `Simba ODBC driver <https://databricks.com/spark/odbc-driver-download>`__.

Also provides SQLAlchemy Dialects using ``pyhive`` and ``pyodbc`` for Databricks clusters. Databricks SQL Analytics clusters only support the ``pyodbc``-driven dialect.

Installation
------------

Install using pip. You *must* specify at least one of the extras {``hive`` or ``odbc``}. For ``odbc`` the `Simba driver <https://databricks.com/spark/odbc-driver-download>`__ is required:

.. code-block:: bash

    pip install databricks-dbapi[hive,odbc]


For SQLAlchemy support install with:

.. code-block:: bash

    pip install databricks-dbapi[hive,odbc,sqlalchemy]

Usage
-----

PyHive
~~~~~~

The ``connect()`` function returns a ``pyhive`` Hive connection object, which internally wraps a ``thrift`` connection.

Connecting with ``http_path``, ``host``, and a ``token``:

.. code-block:: python

    import os

    from databricks_dbapi import hive


    token = os.environ["DATABRICKS_TOKEN"]
    host = os.environ["DATABRICKS_HOST"]
    http_path = os.environ["DATABRICKS_HTTP_PATH"]


    connection = hive.connect(
        host=host,
        http_path=http_path,
        token=token,
    )
    cursor = connection.cursor()

    cursor.execute("SELECT * FROM some_table LIMIT 100")

    print(cursor.fetchone())
    print(cursor.fetchall())


The ``pyhive`` connection also provides async functionality:

.. code-block:: python

    import os

    from databricks_dbapi import hive
    from TCLIService.ttypes import TOperationState


    token = os.environ["DATABRICKS_TOKEN"]
    host = os.environ["DATABRICKS_HOST"]
    cluster = os.environ["DATABRICKS_CLUSTER"]


    connection = hive.connect(
        host=host,
        cluster=cluster,
        token=token,
    )
    cursor = connection.cursor()

    cursor.execute("SELECT * FROM some_table LIMIT 100", async_=True)

    status = cursor.poll().operationState
    while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
        logs = cursor.fetch_logs()
        for message in logs:
            print(message)

        # If needed, an asynchronous query can be cancelled at any time with:
        # cursor.cancel()

        status = cursor.poll().operationState

    print(cursor.fetchall())


ODBC
~~~~

The ODBC DBAPI requires the Simba ODBC driver.

Connecting with ``http_path``, ``host``, and a ``token``:

.. code-block:: python

    import os

    from databricks_dbapi import odbc


    token = os.environ["DATABRICKS_TOKEN"]
    host = os.environ["DATABRICKS_HOST"]
    http_path = os.environ["DATABRICKS_HTTP_PATH"]


    connection = odbc.connect(
        host=host,
        http_path=http_path,
        token=token,
        driver_path="/path/to/simba/driver",
    )
    cursor = connection.cursor()

    cursor.execute("SELECT * FROM some_table LIMIT 100")

    print(cursor.fetchone())
    print(cursor.fetchall())


SQLAlchemy Dialects
-------------------

databricks+pyhive
~~~~~~~~~~~~~~~~~

Installing registers the ``databricks+pyhive`` dialect/driver with SQLAlchemy. Fill in the required information when passing the engine URL.

.. code-block:: python

    from sqlalchemy import *
    from sqlalchemy.engine import create_engine
    from sqlalchemy.schema import *


    engine = create_engine(
        "databricks+pyhive://token:<databricks_token>@<host>:<port>/<database>",
        connect_args={"http_path": "<cluster_http_path>"}
    )

    logs = Table("my_table", MetaData(bind=engine), autoload=True)
    print(select([func.count("*")], from_obj=logs).scalar())


databricks+pyodbc
~~~~~~~~~~~~~~~~~

Installing registers the ``databricks+pyodbc`` dialect/driver with SQLAlchemy. Fill in the required information when passing the engine URL.

.. code-block:: python

    from sqlalchemy import *
    from sqlalchemy.engine import create_engine
    from sqlalchemy.schema import *


    engine = create_engine(
        "databricks+pyodbc://token:<databricks_token>@<host>:<port>/<database>",
        connect_args={"http_path": "<cluster_http_path>", "driver_path": "/path/to/simba/driver"}
    )

    logs = Table("my_table", MetaData(bind=engine), autoload=True)
    print(select([func.count("*")], from_obj=logs).scalar())


Refer to the following documentation for more details on hostname, cluster name, and http path:

* `Databricks <https://docs.databricks.com/user-guide/bi/jdbc-odbc-bi.html>`__
* `Azure Databricks <https://docs.azuredatabricks.net/user-guide/bi/jdbc-odbc-bi.html>`__


Related
-------

* `pyhive <https://github.com/dropbox/PyHive>`__
* `thrift <https://github.com/apache/thrift/tree/master/lib/py>`__
* `pyodbc <https://github.com/mkleehammer/pyodbc>`__

