Metadata-Version: 2.1
Name: dbxconfig
Version: 1.0.2
Summary: Databricks Configuration Framework
Home-page: https://dbxconfig.readthedocs.io/en/latest/
Author: Shaun Ryan
Author-email: shaun_chiburi@hotmail.com
License: MIT
Project-URL: GitHub, https://github.com/semanticinsight/dbxconfig
Project-URL: Documentation, https://dbxconfig.readthedocs.io/en/latest/
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/markdown

# dbxconfig

Configuration framework for Databricks pipelines.
Define configuration and table dependencies in YAML, then load the table mapping configuration model.

Define your tables:

```yaml
landing:
  landing_dbx_patterns:
    customer_details_1: null
    customer_details_2: null

raw:
  raw_dbx_patterns:
    customers:
      ids: id
      depends_on:
        - landing.landing_dbx_patterns.customer_details_1
        - landing.landing_dbx_patterns.customer_details_2

base:
  base_dbx_patterns:
    customer_details_1:
      ids: id
      depends_on:
        - raw.raw_dbx_patterns.customers
    customer_details_2:
      ids: id
      depends_on:
        - raw.raw_dbx_patterns.customers
```
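
The `depends_on` entries above form a dependency graph between tables. As a minimal sketch of how such a graph could be ordered for loading, the example below uses the standard-library `graphlib` module (Python 3.9+). It is an illustration only, not dbxconfig's own implementation: the `tables` dictionary is a hand-written copy of the YAML above, and `dependency_graph` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Hand-written copy of the tables YAML above: stage -> database -> table -> settings.
tables = {
    "landing": {
        "landing_dbx_patterns": {"customer_details_1": None, "customer_details_2": None}
    },
    "raw": {
        "raw_dbx_patterns": {
            "customers": {
                "ids": "id",
                "depends_on": [
                    "landing.landing_dbx_patterns.customer_details_1",
                    "landing.landing_dbx_patterns.customer_details_2",
                ],
            }
        }
    },
    "base": {
        "base_dbx_patterns": {
            "customer_details_1": {"ids": "id", "depends_on": ["raw.raw_dbx_patterns.customers"]},
            "customer_details_2": {"ids": "id", "depends_on": ["raw.raw_dbx_patterns.customers"]},
        }
    },
}


def dependency_graph(tables):
    """Hypothetical helper: map each 'stage.database.table' key to the keys it depends on."""
    graph = {}
    for stage, databases in tables.items():
        for database, entries in databases.items():
            for table, settings in entries.items():
                key = f"{stage}.{database}.{table}"
                # Tables declared as null (no settings) have no dependencies.
                graph[key] = set((settings or {}).get("depends_on", []))
    return graph


graph = dependency_graph(tables)
# static_order() yields every table after all of the tables it depends on.
load_order = list(TopologicalSorter(graph).static_order())
print(load_order)
```

In the resulting order, both landing tables come before `raw.raw_dbx_patterns.customers`, which in turn comes before the two base tables.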

Define your load configuration:

```yaml
tables: ./test/Config/tables.yaml

landing:
  trigger: customerdetailscomplete-{{filename_date_format}}*.flg
  trigger_type: file
  database: landing_dbx_patterns
  table: "{{table}}"
  container: datalake
  root: "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}"
  filename: "{{table}}-{{filename_date_format}}*.csv"
  filename_date_format: "%Y%m%d"
  path_date_format: "%Y%m%d"
  format: cloudFiles
  spark_schema: ./test/Schema/{{table.lower()}}.yaml
  options:
    # autoloader
    cloudFiles.format: csv
    cloudFiles.schemaLocation:  /mnt/{{container}}/checkpoint/{{checkpoint}}
    cloudFiles.useIncrementalListing: auto
    # schema
    inferSchema: false
    enforceSchema: true
    columnNameOfCorruptRecord: _corrupt_record
    # csv
    header: false
    mode: PERMISSIVE
    encoding: windows-1252
    delimiter: ","
    escape: '"'
    nullValue: ""
    quote: '"'
    emptyValue: ""
    

raw:
  database: raw_dbx_patterns
  table: "{{table}}"
  container: datalake
  root: /mnt/{{container}}/data/raw
  path: "{{database}}/{{table}}"
  options:
    checkpointLocation: /mnt/{{container}}/checkpoint/{{database}}_{{table}}
    mergeSchema: true
```
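
The `{{...}}` placeholders in the load configuration are filled in from other values such as `container`, `table` and the date formats. The sketch below shows one way plain `{{name}}` placeholders could be substituted using only the standard library. It is an assumption-laden illustration, not how dbxconfig renders its templates: `render` is a hypothetical helper, the date is an arbitrary example, and expressions such as `{{table.lower()}}` are deliberately left untouched because they need a real template engine.

```python
import re
from datetime import date

# Matches simple placeholders such as {{table}}; expressions like {{table.lower()}} do not match.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Hypothetical helper: replace each {{name}} with values[name], leaving unknown names as-is."""

    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


values = {
    "container": "datalake",
    "table": "customer_details_1",
    # Arbitrary example date, formatted with the path_date_format from the config above.
    "path_date_format": date(2023, 1, 15).strftime("%Y%m%d"),
}

root = render(
    "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}",
    values,
)
print(root)  # /mnt/datalake/data/landing/dbx_patterns/customer_details_1/20230115
```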

Import the config objects into your pipeline:

```python
from dbxconfig import Config, Timeslice, StageType

# build path to configuration file
pattern = "auto_load_schema"
config_path = f"./Config/{pattern}.yaml"

# create a timeslice object for slice loading. Use * for all time (supports hrs, mins, seconds and sub-second).
timeslice = Timeslice(day="*", month="*", year="*")

# parse the configuration file and create a config object
config = Config(timeslice=timeslice, config_path=config_path)

# get the configuration for a table mapping to load.
table_mapping = config.get_table_mapping(
    timeslice=timeslice, 
    stage=StageType.raw, 
    table="customers"
)
```

## Development Setup

```
pip install -r requirements.txt
```

## Unit Tests

Run the unit tests with a coverage report:

```
pip install -e .
pytest test/unit --junitxml=junit/test-results.xml --cov=dbxconfig --cov-report=xml --cov-report=html
```

## Build

```
python setup.py sdist bdist_wheel
```

## Publish


```
twine upload dist/*
```


