Metadata-Version: 2.1
Name: fw-dataset
Version: 0.1.0rc10
Summary: A library for working with Flywheel datasets
Author: joshicola
Author-email: joshuajacobs@flywheel.io
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: adlfs (>=2024.7.0,<2025.0.0)
Requires-Dist: deepdiff (>=8.0.1,<9.0.0)
Requires-Dist: duckdb (>=1.1.1,<2.0.0)
Requires-Dist: flywheel-sdk (>=19.1.0,<20.0.0)
Requires-Dist: fsspec (>=2024.9.0,<2025.0.0)
Requires-Dist: fw-client (>=0.8.6,<0.9.0)
Requires-Dist: gcsfs (>=2024.9.0.post1,<2025.0.0)
Requires-Dist: orjson (>=3.10.7,<4.0.0)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: pyarrow (>=17.0.0,<18.0.0)
Requires-Dist: pydantic (>=2.9.2,<3.0.0)
Requires-Dist: s3fs (>=2024.9.0,<2025.0.0)
Description-Content-Type: text/markdown

# fw-dataset <!-- omit in toc -->

This repository contains classes and functions for creating, managing, and serving
Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from
the Flywheel Data Model.

- [Work In Progress](#work-in-progress)
- [Getting started](#getting-started)
  - [Installation](#installation)
  - [Usage](#usage)
    - [Accessing Datasets](#accessing-datasets)
    - [Rendering Datasets](#rendering-datasets)
    - [Unassociated Datasets](#unassociated-datasets)
    - [Merging Related Datasets](#merging-related-datasets)
      - [Requirements](#requirements)
- [Flywheel Project Requirements](#flywheel-project-requirements)
  - [Flywheel Project Structure](#flywheel-project-structure)
    - [type](#type)
    - [bucket](#bucket)
    - [prefix](#prefix)
    - [storage\_id](#storage_id)
  - [Dataset Structure](#dataset-structure)
    - [Schema Files](#schema-files)
- [Future Development](#future-development)

## Work In Progress

This is a work in progress. Not all functionality is implemented yet.

## Getting started

### Installation

The `fw-dataset` package has been built for use with Python 3.10 and above. It can be
installed with pip:

```bash
pip install fw-dataset
```

or poetry:

```bash
poetry add fw-dataset
```

### Usage

#### Accessing Datasets

The `fw-dataset` package provides a `FWDatasetClient` class that can be used to access
existing Flywheel datasets on cloud storage or local filesystems.

```python
from fw_dataset import FWDatasetClient

# Create a client with a Flywheel API-Key
api_key = "your-api-key"
dataset_client = FWDatasetClient(api_key=api_key)

# If you are in a Flywheel Jupyter Workspace with the environment variables
# FW_HOSTNAME and FW_WS_API_KEY set, the following will work:
# dataset_client = FWDatasetClient()

# list existing datasets (see below for Flywheel Project Requirements)
datasets = dataset_client.datasets()

# link to a specific project-associated dataset
# by project id
project_id = "your-project-id"
dataset = dataset_client.dataset(project_id=project_id)

# or by project path
group = "your-group"
project_label = "your-project-label"
dataset = dataset_client.dataset(project_path=f"fw://{group}/{project_label}")

# connect the dataset to all underlying data
conn = dataset.connect()

# query the dataset
SQL = "SELECT * FROM acquisitions"

# get the results
results = conn.execute(SQL)
result_df = results.df()
result_df.head()
```

#### Rendering Datasets

The `fw-dataset` package provides a `DatasetBuilder` class that can be used to render a
dataset from a Flywheel project. The `DatasetBuilder` renders the dataset structure and
metadata from a Flywheel project into a local or cloud storage structure.

```python
from fw_dataset.admin.dataset_builder import DatasetBuilder

# Set the Flywheel API-Key and the ids of the project and storage to use
api_key = "your-api-key"
project_id = "your-project-id"
storage_id = "your-storage-id"

# Initialize the dataset builder with an api-key, project-id, and storage-id
dataset_builder = DatasetBuilder(
    api_key=api_key, project_id=project_id, storage_id=storage_id
)

# Render the dataset structure and metadata
dataset = dataset_builder.render_dataset()

# Connect to the dataset
conn = dataset.connect()

# Query the dataset
SQL = "SELECT * FROM subjects LIMIT 10"
conn.execute(SQL).df()
```

The [Dataset Structure](#dataset-structure) will be rendered in the storage bucket or
local storage under the path specified by:

`{bucket}/datasets/{instance}/{group}/{project_id}/latest/`
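
As a rough illustration of how this template expands, the sketch below fills in the
placeholders with plain string formatting. The helper name `latest_dataset_path` and the
example values are hypothetical; they are not part of the `fw-dataset` API.

```python
def latest_dataset_path(bucket: str, instance: str, group: str, project_id: str) -> str:
    """Fill in the `latest` path template (illustrative helper, not library code)."""
    return f"{bucket}/datasets/{instance}/{group}/{project_id}/latest/"


# Example values are placeholders only
print(latest_dataset_path("your-bucket", "your-instance", "your-group", "your-project-id"))
```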

If the `latest` directory already exists, and is the version you are trying to render,
the Dataset object is returned. If the `latest` directory does not exist, it is created
and the Dataset object is returned. To create a new dataset from a current project
snapshot, use the `force_new` parameter:

```python
dataset = dataset_builder.render_dataset(force_new=True)
```

Additionally, if you want to render a project's tabular data files and custom
information into dataset tables and schemas, you must use the following flags:

```python
dataset = dataset_builder.render_dataset(parse_tabular_data=True, parse_custom_info=True)
```

#### Unassociated Datasets

If you have a valid dataset that is not associated with a Flywheel project, you can still
use the `FWDatasetClient` to access the dataset. You will need to provide the
`type`, `bucket`, `prefix`, and `credentials` of the cloud or local filesystem to
instantiate and query the dataset.

```python
from fw_dataset import FWDatasetClient

# No API-Key is needed, and the dataset client does not have to be instantiated

fs_type = "s3" # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}

dataset = FWDatasetClient.get_dataset_from_filesystem(fs_type, bucket, prefix, credentials)
```

#### Merging Related Datasets

If you have multiple datasets that have related tables you want to query together, you
can merge the datasets into a single dataset.

NOTE: Federated Querying is not yet enabled across datasets. This is a work in progress.

##### Requirements

1. The `source` dataset must have a valid `tables` directory structure.
2. The `source` dataset must have a valid `schemas` directory structure.
    - Every table in the `tables` directory must have a valid corresponding schema file
      in the `schemas` directory.
    - The schema file must be named `{table_name}.schema.json` where `{table_name}` is
      the name of the table that the schema describes.
    - The schema file must be a valid JSON file with the minimum structure:

        ```json
        {
            "schema": "http://json-schema.org/draft-07/schema#",
            "id": "{table_name}",
            "description": "",
            "properties": {},
            "required": [],
            "type": "object"
        }
        ```

3. The `destination` dataset must have the same requirements as the `source` dataset.
4. Tables and schemas selected from the `source` MUST NOT have the same names as
   existing ones in the `destination`.

Once the above requirements have been met, you may merge the datasets by copying or
moving the selected tables and schemas from the `source` dataset to the `destination`
dataset.
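
For datasets on a local filesystem, the copy step could look roughly like the sketch
below. It uses only the Python standard library; `check_layout` and `merge_tables` are
hypothetical helpers, not part of `fw-dataset`, and the paths are assumed to point at
dataset version directories containing `tables` and `schemas`. It checks that files
exist and that names do not collide, but it does not validate schema contents.

```python
import shutil
import tempfile
from pathlib import Path


def check_layout(root: Path) -> None:
    """Requirements 1-3: `tables` and `schemas` exist and every table has a schema."""
    for sub in ("tables", "schemas"):
        if not (root / sub).is_dir():
            raise FileNotFoundError(f"{root} has no '{sub}' directory")
    for table in (root / "tables").iterdir():
        schema = root / "schemas" / f"{table.name}.schema.json"
        if table.is_dir() and not schema.is_file():
            raise FileNotFoundError(f"table '{table.name}' in {root} has no schema file")


def merge_tables(source: Path, destination: Path, table_names: list[str]) -> None:
    """Copy the selected tables and their schemas from source to destination."""
    check_layout(source)
    check_layout(destination)
    # Validate every selection before copying anything
    for name in table_names:
        schema_name = f"{name}.schema.json"
        if not (source / "tables" / name).is_dir():
            raise FileNotFoundError(f"source has no table '{name}'")
        # Requirement 4: no name collisions in the destination
        if (destination / "tables" / name).exists() or (
            destination / "schemas" / schema_name
        ).exists():
            raise FileExistsError(f"'{name}' already exists in the destination")
    for name in table_names:
        schema_name = f"{name}.schema.json"
        shutil.copytree(source / "tables" / name, destination / "tables" / name)
        shutil.copy2(source / "schemas" / schema_name, destination / "schemas" / schema_name)


# Usage with throwaway directories and placeholder (empty) schema files
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "source"), Path(tmp, "destination")
    for root, table in ((src, "conditions"), (dst, "subjects")):
        (root / "tables" / table).mkdir(parents=True)
        (root / "schemas").mkdir()
        (root / "schemas" / f"{table}.schema.json").write_text("{}")
    merge_tables(src, dst, ["conditions"])
    print(sorted(p.name for p in (dst / "tables").iterdir()))
```

Validating all selections before the first copy means a name collision does not leave
the destination half merged.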

## Flywheel Project Requirements

For the Flywheel Dataset Client and the Dataset objects to function, the following
requirements must be met:

### Flywheel Project Structure

The Flywheel Project must have the following valid custom information metadata:

```json
{
    "dataset": {
        "type": "s3",
        "bucket": "bucket-name",
        "prefix": "path/to/dataset",
        "storage_id": "storage-id-of-fw-storage-object"
    }
}
```

#### type

The `type` field must be one of the following:

- `s3`: The dataset is stored in an S3 bucket.
- `gcs`: The dataset is stored in a Google Cloud Storage bucket.
- `azure`: The dataset is stored in an Azure Blob Storage container.
- `fs`, `local`: The dataset is stored on a local filesystem.

#### bucket

The `bucket` field is the name of the bucket or container where the dataset is stored.

#### prefix

The `prefix` field is the path to the dataset within the bucket or container.

The directory structure beneath the `prefix` should be as described in the
[Dataset Structure](#dataset-structure) section.

#### storage_id

The `storage_id` field is the Flywheel ID of the cloud storage record that describes the
filesystem or cloud storage bucket that the dataset is stored in. This should be a valid
storage object in the Flywheel database.
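
Before pointing the client at a project, it can help to sanity-check this metadata. The
sketch below is one possible check, assuming the project's custom information is
available as a plain dictionary. `validate_dataset_info` is a hypothetical helper, not
part of `fw-dataset`, and requiring every field to be a non-empty string is an
assumption made for this example.

```python
REQUIRED_KEYS = ("type", "bucket", "prefix", "storage_id")
ALLOWED_TYPES = {"s3", "gcs", "azure", "fs", "local"}


def validate_dataset_info(project_info: dict) -> list[str]:
    """Return a list of problems found in the `dataset` metadata (empty if none)."""
    dataset = project_info.get("dataset")
    if not isinstance(dataset, dict):
        return ["missing 'dataset' object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = dataset.get(key)
        # Assumption: each field must be a non-empty string
        if not isinstance(value, str) or not value:
            problems.append(f"'{key}' must be a non-empty string")
    fs_type = dataset.get("type")
    if isinstance(fs_type, str) and fs_type and fs_type not in ALLOWED_TYPES:
        problems.append(f"'type' must be one of {sorted(ALLOWED_TYPES)}")
    return problems


info = {
    "dataset": {
        "type": "s3",
        "bucket": "bucket-name",
        "prefix": "path/to/dataset",
        "storage_id": "storage-id-of-fw-storage-object",
    }
}
print(validate_dataset_info(info))  # an empty list means no problems were found
```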

### Dataset Structure

The dataset should be stored in the bucket or container with the following structure:

```bash
{bucket}/{prefix}/
├── latest/
│   └── latest/
│       ├── provenance/
│       │   └── dataset_description.json
│       ├── tables/
│       │   └── {table_name}/ (a directory structure of partitioned parquet files)
│       │       └── {partitions}/{hash}.parquet
│       └── schemas/
│           └── {table_name}.schema.json
└── versions/
    ├── latest_version.json (provenance/dataset_description.json of versions/latest)
    └── {version}/
        ├── provenance/
        │   └── dataset_description.json
        ├── tables/
        │   └── {table_name}/ (a directory structure of partitioned parquet files)
        │       └── {partitions}/{hash}.parquet
        └── schemas/
            └── {table_name}.schema.json
```

The `latest_version.json` file is a copy of the latest version's
`provenance/dataset_description.json`. Both are minimal descriptions of a dataset
version. The `latest` directory represents the latest version of the dataset. Archived
versions of the dataset are stored in the `versions` directory and can be deleted once
they are no longer needed.

The above structure is more completely described in the
[Dataset Definition](docs/Dataset_Definition.md#dataset-components) Document in the
`docs` directory.

#### Schema Files

The schema files are JSON files that describe the schema of the tables in the dataset.
They are stored in the `schemas` directory and are named `{table_name}.schema.json`,
where `{table_name}` is the name of the table that the schema describes.

Ideally, the schema files should be fully descriptive. However, if a minimal schema is
desired merely to allow the dataset to be queried, the schema file can be as simple as:

```json
{
    "schema": "http://json-schema.org/draft-07/schema#",
    "id": "{table_name}",
    "description": "Table derived from Tabular Data File: conditions.csv",
    "properties": {},
    "required": [],
    "type": "object"
}
```
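
A minimal schema file like this could be generated with a few lines of standard-library
Python. The sketch below writes one into a local `schemas` directory;
`write_minimal_schema` is a hypothetical helper, not part of `fw-dataset`, and the keys
simply mirror the example above.

```python
import json
import tempfile
from pathlib import Path


def write_minimal_schema(schemas_dir: Path, table_name: str, description: str = "") -> Path:
    """Write `{table_name}.schema.json` with the minimal structure shown above."""
    schema = {
        "schema": "http://json-schema.org/draft-07/schema#",
        "id": table_name,
        "description": description,
        "properties": {},
        "required": [],
        "type": "object",
    }
    schemas_dir.mkdir(parents=True, exist_ok=True)
    path = schemas_dir / f"{table_name}.schema.json"
    path.write_text(json.dumps(schema, indent=4))
    return path


# Usage with a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    schema_path = write_minimal_schema(
        Path(tmp) / "schemas",
        "conditions",
        "Table derived from Tabular Data File: conditions.csv",
    )
    print(schema_path.name)
```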

## Future Development

Future development will include:

- [ ] Dataset creation and management from library
  - Create a new dataset from a Flywheel project
  - Dataset will be structured on local or cloud storage
  - Dataset essentials will be stored in the Flywheel project metadata
  - Dataset versions can be deleted from the storage structure
  - Dataset versions can be archived
  - Dataset can be removed from a Flywheel project

