Metadata-Version: 2.3
Name: megatron-energon
Version: 2.3.0
Summary: Megatron's multi-modal data loader
Project-URL: Homepage, https://github.com/NVIDIA/Megatron-Energon
Author-email: Lukas Vögtle <lvoegtle@nvidia.com>, Philipp Fischer <pfischer@nvidia.com>
License-Expression: BSD-3-Clause
License-File: LICENSE
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Requires-Dist: braceexpand
Requires-Dist: click
Requires-Dist: numpy
Requires-Dist: pillow>=10.0.1
Requires-Dist: pyyaml
Requires-Dist: s3fs
Requires-Dist: torch
Requires-Dist: tqdm
Requires-Dist: webdataset
Provides-Extra: dev
Requires-Dist: black; extra == 'dev'
Requires-Dist: isort; extra == 'dev'
Requires-Dist: myst-parser; extra == 'dev'
Requires-Dist: sphinx; extra == 'dev'
Requires-Dist: sphinx-click; extra == 'dev'
Requires-Dist: sphinx-rtd-theme; extra == 'dev'
Requires-Dist: sphinxcontrib-napoleon; extra == 'dev'
Provides-Extra: transforms
Requires-Dist: torchvision; extra == 'transforms'
Description-Content-Type: text/markdown

<a name="top"></a>

<div align="center">
  <h3 align="center">Megatron's multi-modal data loader</h3>
  <h3 align="center">Megatron Energon</h3>
  <p align="center">
    <a href="https://github.com/NVIDIA/Megatron-Energon/actions/workflows/tests.yml"><img src="https://github.com/NVIDIA/Megatron-Energon/actions/workflows/tests.yml/badge.svg" alt="Tests"></a> <a href="https://nvidia.github.io/Megatron-Energon/"><img src="https://github.com/NVIDIA/Megatron-Energon/actions/workflows/documentation.yml/badge.svg" alt="Documentation"></a>
    <br />
    <a href="https://github.com/NVIDIA/Megatron-Energon/issues">Report Bug</a>
    ·
    <a href="https://github.com/NVIDIA/Megatron-Energon/issues">Request Feature</a>
  </p>
</div>

<br />

_**DISCLAIMER**: This package contains research code. APIs may change._

# What is this?

**Megatron Energon** is the multi-modal data loader of [Megatron](https://github.com/NVIDIA/Megatron-LM). It can also be used on its own.

It is best at:

- loading large-scale training data for large multi-modal models
- blending many different datasets together
- distributing the work across many nodes and processes of a cluster
- ensuring reproducibility and resumability
- adapting easily to various types of data samples and processing

Try using it together with [Megatron](https://github.com/NVIDIA/Megatron-LM) Core.

# Quickstart
**Megatron Energon** is a pip-installable Python package that offers:
- dataset-related classes that you can import in your project
- a command line utility for data preprocessing and conversion

This document is just a quick start. Please also check out the [documentation](https://nvidia.github.io/Megatron-Energon/).

## Installation

```shell
pip install megatron-energon
```
Or install it directly from the GitHub repository:
```shell
pip install git+https://github.com/NVIDIA/Megatron-Energon.git
```

**NOTE**: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.

For more details on installing this package, see the [installation guide](https://nvidia.github.io/Megatron-Energon/installation.html).
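
To confirm that both the package and the command line tool are in place after installation, a small check along the following lines can help. This is a minimal sketch using only the Python standard library; the helper name `check_energon_install` is hypothetical and not part of the package.

```python
import shutil
from importlib import metadata
from typing import Dict, Optional


def check_energon_install() -> Dict[str, Optional[str]]:
    """Report the installed package version and the CLI location, if any.

    Hypothetical helper for illustration; it only inspects the environment.
    """
    try:
        # Distribution name as published on PyPI.
        version: Optional[str] = metadata.version("megatron-energon")
    except metadata.PackageNotFoundError:
        version = None
    return {
        "version": version,
        # shutil.which returns None if `energon` is not on PATH.
        "cli_path": shutil.which("energon"),
    }


status = check_energon_install()
print(status)
```

If `version` is set but `cli_path` is `None`, the package is likely importable from a local copy or a different environment than the one on your `PATH`.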

## Usage of the command line tool

After installation, the command `energon` will be available.

Here are some examples of what you can do:

| Command | Description  |
|---|---|
| `energon prepare DATASET_ROOT` | Take an existing WebDataset and add the required YAML files to turn it into an energon-compatible dataset |
| `energon lint DATASET_ROOT` | Verify that the dataset complies with the energon dataset format and that all samples are loadable |


## Usage of the library

To get started, pick a [WebDataset](https://github.com/webdataset/webdataset)-compliant dataset and run `energon prepare DATASET_ROOT` on it. This starts the interactive assistant, which creates the `.nv-meta` folder.
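
If you have no WebDataset at hand, a tiny one can be built for experimentation. In the WebDataset layout, a dataset is a set of tar shards, and files inside a shard that share a base name form one sample. The sketch below writes a single shard with the Python standard library. The helper name `write_tiny_shard`, the shard file name, and the `txt`/`json` parts are assumptions chosen for illustration, and whether such a minimal shard covers your use case is not verified here.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_tiny_shard(root: Path, num_samples: int = 3) -> Path:
    """Write one WebDataset-style tar shard into `root` and return its path.

    Hypothetical helper: each sample consists of a text file and a JSON file
    that share the same base name (the sample key).
    """
    root.mkdir(parents=True, exist_ok=True)
    shard_path = root / "shard_000000.tar"
    with tarfile.open(shard_path, "w") as tar:
        for i in range(num_samples):
            key = f"{i:06d}"
            parts = {
                "txt": f"caption for sample {i}".encode("utf-8"),
                "json": json.dumps({"index": i}).encode("utf-8"),
            }
            for ext, payload in parts.items():
                # Members of one sample are stored next to each other.
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return shard_path


dataset_root = Path(tempfile.mkdtemp()) / "my_dataset"
shard = write_tiny_shard(dataset_root)
with tarfile.open(shard) as tar:
    member_names = tar.getnames()
print(member_names)
```

The directory in `dataset_root` can then serve as the `DATASET_ROOT` argument of `energon prepare`.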

Once done, try to load it from your Python program:

```python
from megatron.energon import get_train_dataset, get_loader


train_loader = get_loader(get_train_dataset(
    '/my/dataset/path',
    batch_size=32,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
))

for batch in train_loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
```
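
The example above passes `shuffle_buffer_size=None`. For readers unfamiliar with the term, the following generic sketch shows how a bounded shuffle buffer works in a streaming loader. It illustrates the general technique only and is not energon's implementation; the function name `shuffle_buffer` and its `seed` parameter are made up for this example. How energon itself interprets the argument is covered in its documentation.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items of `stream` in a locally shuffled order using a bounded buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


samples = list(shuffle_buffer(range(10), buffer_size=4, seed=0))
print(samples)
```

A larger buffer mixes samples over a wider window at the cost of more memory, while a buffer of size 1 leaves the stream order unchanged.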

For more details, read the [documentation](https://nvidia.github.io/Megatron-Energon/).

Most likely, you'll need your own [task encoder](https://nvidia.github.io/Megatron-Energon/task_encoders.html).
