Metadata-Version: 2.1
Name: smashed
Version: 0.21.5
Summary: SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batching, and more. Supports datasets from Huggingface, torchdata iterables, or simple lists of dictionaries.
Author-email: Allen Institute for Artificial Intelligence <contact@allenai.org>, Luca Soldaini <luca@soldaini.net>, Kyle Lo <kylel@allenai.org>
Maintainer-email: Luca Soldaini <luca@soldaini.net>
License: Apache-2.0
Project-URL: Homepage, https://github.com/allenai/smashed
Project-URL: Repository, https://github.com/allenai/smashed
Project-URL: Bug Tracker, https://github.com/allenai/smashed/issues
Keywords: mappers,pytorch,torch,huggingface,transformers,datasets,dict,pipeline,preprocessing,nlp,natural language processing,text,prompting,prefix tuning,in context learning
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: necessary>=0.4.3
Requires-Dist: trouting>=0.3.3
Requires-Dist: ftfy>=6.1.1
Requires-Dist: platformdirs>=2.5.0
Requires-Dist: glom>=21.0.0
Requires-Dist: Jinja2>=3.0.3
Requires-Dist: numpy>=1.19.5
Provides-Extra: dev
Requires-Dist: black[jupyter]>=21.12b0; extra == "dev"
Requires-Dist: isort>=5.8.0; extra == "dev"
Requires-Dist: mypy>=0.971; extra == "dev"
Requires-Dist: pytest>=5.2; extra == "dev"
Requires-Dist: ipython>=8.4.0; extra == "dev"
Requires-Dist: autopep8>=1.7.0; extra == "dev"
Requires-Dist: flake8>=5.0; extra == "dev"
Requires-Dist: ipdb>=0.13.0; extra == "dev"
Requires-Dist: flake8-pyi>=22.8.1; extra == "dev"
Requires-Dist: Flake8-pyproject>=1.1.0; extra == "dev"
Requires-Dist: moto[all,ec2,s3]>=4.0.0; extra == "dev"
Provides-Extra: remote
Requires-Dist: smart-open>=5.2.1; extra == "remote"
Requires-Dist: boto3>=1.25.5; extra == "remote"
Provides-Extra: torch
Requires-Dist: torch>=1.9; extra == "torch"
Provides-Extra: datasets
Requires-Dist: smashed[torch]; extra == "datasets"
Requires-Dist: transformers>=4.5; extra == "datasets"
Requires-Dist: datasets>=2.8.0; extra == "datasets"
Requires-Dist: dill>=0.3.0; extra == "datasets"
Provides-Extra: prompting
Requires-Dist: smashed[torch]; extra == "prompting"
Requires-Dist: transformers>=4.5; extra == "prompting"
Requires-Dist: promptsource>=0.2.3; extra == "prompting"
Requires-Dist: blingfire>=0.1.8; extra == "prompting"
Provides-Extra: torchdata
Requires-Dist: torch>=1.13.1; extra == "torchdata"
Requires-Dist: torchdata>=0.5.1; extra == "torchdata"
Provides-Extra: all
Requires-Dist: smashed[dev]; extra == "all"
Requires-Dist: smashed[torch]; extra == "all"
Requires-Dist: smashed[datasets]; extra == "all"
Requires-Dist: smashed[torchdata]; extra == "all"
Requires-Dist: smashed[remote]; extra == "all"
Requires-Dist: smashed[prompting]; extra == "all"

![Colorful logo of smashed. It is the word smashed written in a playful font that vaguely looks like pipes.](https://github.com/allenai/smashed/raw/main/resources/smashed.png)

**S**equential **MA**ppers for **S**equences of **HE**terogeneous **D**ictionaries is a set of Python interfaces designed to apply transformations to samples in datasets, which are often implemented as sequences of dictionaries. To start, run

```bash
pip install smashed
```

## Example of Usage

Mappers are initialized and then applied sequentially. In the following example, we create a list of mappers and apply them to a dataset of samples, each containing a sequence of strings.
The mappers perform the following operations:

1. Tokenize each sequence, cropping it to a maximum length if necessary.
2. Group sequences into strides, each bounded by a maximum length or a maximum number of sequences.
3. Add padding symbols to sequences and attention masks.
4. Concatenate all sequences from a stride into a single sequence.

```python
import transformers
from smashed.mappers import (
    TokenizerMapper,
    MultiSequenceStriderMapper,
    TokensSequencesPaddingMapper,
    AttentionMaskSequencePaddingMapper,
    SequencesConcatenateMapper,
)

tokenizer = transformers.AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-uncased',
)

mappers = [
    TokenizerMapper(
        input_field='sentences',
        tokenizer=tokenizer,
        add_special_tokens=False,
        truncation=True,
        max_length=80
    ),
    MultiSequenceStriderMapper(
        max_stride_count=2,
        max_length=512,
        tokenizer=tokenizer,
        length_reference_field='input_ids'
    ),
    TokensSequencesPaddingMapper(
        tokenizer=tokenizer,
        input_field='input_ids'
    ),
    AttentionMaskSequencePaddingMapper(
        tokenizer=tokenizer,
        input_field='attention_mask'
    ),
    SequencesConcatenateMapper()
]

dataset = [
    {
        'sentences': [
            'This is a sentence.',
            'This is another sentence.',
            'Together, they make a paragraph.',
        ]
    },
    {
        'sentences': [
            'This sentence belongs to another sample',
            'Overall, the dataset is made of multiple samples.',
            'Each sample is made of multiple sentences.',
            'Samples might have a different number of sentences.',
            'And that is the story!',
        ]
    }
]

for mapper in mappers:
    dataset = mapper.map(dataset)

print(len(dataset))

# >>> 5

print(dataset[0])

# >>> {
#    'input_ids': [
#        101,
#        2023,
#        2003,
#        1037,
#        6251,
#        1012,
#        102,
#        2023,
#        2003,
#        2178,
#        6251,
#        1012,
#        102
#    ],
#    'attention_mask': [
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1,
#        1
#    ]
# }
```
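
To make the striding and concatenation steps more concrete, here is a minimal pure-Python sketch that reproduces only the sample count of the example above. It is an illustration under simplifying assumptions, not SMASHED's implementation: the `stride` and `concatenate` helpers are hypothetical, strides are assumed not to overlap, and the sketch ignores `max_length`, padding, and special tokens.

```python
from typing import Dict, Iterator, List

# A sample maps field names to a list of token-id sequences.
Sample = Dict[str, List[List[int]]]


def stride(sample: Sample, max_stride_count: int) -> Iterator[Sample]:
    """Split one sample into several, each with at most `max_stride_count` sequences."""
    n_sequences = len(sample["input_ids"])
    for start in range(0, n_sequences, max_stride_count):
        yield {
            field: sequences[start:start + max_stride_count]
            for field, sequences in sample.items()
        }


def concatenate(sample: Sample) -> Dict[str, List[int]]:
    """Flatten every field of a sample into a single sequence."""
    return {
        field: [token for sequence in sequences for token in sequence]
        for field, sequences in sample.items()
    }


# Toy token ids standing in for tokenizer output (hypothetical values).
samples = [
    {"input_ids": [[1, 2], [3], [4, 5]]},          # 3 sequences -> 2 strides
    {"input_ids": [[6], [7], [8], [9], [10]]},     # 5 sequences -> 3 strides
]

flat = [concatenate(s) for sample in samples for s in stride(sample, 2)]

print(len(flat))  # 5
print(flat[0])    # {'input_ids': [1, 2, 3]}
```

As in the example above, two input samples become five output samples because each stride is emitted as its own sample.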

## Building a Pipeline

Mappers can also be composed into a pipeline using the `>>` (or `<<`) operator. For example, the code above can be rewritten as follows:

```python
pipeline = TokenizerMapper(
    input_field='sentences',
    tokenizer=tokenizer,
    add_special_tokens=False,
    truncation=True,
    max_length=80
) >> MultiSequenceStriderMapper(
    max_stride_count=2,
    max_length=512,
    tokenizer=tokenizer,
    length_reference_field='input_ids'
) >> TokensSequencesPaddingMapper(
    tokenizer=tokenizer,
    input_field='input_ids'
) >> AttentionMaskSequencePaddingMapper(
    tokenizer=tokenizer,
    input_field='attention_mask'
) >> SequencesConcatenateMapper()

dataset = ...

# apply the full pipeline to the dataset and keep the transformed result
dataset = pipeline.map(dataset)
```
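
For readers curious how operator-based composition of this kind can work in Python, the sketch below chains toy mappers with `>>` by overloading `__rshift__`. The `ToyMapper` class is hypothetical and exists only for illustration; it is not how SMASHED mappers are implemented, and it does not cover the `<<` operator.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


class ToyMapper:
    """Hypothetical stand-in for a mapper: holds an ordered list of per-sample functions."""

    def __init__(self, *fns: Callable[[Sample], Sample]) -> None:
        self.fns = list(fns)

    def __rshift__(self, other: "ToyMapper") -> "ToyMapper":
        # `a >> b` builds a new mapper that runs a's functions, then b's.
        return ToyMapper(*self.fns, *other.fns)

    def map(self, dataset: List[Sample]) -> List[Sample]:
        # Apply every function, in order, to every sample.
        output = []
        for sample in dataset:
            for fn in self.fns:
                sample = fn(sample)
            output.append(sample)
        return output


lowercase = ToyMapper(lambda s: {**s, "text": s["text"].lower()})
split = ToyMapper(lambda s: {**s, "tokens": s["text"].split()})

pipeline = lowercase >> split
result = pipeline.map([{"text": "Hello World"}])

print(result)  # [{'text': 'hello world', 'tokens': ['hello', 'world']}]
```

Because `>>` returns a new object, the individual mappers stay reusable in other pipelines.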

## Dataset Interfaces Available

The initial version of SMASHED supports two interfaces for datasets:

1. **`interfaces.simple.Dataset`**: A simple dataset representation that is just a list of Python dictionaries with some extra convenience methods to make it work with SMASHED. You can create a simple dataset by passing a list of dictionaries to `interfaces.simple.Dataset`.
2. **HuggingFace `datasets` library**: SMASHED mappers work with any dataset from HuggingFace, whether it is a regular or an iterable dataset.

## Developing SMASHED

To contribute to SMASHED, make sure to:

1. (If you are not part of AI2) Fork the repository on GitHub.
2. Clone it locally.
3. Create a new branch for the new feature.
4. Install development dependencies with `pip install -r dev-requirements.txt`.
5. Add your new mapper or feature.
6. Add unit tests.
7. Run tests, linting, and type checking from the root directory of the repo:
    1. *Style:* `black .` (Should format for you)
    2. *Style:* `flake8 .`  (Should return no error)
    3. *Style:* `isort .` (Should sort imports for you)
    4. *Static type check:* `mypy .` (Should return no error)
    5. *Tests:* `pytest -v --color=yes tests/` (Should return no error)
8. Commit, push, and create a pull request.
9. Tag `soldni` to review the PR.

### A note about versioning

SMASHED follows [Semantic Versioning](https://semver.org/). In short, the version number has the form MAJOR.MINOR.PATCH; increment the:

- MAJOR version when you make incompatible API changes,
- MINOR version when you add functionality in a backwards compatible manner; adding a mapper typically falls under this category, and
- PATCH version when you make backwards compatible bug fixes.
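
As a small worked example of these rules, the hypothetical helper below computes the next version string for each kind of change. The `bump` function is not part of SMASHED; it only illustrates that incrementing one component resets the components to its right.

```python
def bump(version: str, part: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for the given kind of change."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":   # incompatible API change
        return f"{major + 1}.0.0"
    if part == "minor":   # backwards compatible feature, e.g. a new mapper
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # backwards compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump("0.21.5", "patch"))  # 0.21.6
print(bump("0.21.5", "minor"))  # 0.22.0
print(bump("0.21.5", "major"))  # 1.0.0
```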
