Metadata-Version: 2.1
Name: quanto
Version: 0.0.2
Summary: A quantization toolkit for pytorch.
Author-email: David Corvoysier <david@huggingface.co>
License: Apache-2.0
Keywords: torch,quantization
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0

# Quanto

**DISCLAIMER**: this package is still an early prototype (pre-beta version), and not (yet) a Hugging Face product. Expect breaking changes and drastic modifications in scope and features.

🤗 Quanto is a Python quantization toolkit that provides several features that are either missing from or limited in the base [pytorch quantization tools](https://pytorch.org/docs/stable/quantization.html):

- all features are available in eager mode (works with non-traceable models),
- quantized models can be placed on any device (including CUDA),
- automatically inserts quantization and dequantization stubs,
- automatically inserts quantized functional operations,
- automatically inserts quantized modules (see below the list of supported modules),
- provides a seamless workflow from float model to dynamic to static quantized model,
- supports quantized model serialization as a `state_dict`.

Features yet to be implemented:

- quantize clone (quantization happens in-place for now),
- optimized integer kernels,
- quantized operators fusion,
- support `int4` weights,
- compatibility with [torch compiler](https://pytorch.org/docs/stable/torch.compiler.html) (aka dynamo).

## Supported modules

The following modules can be quantized:

- [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) (QLinear). Weights are quantized to `int8`, and biases to `int32`. Outputs are quantized to `int8`.

The next modules to be implemented are normalization layers, to allow the quantization of attention blocks:

- [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html),
- LlamaRMSNorm.

## Limitations and design choices

Quanto uses a strictly symmetric quantization scheme: each value is represented by an integer and a scale, with no zero-point.
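To make the "no zero-point" choice concrete, here is a minimal sketch in plain Python of a scale-only quantization. The function names and the choice of the maximum absolute value as the range are assumptions made for illustration; they are not taken from the Quanto code base.

```python
def quantize_symmetric(values, bits=8):
    """Quantize floats to signed integers with a single scale and no zero-point."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, which would otherwise give a zero scale
    scale = max_abs / qmax if max_abs > 0 else 1.0
    data = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return data, scale


def dequantize(data, scale):
    """Recover approximate floats: without a zero-point this is a plain multiplication."""
    return [q * scale for q in data]


weights = [0.5, -1.27, 0.0, 1.0]
q_weights, scale = quantize_symmetric(weights)
print(q_weights, scale)
print(dequantize(q_weights, scale))
```

Because there is no offset, the float value `0.0` always maps to the integer `0`, and the representable range is symmetric around zero.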

Quanto does not support mixed-precision quantization.

Although Quanto uses integer activations and weights, the current implementation falls back to `float32` operations for integer inputs. No latency improvement should therefore be expected for now, although weight storage and on-device memory usage should be lower.

## Installation

Quanto is available as a pip package.

```
pip install quanto
```

## Quantization workflow

Quanto does not make a clear distinction between dynamic and static quantization: models are always dynamically quantized, but their weights can later be "frozen" to integer values.

A typical quantization workflow consists of the following steps:

1. Quantize

The first step converts a standard float model into a dynamically quantized model.

```
quantize(model)
```

2. Calibrate (optional)

Activations are quantized using a default `[-1, 1]` range, which can lead to severe clipping or inaccurate values.

Quanto supports a calibration mode that adjusts the activation ranges while representative samples are passed through the quantized model.

```
with calibration():
    model(samples)
```

Note that during calibration, all activations and weights are dequantized and inference happens with float precision.
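The effect of calibration can be illustrated with a small plain-Python sketch. It assumes that the calibrated range is simply the largest magnitude observed over the samples; the text above does not say how Quanto actually derives the ranges, and `fake_quantize` and `calibrate_range` are hypothetical helpers.

```python
QMAX = 127  # largest magnitude representable with int8 in a symmetric scheme


def fake_quantize(values, max_range):
    """Quantize to int8 over [-max_range, max_range], then dequantize."""
    scale = max_range / QMAX
    quantized = [max(-QMAX, min(QMAX, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


def calibrate_range(batches):
    """Hypothetical calibration: keep the largest magnitude seen over all samples."""
    return max(abs(v) for batch in batches for v in batch)


activations = [0.25, -0.8, 3.2, -2.4]

# With the default [-1, 1] range, values outside the range are clipped
default_out = fake_quantize(activations, max_range=1.0)

# After observing representative samples, the range covers the actual values
calibrated_range = calibrate_range([activations, [1.5, -3.0]])
calibrated_out = fake_quantize(activations, max_range=calibrated_range)

print(default_out)
print(calibrated_out)
```

With the default range, `3.2` and `-2.4` both saturate at the range bounds, whereas the calibrated range reproduces every activation to within half a quantization step.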

3. Tune, aka Quantization-Aware-Training (optional)

If the accuracy of the quantized model degrades too much, the model can be tuned for a few epochs to recover the accuracy of the float model.

```
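import torch

# model, train_loader, optimizer and device are assumed to be defined beforehand
model.train()
for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    # The quantized model outputs are dequantized before evaluating the loss
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()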
```

4. Freeze integer weights

When freezing a model, its float weights are replaced by quantized integer weights.

```
freeze(model)
```

Please refer to the [examples](https://github.com/huggingface/quanto/tree/main/examples) for complete instances of this workflow.

## Implementation details

Under the hood, Quanto uses a `torch.Tensor` subclass (`QTensor`) to dispatch `aten` base operations to integer operations.

All integer operations accept `QTensor` with `int8` data.

Most arithmetic operations return a `QTensor` with `int32` data.

In addition to the quantized tensors, Quanto uses quantized modules as substitutes to some base torch modules to:

- store quantized weights,
- gather input and output scales to rescale QTensor `int32` data to `int8`.
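The arithmetic behind this rescaling can be sketched in plain Python for a single dot product. This is only an illustration under stated assumptions: the helper names are hypothetical, the scales are picked by hand, and Python integers are unbounded, so the `int32` accumulator width is not enforced here.

```python
INT8_MAX = 127


def quantize(values, scale):
    """Map floats to int8 values for a given scale (symmetric, no zero-point)."""
    return [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in values]


def int_dot(q_inputs, q_weights):
    """Integer dot product: int8 x int8 products are summed in a wider accumulator."""
    return sum(a * b for a, b in zip(q_inputs, q_weights))


def rescale_to_int8(acc, input_scale, weight_scale, output_scale):
    """Rescale a wide accumulator to int8 using the input, weight and output scales."""
    # In float, the accumulator stands for acc * input_scale * weight_scale
    ratio = input_scale * weight_scale / output_scale
    return max(-INT8_MAX, min(INT8_MAX, round(acc * ratio)))


inputs, weights = [0.4, -0.25, 1.0], [0.1, 0.2, -0.3]
input_scale, weight_scale, output_scale = 1.0 / 127, 0.3 / 127, 0.5 / 127

acc = int_dot(quantize(inputs, input_scale), quantize(weights, weight_scale))
q_out = rescale_to_int8(acc, input_scale, weight_scale, output_scale)

# Compare the dequantized integer result with the float reference
float_out = sum(a * b for a, b in zip(inputs, weights))
print(q_out * output_scale, float_out)
```

The accumulator implicitly carries the product of the input and weight scales, which is why a module needs both the input and output scales to bring the result back to `int8`.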

Eventually, the produced quantized graph should be passed to a specific inductor backend to fuse each rescale into the preceding operation.

Examples of fused operations can be found in [torch-int](https://github.com/Guangxuan-Xiao/torch-int).
