Metadata-Version: 2.1
Name: dfm-sentence-transformers
Version: 0.3.3
Summary: Module for finetuning dfm base-models to sentence transformers
License: MIT
Author: Márton Kardos
Author-email: power.up1163@gmail.com
Requires-Python: >=3.8,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: catalogue (>=2.0.0,<3.0.0)
Requires-Dist: confection (>=0.1.0,<0.2.0)
Requires-Dist: datasets (>=2.14.0,<3.0.0)
Requires-Dist: radicli (>=0.0.25,<0.0.26)
Requires-Dist: sentence-transformers (>=2.2.0,<3.0.0)
Requires-Dist: torch (>=2.1.0,<3.0.0)
Description-Content-Type: text/markdown

<img align="left" width="82" height="82" src="assets/logo.svg">

# dfm-sentence-transformers

<br>

Sentence transformers for the Danish Foundation Models Project.

## Training

Install the package from PyPI:

```bash
pip install dfm-sentence-transformers
```

You have to specify basic model and training parameters, as well as all the tasks/datasets the model should be trained on.

Here is an example of a config:
```
[model]
name="dfm-sentence-encoder-small-v1"
base_model="chcaa/dfm-encoder-small-v1"
device="cpu"

[training]
epochs=50
steps_per_epoch=500
warmup_steps=100
batch_size=64
wandb_project="dfm-sentence-transformers"
checkpoint_repo="checkpoints-dfm-sentence-encoder-small-v1"

[tasks]

[tasks.bornholmsk]
@tasks="multiple_negatives_ranking"
sentence1="da_bornholm"
sentence2="da"

[tasks.bornholmsk.dataset]
@loaders="load_dataset"
path="strombergnlp/bornholmsk_parallel"

```

Then you can train a sentence transformer by using the `finetune` command.

```bash
python3 -m dfm_sentence_trf finetune training.cfg -o "model/"
```

You can push the finetuned model to HuggingFace Hub:

```bash
python3 -m dfm_sentence_trf push_to_hub training.cfg --model_path "model/"
```

## Evaluation

You can evaluate trained models with the [Scandinavian Embedding Benchmark](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/).

```bash
pip install seb
python3 -m seb "model/" "da"
```

## Tasks

You can add an arbitrary number of tasks to the model's config.
All tasks must have a unique name but their name is ignored in the actual training procedure.
Datasets of tasks with the same loss function are mixed together so that the model can learn them simultaneously in mixed batches.
The package comes with three default tasks you can use for different objectives:

### 1. Multiple Negatives Ranking
If you have a parallel corpus of sentences (paraphrase, translation, etc.) use this task.
Batches consist of positive sentence pairs, and negative samples are constructed by taking all non-matching pairs in a batch.

#### Parameters:

Param | Type | Description | Default
----- | ---- | ----------- | -------
sentence1 | str | Name of the first sentence column in the dataset. | -
sentence2 | str | Name of the second sentence column in the dataset. | -
scale | float | Output of similarity function is multiplied by scale value. | 20.0

```
[tasks.faroese]
@tasks="multiple_negatives_ranking"
sentence1="fo"
sentence2="da"

[tasks.faroese.dataset]
@loaders="load_dataset"
path="strombergnlp/itu_faroese_danish"

```

### 2. Cosine Similarity
Good for STS datasets.
Minimizes mean squared error of estimated and true sentence cosine similairites.

#### Parameters:

Param | Type | Description | Default
----- | ---- | ----------- | -------
sentence1 | str | Name of the first sentence column in the dataset. | -
sentence2 | str | Name of the second sentence column in the dataset. | -
similarity | str | Name of the gold standard similarity column. | -

```
[tasks.sts]
@tasks="cosine_similarity"
sentence1="sent1"
sentence2="sent1"
similarity="label"

[tasks.sts.dataset]
...
```

### 3. Softmax
Good for NLI datasets. Uses softmax classification loss based on concatenated embeddings and their difference.
Beware that these tasks are never joined due to potentially different labeling schemes.

#### Parameters:

Param | Type | Description | Default
----- | ---- | ----------- | -------
sentence1 | str | Name of the first sentence column in the dataset. | -
sentence2 | str | Name of the second sentence column in the dataset. | -
label | str | Name of the label column in the dataset. | -

```
[tasks.nli]
@tasks="softmax"
sentence1="premise"
sentence2="hypothesis"
label="label"

[tasks.nli.dataset]
...
```

## Datasets

Datasets for each task are loaded with :hugs: `load_dataset()` function, but only the first argument, and a name are accepted.
You can use local or remote datasets, and they can be of any of the canonical file formats (JSON, JSONL, CSV, Parquet...).

```
...

[tasks.local.dataset]
@loaders="load_dataset"
path="local/dataset/file.jsonl"

...

[tasks.huggingface_hub.dataset]
@loaders="load_dataset"
path="username/dataset"
```



