Metadata-Version: 2.1
Name: sqlgym
Version: 0.1.2
Summary: SQLGym: A portable Gymnasium environment for SQLite databases.
Author-email: KYLN24 <1296845690@qq.com>
Project-URL: Repository, https://github.com/KYLN24/sqlgym.git
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: gymnasium
Requires-Dist: tqdm
Provides-Extra: react
Requires-Dist: openai ; extra == 'react'
Provides-Extra: sft
Requires-Dist: sqlgym[react] ; extra == 'sft'
Requires-Dist: torch ; extra == 'sft'
Requires-Dist: transformers ; extra == 'sft'
Requires-Dist: deepspeed ; extra == 'sft'
Requires-Dist: datasets ; extra == 'sft'
Requires-Dist: accelerate ; extra == 'sft'

# SQLGym

SQLGym is a portable Gymnasium environment for SQLite databases. It is designed for platforms that cannot use Docker (e.g. users without root privileges).

## Setup

Simply run `pip install sqlgym`. If you want to generate a ReAct dataset and fine-tune a model, clone the repository and install from source.

```bash
# Clone this repository
git clone https://github.com/KYLN24/sqlgym.git
# or via SSH
# git clone git@github.com:KYLN24/sqlgym.git

cd sqlgym

# Install this package
pip install ".[sft]"
```

## Prepare Dataset

```bash
# Make a directory to save data
mkdir .data
cd .data
```

This project currently supports the BIRD-SQL dataset.

```bash
mkdir bird
cd bird

# Download BIRD-SQL Dataset
wget -c https://bird-bench.oss-cn-beijing.aliyuncs.com/train.zip
unzip train.zip
cd train
unzip train_databases.zip
cd ..

wget -c https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip
unzip dev.zip
cd dev
unzip dev_databases.zip
cd ..
```
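
Before moving on, it can help to confirm that the archives extracted where the later steps expect them. The helper below is a minimal sketch, not part of `sqlgym`: the function name `missing_bird_dirs` is hypothetical, and the directory names `train/train_databases` and `dev/dev_databases` are an assumption inferred from the unzip commands above.

```python
from pathlib import Path

# Assumption: these are the directories produced by the unzip commands above.
EXPECTED_DIRS = ("train/train_databases", "dev/dev_databases")


def missing_bird_dirs(bird_path):
    """Return the expected sub-directories that do not exist under bird_path."""
    root = Path(bird_path)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from the repository root, where the .data directory was created.
    missing = missing_bird_dirs(".data/bird")
    print("missing:", missing if missing else "nothing")
```

If the list is not empty, re-run the corresponding `unzip` step, or adjust `EXPECTED_DIRS` if the archive layout differs from the assumption above.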

## Usage

```python
from sqlgym import SqlGymEnv
from sqlgym.datasets import BirdDataset

dataset = BirdDataset(
    bird_path=".data/bird",
    mode="dev",
)

env = SqlGymEnv(dataset)

print(env.reset(0))
print(env.step(dataset[0].gt))
```
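
For readers new to this kind of environment, the underlying idea is to run a SQL string against a SQLite database and hand back either the rows or the error message. The sketch below shows that idea with only the standard-library `sqlite3` module. It is not SQLGym's implementation: the helpers `run_query` and `same_result` and the toy `users` table are hypothetical, and comparing result rows regardless of order is just one possible way to check a query against a ground-truth query.

```python
import sqlite3


def run_query(conn, sql):
    """Execute sql and return (rows, error); exactly one of them is None."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def same_result(conn, predicted_sql, gold_sql):
    """Compare two queries by their result rows, ignoring row order."""
    pred_rows, pred_err = run_query(conn, predicted_sql)
    gold_rows, gold_err = run_query(conn, gold_sql)
    if pred_err is not None or gold_err is not None:
        return False
    # Sort by repr so rows containing NULLs or mixed types stay comparable.
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


# Toy in-memory database standing in for a BIRD database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("bob",)])

print(run_query(conn, "SELECT name FROM users ORDER BY id"))
print(run_query(conn, "SELECT * FROM missing_table"))
print(same_result(conn, "SELECT name FROM users",
                  "SELECT name FROM users ORDER BY id DESC"))
```

Returning the error text instead of raising lets an agent read the message and retry with a corrected query.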

## SFT

You can use `scripts/make_datasets.py` to generate an SFT dataset.

```bash
# The dataset will be created at .data/bird/train.jsonl and .data/bird/dev.jsonl
python -u scripts/make_datasets.py --bird_path=.data/bird
```
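
To sanity-check the generated files before spending API calls or GPU time on them, you can look at the first few records. The snippet below is a minimal sketch that assumes only that the output is in JSON Lines format (one JSON value per line), as the `.jsonl` extension suggests. The helper name `peek_jsonl` is hypothetical, and no particular field names are assumed.

```python
import json
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as Python objects."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not raise a decode error.
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(lines, n)]
```

For example, `peek_jsonl(".data/bird/train.jsonl", n=1)` returns a list holding the first record, which shows the fields the script actually wrote.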

You can use `scripts/make_react_dataset.py` to convert it to ReAct format, with thoughts generated by GPT.

```bash
# Edit the script to add your OpenAI api_key.
# Change base_url and other generation parameters as you wish.
python -u scripts/make_react_dataset.py \
       --data_path=.data/bird/train.jsonl \
       --save_path=.data/bird/train_react.jsonl
```

Then use `scripts/train.py` or `scripts/train_react.py` to fine-tune a chat model. The tokenizer must support the `apply_chat_template` method.

```bash
torchrun --nproc_per_node=8 scripts/train.py \
         --model=meta-llama/Llama-2-7b-chat-hf \
         --train_set=.data/bird/train.jsonl \
         --output_dir=.data/output

torchrun --nproc_per_node=8 scripts/train.py \
         --model=meta-llama/Llama-2-7b-chat-hf \
         --train_set=.data/bird/train_react.jsonl \
         --output_dir=.data/output \
         --react
```
