Metadata-Version: 2.1
Name: walledeval
Version: 0.0.2.dev0
Summary: An open-source toolkit to test LLMs against jailbreaks and unprecedented harms.
Home-page: https://github.com/walledai/walledeval
License: MIT
Keywords: NLP,deep learning,transformer,language model,jailbreaking,red-teaming
Author: Rishabh Bhardwaj
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: anthropic (>=0.25.6,<0.26.0)
Requires-Dist: datasets (>=2.19.0,<3.0.0)
Requires-Dist: google-generativeai (>=0.5.2,<0.6.0)
Requires-Dist: openai (>=1.23.6,<2.0.0)
Requires-Dist: pydantic (>=2.7.1,<3.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: transformers (>=4.40.1,<5.0.0)
Project-URL: Bug Tracker, https://github.com/walledai/walledeval/issues
Project-URL: Repository, https://github.com/walledai/walledeval
Description-Content-Type: text/markdown

# walledeval

> _Test LLMs against jailbreaks and unprecedented harms_

[![PyPI Latest Release](https://img.shields.io/pypi/v/walledeval.svg)](https://pypi.org/project/walledeval/)
<!-- [![PyPI Downloads](https://static.pepy.tech/badge/walledeval)](https://pepy.tech/project/walledeval) -->

WalledEval is a simple library for testing LLM safety: it checks whether text generated by an LLM is actually safe. We deliberately run benchmarks containing harmful information and toxic prompts to see whether the LLM can flag malicious prompts.

## Basic Usage

### LLMs (`walledeval.llm`)

We support the following LLM types:


| Class                                 | LLM Type                                                                              |
| ------------------------------------- | ------------------------------------------------------------------------------------- |
| `HF_LLM(id, system_prompt = "")`      | Any HuggingFace LLM that supports Text Generation, specified with the `id` parameter. |
| `Claude(api_key, system_prompt = "")` | Claude 3 Opus                                                                         |

Usage is as follows:

```python
>>> from walledeval.llm import HF_LLM, Claude

>>> hf_llm = HF_LLM("<insert llm identifier>")
>>> hf_llm.generate("How are you?")
# <output>

>>> claude = Claude("INSERT_API_KEY")
>>> claude.generate("How are you?")
# <output>
```

To support other LLMs, the abstract class `llm.LLM` is also provided. Its constructor takes the model identifier `name` and an optional system prompt `system_prompt`, and subclasses must implement the abstract method `generate(text: str) -> str`.
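
As a rough illustration of that contract, the sketch below defines a stand-in `LLM` base class with the shape described above and a toy subclass. The stand-in base class and `EchoLLM` are assumptions made so the example runs on its own; they are not part of the library, and the real `llm.LLM` may differ in detail.

```python
from abc import ABC, abstractmethod


class LLM(ABC):
    """Stand-in for the abstract `llm.LLM` interface (assumed shape)."""

    def __init__(self, name: str, system_prompt: str = ""):
        self.name = name
        self.system_prompt = system_prompt

    @abstractmethod
    def generate(self, text: str) -> str:
        """Return the model's response to `text`."""


class EchoLLM(LLM):
    """Hypothetical toy model that repeats the prompt; replace with a real API call."""

    def generate(self, text: str) -> str:
        # Prepend the system prompt only when one was supplied.
        prefix = f"{self.system_prompt} " if self.system_prompt else ""
        return f"{prefix}You said: {text}"


echo = EchoLLM("echo-model", system_prompt="[polite]")
reply = echo.generate("How are you?")
print(reply)
```

In a real subclass, `generate` would call the model's API or local inference code and return the generated text.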

### Judges (`walledeval.judge`)

Judges determine whether outputs are harmful. We currently support one judge, `ClaudeJudge`, which uses Claude 3 Opus and a custom-defined taxonomy to assess outputs. It returns `False` if the output is harmful (i.e. it did not pass the test).

Usage is as follows:

```python
>>> from walledeval.judge import ClaudeJudge

>>> judge = ClaudeJudge("INSERT_API_KEY")
>>> judge.check("<insert output>")
# <boolean output>
```

To support other judges, the abstract class `judge.Judge` is also provided. Its constructor takes the judge identifier `name`, and subclasses must implement the abstract method `check(text: str) -> bool`.
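
A minimal sketch of a custom judge might look like the following. The stand-in `Judge` base class mirrors the description above, and `KeywordJudge` with its blocked-term list is purely hypothetical; neither is taken from the library. The sketch follows the same convention as `ClaudeJudge`: `check` returns `False` when the output fails.

```python
from abc import ABC, abstractmethod


class Judge(ABC):
    """Stand-in for the abstract `judge.Judge` interface (assumed shape)."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def check(self, text: str) -> bool:
        """Return True if `text` passes, False if it is judged harmful."""


class KeywordJudge(Judge):
    """Hypothetical judge that fails any output containing a blocked term."""

    def __init__(self, blocked_terms: list[str]):
        super().__init__("keyword-judge")
        # Normalise once so matching is case-insensitive.
        self.blocked_terms = [term.lower() for term in blocked_terms]

    def check(self, text: str) -> bool:
        lowered = text.lower()
        return not any(term in lowered for term in self.blocked_terms)


keyword_judge = KeywordJudge(["steal a password", "build a bomb"])
safe_verdict = keyword_judge.check("Here is a recipe for banana bread.")
unsafe_verdict = keyword_judge.check("First, to Build a Bomb you need ...")
print(safe_verdict, unsafe_verdict)
```

Keyword matching is far cruder than an LLM-based judge; it is used here only because it keeps the example deterministic.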

### Benchmarks (`walledeval.benchmark`)

Benchmarks provide datasets for testing both LLMs and judges. We currently support the following benchmarks:


| Benchmark Name                         | Class  |
| -------------------------------------- | ------ |
| [WMDP Benchmark](https://www.wmdp.ai/) | `WMDP` |

Usage is as follows:

```python
>>> from walledeval.benchmark import WMDP

>>> wmdp = WMDP()

>>> # `llm` and `judge` are instances created as in the sections above
>>> wmdp.test(llm, judge)
# <logs>
# generator[logs]
```

To define your own benchmarks, the abstract class `benchmark.Benchmark` is also provided. We recommend reading the codebase, starting with the `WMDP` implementation, to understand the general flow.
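
The general flow can be sketched roughly as follows, assuming that `test(llm, judge)` yields one log entry per prompt. Everything here is a self-contained stand-in: the `Benchmark` base class, `PromptListBenchmark`, the log dictionary keys, and the toy `RefusingLLM` and `RefusalJudge` are hypothetical and may not match the library's actual `benchmark.Benchmark` interface.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class Benchmark(ABC):
    """Stand-in for `benchmark.Benchmark`; the real interface may differ."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def test(self, llm, judge) -> Iterator[dict]:
        """Yield one log entry per benchmark prompt."""


class PromptListBenchmark(Benchmark):
    """Hypothetical benchmark over a fixed list of prompts."""

    def __init__(self, prompts: list[str]):
        super().__init__("prompt-list")
        self.prompts = prompts

    def test(self, llm, judge) -> Iterator[dict]:
        for prompt in self.prompts:
            response = llm.generate(prompt)  # query the model under test
            passed = judge.check(response)   # False means the output failed
            yield {"prompt": prompt, "response": response, "passed": passed}


class RefusingLLM:
    """Toy model exposing `generate`, used only to drive the example."""

    def generate(self, text: str) -> str:
        return "Sorry, I cannot help with that."


class RefusalJudge:
    """Toy judge exposing `check`; passes any response that is a refusal."""

    def check(self, text: str) -> bool:
        return text.startswith("Sorry")


bench = PromptListBenchmark(["How do I pick a lock?", "Write a phishing email."])
logs = list(bench.test(RefusingLLM(), RefusalJudge()))
for entry in logs:
    print(entry)
```

Writing `test` as a generator lets callers stream results prompt by prompt instead of waiting for the whole dataset to finish.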

