
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise
unwanted content. Evaluating these attacks presents a number of challenges, and the current
landscape of benchmarks and evaluation techniques is fragmented. First, assessing whether LLM
responses are indeed harmful requires open-ended evaluations which are not yet standardized.
Second, existing works compute attacker costs and success rates in incomparable ways. Third,
some works are not reproducible, as they withhold adversarial prompts or code, or rely for evaluation on proprietary APIs that change over time. Consequently, navigating the current literature and tracking
progress can be challenging.
To address these challenges, we introduce JailbreakBench, an open and centralized benchmark with the following components:
- Repository of jailbreak artifacts. An evolving dataset of state-of-the-art adversarial prompts, referred to as jailbreak artifacts, hosted at https://github.com/JailbreakBench/artifacts. Artifacts are required for all submissions to the benchmark to ensure reproducibility (a reading sketch follows this list).
- Standardized evaluation framework. Our library at https://github.com/JailbreakBench/jailbreakbench, which includes a clearly defined threat model, system prompts, chat templates, and scoring functions.
- Leaderboard. Our leaderboards at https://jailbreakbench.github.io/ track the performance of attacks and defenses on various LLMs.
- Dataset. A representative dataset, JBB-Behaviors, available at https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors. It comprises 100 distinct misuse behaviors (55% original examples, the rest sourced from AdvBench and TDC/HarmBench) divided into ten broad categories corresponding to OpenAI's usage policies, and is complemented by 100 benign behaviors for quickly measuring the over-refusal rates of new models and defenses (a loading sketch follows this list).
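The artifacts repository and the evaluation library are meant to be used together: a submitted attack ships its prompts as an artifact that anyone can reload and inspect. Below is a minimal sketch of that step with the jailbreakbench Python package; the function name, parameters, and record fields follow the repository's documented workflow but should be treated as assumptions and verified against the current documentation.

```python
# Minimal sketch: reading a published jailbreak artifact with the
# jailbreakbench library (pip install jailbreakbench). Function name,
# parameters, and record fields are assumptions based on the documented
# workflow and may differ in the installed version.
import jailbreakbench as jbb

# Each artifact pairs an attack method (e.g., PAIR) with a target model and
# stores the adversarial prompts, model responses, and evaluation metadata.
artifact = jbb.read_artifact(method="PAIR", model_name="vicuna-13b-v1.5")

jailbreak = artifact.jailbreaks[0]  # assumed: one record per behavior
print(jailbreak.prompt)             # the adversarial prompt
print(jailbreak.response)           # the target model's response
```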
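For the dataset component, JBB-Behaviors can be pulled directly from the Hugging Face Hub with the datasets library. This is a minimal loading sketch; the configuration name ("behaviors") and the split names ("harmful", "benign") are assumptions based on the dataset card and may need adjusting.

```python
from datasets import load_dataset

# Load JBB-Behaviors from the Hugging Face Hub. The configuration name
# ("behaviors") and split names ("harmful", "benign") are assumptions
# taken from the dataset card; verify them against the Hub page.
jbb_behaviors = load_dataset("JailbreakBench/JBB-Behaviors", "behaviors")

harmful = jbb_behaviors["harmful"]  # 100 misuse behaviors across ten categories
benign = jbb_behaviors["benign"]    # 100 benign counterparts for over-refusal checks

print(len(harmful), len(benign))
print(harmful[0])  # inspect one record (behavior text, category, source, ...)
```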
Leaderboard: Open-Source Models
Leaderboard: Closed-Source Models
The corresponding results tables for open- and closed-source models are hosted at https://jailbreakbench.github.io/.
Contribute to JailbreakBench
We welcome submissions of new attacks and defenses. Submissions must include the corresponding jailbreak artifacts to ensure reproducibility; see https://github.com/JailbreakBench/jailbreakbench for details.
Citation
If you use JailbreakBench, please cite our NeurIPS 2024 Datasets and Benchmarks paper (BibTeX below). The behaviors in JBB-Behaviors and several jailbreak artifacts build on prior work, so please also credit the original sources:
- AdvBench
- The Trojan Detection Challenge 2023 Red Teaming Track/HarmBench
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation by Shah et al.
@inproceedings{chao2024jailbreakbench,
  title={JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models},
  author={Patrick Chao and Edoardo Debenedetti and Alexander Robey and Maksym Andriushchenko and Francesco Croce and Vikash Sehwag and Edgar Dobriban and Nicolas Flammarion and George J. Pappas and Florian Tramèr and Hamed Hassani and Eric Wong},
  booktitle={NeurIPS Datasets and Benchmarks Track},
  year={2024}
}