With the rapid development of Generative AI, ensuring their safety, security, and
trustworthiness is paramount. In response, researchers and practitioners have proposed red
teaming to identify such risks, enabling their mitigation. Red teaming refers to adversarial
tactics employed to identify flaws in GenAI-based systems, such as security vulnerabilities,
harmful or discriminating outputs, privacy breaches, and copyright law violations.While several recent works proposed comprehensive evaluation frameworks for AI models, the rapid
evolution of AI necessitates ongoing updates to benchmarks to avoid them from becoming outdated
due to models being excessively tailored to these benchmarks. Moreover, such evaluations must
also incorporate the latest findings from AI safety research, which consistently expose new
breaches in generative models.
In response to the findings from red teaming exercises, researchers have taken action to curb
undesirable behaviors in AI models through various methods. These include aligning the models
with ethical standards, defending against jailbreak attempts, preventing the generation of
untruthful content, erasing undesired concepts from the models, and even leveraging adversaries
for beneficial purposes. Despite these efforts, a multitude of risks remain unresolved,
underscoring the importance of continuous research in addressing the challenges identified
through red teaming. The goal of this workshop is to bring leading researchers on AI safety together to discuss
pressing real-world challenges faced by ever-evolving generative models. We put a
special emphasis on red teaming and quantitative evaluations towards probing the limitations of
our models. Some fundamental questions that this workshop will address include
- What are new security and safety risks in foundation models?
- How do we discover and quantitatively evaluate harmful capabilities of these models?
- How can we mitigate risks found through red teaming?
- What are the limitations of red teaming?
- Can we make safety guarantees?