ACL 2025 Tutorial
Guardrails and Security for LLMs:
Safe, Secure, and Controllable Steering of LLM Applications
About this tutorial
Pretrained generative models, especially large language models (LLMs), provide novel ways for users to interact with computers. While earlier generative NLP research and applications aimed at highly domain- or task-specific solutions, current LLMs and the applications built on them (e.g., dialogue systems, agents) are versatile across many tasks and domains. Although LLMs are trained to be helpful and aligned with human preferences (e.g., harmlessness), enforcing robust guardrails on them remains a challenge. And even when protected against rudimentary attacks, LLMs, like other complex software, can be vulnerable to attacks that use sophisticated adversarial inputs.
This tutorial provides a comprehensive overview of the key guardrail mechanisms developed for LLMs, along with evaluation methodologies and a detailed security assessment protocol, including auto red-teaming of LLM-powered applications. Our aim is to move beyond the discussion of single-prompt attacks and evaluation frameworks and to address how guardrailing can be done in complex dialogue systems that employ LLMs.
We aim to provide a cutting-edge and complete overview of the deployment risks associated with LLMs in production environments. While the main focus is on how to effectively protect against safety and security threats, we also tackle the more recent topic of dialogue and topical rails, including respecting custom policies. In addition, we examine the new attack vectors introduced by LLM-enabled dialogue systems, such as methods for circumventing dialogue steering.
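To make the notion of guardrailing concrete, below is a minimal sketch of input and output rails wrapped around a single dialogue turn. The names (`guarded_turn`, `call_llm`, `call_safety_classifier`, `BLOCKED_MESSAGE`) are illustrative placeholders, not the API of any specific framework covered in the tutorial.

```python
# Minimal sketch of input/output guardrails around one dialogue turn.
# `call_llm` and `call_safety_classifier` are hypothetical placeholders for
# whatever generation and moderation backends a deployment actually uses.
from typing import Callable

BLOCKED_MESSAGE = "I can't help with that request."


def guarded_turn(
    user_message: str,
    call_llm: Callable[[str], str],
    call_safety_classifier: Callable[[str], bool],  # True -> content is unsafe
) -> str:
    # Input rail: refuse before the LLM ever sees an unsafe request.
    if call_safety_classifier(user_message):
        return BLOCKED_MESSAGE

    draft_response = call_llm(user_message)

    # Output rail: independently screen the model's draft before returning it.
    if call_safety_classifier(draft_response):
        return BLOCKED_MESSAGE

    return draft_response
```

Production systems typically extend this pattern with topical rails, multi-turn context, and custom policy checks, which is the system-level perspective the tutorial takes.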
Schedule (tentative)
| Tutorial topic | Duration |
|---|---|
| Introduction [slides] | 5 min |
| Types of LLM guardrails | |
| Guardrails and LLM security | |
| Content moderation and safety [slides] | 35 min |
| Taxonomies of safety risks | |
| Landscape of safety models and datasets | |
| Synthetic data generation for LLM safety | |
| Custom safety policies | |
| Safety and reasoning models | |
| System-level considerations | |
| LLM security [slides] | 30 min |
| Overview | |
| Tools for assessing LLM security | |
| Auto red-teaming | |
| Adversarial attacks | |
| Alignment attacks [slides] | 20 min |
| Data poisoning and sleeper agents | |
| Instruction hierarchy | |
| Trojan horse and safety backdoors | |
| Coffee break (3:30-4pm CET) | 30 min |
| Dialogue rails and security [slides] | 20 min |
| Dialogue and topical rails | |
| Evaluation of dialogue rails | |
| Multi-turn/dialogue attacks and protection | |
| Multilingual guardrails [slides] | 15 min |
| Multilingual safety models | |
| Inference-time steering for safety [slides] | 20 min |
| Activation-based steering | |
| Circuit breakers | |
| Inference-time steering for concept / topical guardrails | |
| LLM agent safety [slides] | 30 min |
| Safety challenges and measures for different types of basic agents | |
| Assessing agent safety | |
| Multi-agent safety risks | |
| Multi-agents for enhancing AI safety | |
| Final recommendations | 5 min |
| Total | 180 min |
Reading List
Papers in bold are the suggested reading list.
Content moderation and safety
- AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies (Zeng et al., 2024)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (Han et al., 2024)
- AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails (Ghosh et al., 2025)
- Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements (Zhang et al., 2024)
- BingoGuard: LLM Content Moderation Tools with Risk Levels (Yin et al., 2025)
- PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages (Kumar et al., 2025)
Multilingual guardrails
- The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It (Yong et al., 2025)
- OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities (Verma et al., 2025)
- PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages (Kumar et al., 2025)
- Teaching LLMs to Abstain across Languages via Multilingual Feedback (Feng et al., 2024)
- MPO: Multilingual Safety Alignment via Reward Gap Optimization (Zhao et al., 2025)
Inference-time steering for safety
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance (Wang et al., 2024)
- Towards Inference-time Category-wise Safety Steering for Large Language Models (Bhattacharjee et al., 2024)
- Improving alignment and robustness with circuit breakers (Zou et al., 2024)
- Steering Language Model Refusal with Sparse Autoencoders (O’Brien et al., 2024)
- AXBENCH: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (Wu et al., 2025)