Controllable Safety Alignment

This is the codebase for our ICLR 2025 paper "Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements".

In addition to the codebase, we have publicly released the CoSApien dataset. We will release the CoSAlign model and synthetic datasets soon. Please see our HuggingFace collection: Controllable Safety Alignment 🤗.

CoSApien👥: A human-authored benchmark: link
Llama3.1-8B-CoSAlign🤖: A safety-configurable Llama3.1-8B: Coming-Soon

If you have questions, feel free to email the authors.

Evaluating Controllability

Please use run_eval_multistep.sh to evaluate controllability. This script pipelines the steps needed for evaluation:

(1) Generate the candidate model's responses on the test set
(2) Generate evaluation responses from the evaluator model
(3) Parse the evaluation responses and aggregate the final evaluation results

You will need to provide three command-line arguments: the name (or path) of the candidate model, a display ("pretty") name for the candidate model, and the name of the system prompt template.
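As a rough illustration of how the three arguments fit together, the sketch below assembles the command from Python and prints it without running it by default. It assumes the arguments are positional and passed in the order listed above; that order is a guess from the prose, so check run_eval_multistep.sh before relying on it. The helper names (build_eval_command, run_eval) and the example values are hypothetical and not part of this repository.

```python
import shlex
import subprocess
from typing import List


def build_eval_command(model: str, pretty_name: str, template: str,
                       script: str = "run_eval_multistep.sh") -> List[str]:
    """Assemble the argv list for the evaluation script.

    ASSUMPTION: the script takes three positional arguments in this order
    (model, pretty name, prompt template). Verify against the script itself.
    """
    for label, value in (("model", model),
                         ("pretty_name", pretty_name),
                         ("template", template)):
        if not value:
            raise ValueError(f"{label} must be a non-empty string")
    return ["bash", script, model, pretty_name, template]


def run_eval(model: str, pretty_name: str, template: str,
             dry_run: bool = True) -> int:
    """Print the command; execute it only when dry_run is False."""
    cmd = build_eval_command(model, pretty_name, template)
    print(shlex.join(cmd))  # shell-quoted form, safe to copy and paste
    if dry_run:
        return 0
    # check=True raises CalledProcessError if the script exits non-zero.
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    # Placeholder values for illustration only.
    run_eval("path/to/candidate-model", "my-model", "template-name")
```

Keeping the dry run as the default makes it easy to inspect the exact command line before starting a long evaluation.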

Constructing the CoSAlign Training Data

The process for creating the CoSAlign training data is documented in the data_processing/ directory. Please see data_processing/README.md for details.

Conducting SFT and DPO

We conduct SFT and DPO with code adapted from the DPO repo. Please see the dpo/ directory for details and dpo/train.sh for an example training script.

Reference

If you find our work useful, please consider citing our paper:

@inproceedings{zhang2025controllablesafetyalignment,
      title={Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements}, 
      author={Jingyu Zhang and Ahmed Elgohary and Ahmed Magooda and Daniel Khashabi and Benjamin Van Durme},
      year={2025},
      url={https://arxiv.org/abs/2410.08968},
      booktitle = {International Conference on Learning Representations (ICLR)}
}
