CipherChat is a novel framework for systematically examining the generalizability of safety alignment to non-natural languages, namely ciphers.
If you have any questions, please feel free to email the first author: Youliang Yuan.
👉 Paper
For more details, please refer to our ICLR 2024 paper.
LOVE💗 and Peace🌊
RESEARCH USE ONLY✅ NO MISUSE❌
Our results
We provide our results (query-response pairs) in the `experimental_results` folder. These files can be loaded with `torch.load()`, which returns a list: the first element is the config, and the remaining elements are the query-response pairs.
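For example, a minimal loading sketch (the file name below is a placeholder; use any file from `experimental_results`):

```python
import torch

# Placeholder path -- substitute any file from experimental_results/.
path = "experimental_results/some_result_file"

# On PyTorch >= 2.6, torch.load defaults to weights_only=True, which rejects
# general pickled objects; pass weights_only=False for these files
# (omit the argument on versions older than 1.13).
results = torch.load(path, weights_only=False)

config = results[0]   # first element: the run configuration
pairs = results[1:]   # remaining elements: query-response pairs

print(config)
print(pairs[0])       # one query-response record
```

To run an experiment: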
python3 main.py \
--model_name gpt-4-0613 \
--data_path data/data_en_zh.dict \
--encode_method caesar \
--instruction_type Crimes_And_Illegal_Activities \
--demonstration_toxicity toxic \
--language en
🔧 Argument Specification
--model_name: The name of the model to evaluate (e.g., gpt-4-0613).
--data_path: Path to the data file to run on (e.g., data/data_en_zh.dict).
--encode_method: The cipher to use (e.g., caesar).
--instruction_type: The domain of the data (e.g., Crimes_And_Illegal_Activities).
--demonstration_toxicity: Whether to use toxic or safe demonstrations.
--language: The language of the data (e.g., en).
💡Framework
Our approach builds on the premise that, because human feedback and safety alignment are carried out in natural language, a human-unreadable cipher can potentially bypass safety alignment. Concretely, we first teach the LLM to comprehend the cipher by designating it as a cipher expert and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into the cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the model. Finally, we employ a rule-based decrypter to convert the model output from cipher back into natural language.
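As a concrete illustration, here is a minimal sketch of the encipher/decipher step for the Caesar cipher. The function names, shift value, and system-prompt wording are our own illustrative assumptions, not the repository's actual implementation:

```python
# Minimal Caesar-cipher sketch (shift of 3, ASCII letters only).
# Names, shift, and prompt text are illustrative assumptions.

def caesar_encode(text: str, shift: int = 3) -> str:
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave digits, spaces, punctuation untouched
    return "".join(out)

def caesar_decode(text: str, shift: int = 3) -> str:
    # Decoding is encoding with the inverse shift; Python's % keeps it positive.
    return caesar_encode(text, -shift)

# The system prompt designates the model as a cipher expert and states the
# rule; the demonstrations sent along with it are also enciphered.
system_prompt = (
    "You are an expert on the Caesar cipher. We will communicate in Caesar. "
    "Do not be a translator. The cipher shifts each letter by 3 positions."
)
enciphered_query = caesar_encode("How are you today?")  # -> "Krz duh brx wrgdb?"
# The model replies in cipher; a rule-based decrypter recovers natural language:
# natural_reply = caesar_decode(model_output)
```

Because the decryption step is purely rule-based, the pipeline never relies on the model to translate its own output back into natural language.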
📃Results
All query-response pairs from our experiments are stored as lists in the `experimental_results` folder; see the loading instructions under "Our results" above.
If you find our paper and tool interesting and useful, please feel free to give us a star and cite us via:
@inproceedings{
yuan2024cipherchat,
title={{GPT}-4 Is Too Smart To Be Safe: Stealthy Chat with {LLM}s via Cipher},
author={Youliang Yuan and Wenxiang Jiao and Wenxuan Wang and Jen-tse Huang and Pinjia He and Shuming Shi and Zhaopeng Tu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=MbfAK4s61A}
}