This repository contains the official implementation of our paper *Bootstrapping Language Models with DPO Implicit Rewards*. We show that the implicit reward model obtained from prior DPO training can be used to bootstrap and further align LLMs.
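As background, DPO induces an implicit reward of the form r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)), which this repo uses to score and rank new responses. The sketch below illustrates that computation on toy numbers; the log-probabilities, β value, and helper name are illustrative, not taken from this codebase.

```python
def dpo_implicit_reward(policy_logprob: float, ref_logprob: float, beta: float = 0.1) -> float:
    """DPO implicit reward: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).

    `policy_logprob` and `ref_logprob` are the sequence log-probabilities of
    response y under the DPO-tuned policy and the reference model.
    """
    return beta * (policy_logprob - ref_logprob)

# Toy example: rank two candidate responses by their implicit rewards to
# form a new preference pair for the next DPO round (the bootstrapping idea).
candidates = {"a": (-12.0, -15.0), "b": (-14.0, -13.5)}  # (policy, ref) logprobs
rewards = {k: dpo_implicit_reward(p, r) for k, (p, r) in candidates.items()}
chosen = max(rewards, key=rewards.get)    # higher implicit reward -> "chosen"
rejected = min(rewards, key=rewards.get)  # lower implicit reward -> "rejected"
```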
Set `DICE_DIR` to the local path of this repo in the following two files:

- `scripts/run_dice/iter.sh`
- `scripts/run_dice/pipeline.sh`

E.g., `DICE_DIR="/home/username/dice"`.
## Training scripts
We provide sample training scripts for both the Llama3 and Zephyr settings. We recommend running them on 8x A100 GPUs; for other hardware environments, you may need to adjust the scripts.
### Llama3

```bash
bash scripts/run_dice/iter.sh llama3
```

### Zephyr

```bash
bash scripts/run_dice/iter.sh zephyr
```
## Acknowledgement
This repo is built on LLaMA-Factory. Thanks for the amazing work!
## Citation
Please consider citing our paper if you find the repo helpful in your work:
```bibtex
@inproceedings{chen2025bootstrapping,
  title={Bootstrapping Language Models with DPO Implicit Rewards},
  author={Chen, Changyu and Liu, Zichen and Du, Chao and Pang, Tianyu and Liu, Qian and Sinha, Arunesh and Varakantham, Pradeep and Lin, Min},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}
```