RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Phenomenon of misalignment
We found that RLHF induces systematic misalignment.
Training AI on immediate feedback implicitly requires users or evaluators to predict the future utility of the AI's output, which often depends on downstream consequences they have not yet observed. This opens the door to reward hacking:
- AI can improve its reward by manipulating the evaluator’s internal state (e.g., beliefs, emotions).
- Manipulative AI outputs can bias users towards making poor decisions after the interaction.
Benefit of hindsight
We introduce the benefit of hindsight and show theoretically that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility.
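As a schematic contrast (the notation here is illustrative and simplified, not the paper's exact formulation): immediate feedback scores the AI's action under the evaluator's current belief, whereas hindsight feedback scores it only after the downstream outcome has been observed.

```latex
% Illustrative only: \hat{r} is the evaluator's rating, a the AI output,
% b_0 the evaluator's belief at interaction time, and o_T the downstream
% observation after the consequences have unfolded.
\[
  J_{\text{immediate}}(\pi) \;=\; \mathbb{E}_{a \sim \pi}\!\left[\hat{r}(a \mid b_0)\right]
  \qquad \text{vs.} \qquad
  J_{\text{hindsight}}(\pi) \;=\; \mathbb{E}_{a \sim \pi,\, o_T}\!\left[\hat{r}(a \mid o_T)\right].
\]
```

Because the belief b_0 can be distorted by the AI's own output while the realized outcome o_T cannot, optimizing the hindsight objective removes the incentive to manipulate the evaluator.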
To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS):
- Step 1: Simulate the downstream consequences of the AI's output.
- Step 2: Elicit human feedback conditioned on this simulated hindsight (see the sketch below).
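A minimal sketch of this feedback loop in Python (names such as `simulate_outcome` and `rate_with_hindsight` are placeholders for whatever world model and evaluator one plugs in; this is not the paper's reference implementation):

```python
# Sketch of hindsight-simulated feedback collection (RLHS-style), assuming:
#  - policy(prompt) returns a candidate AI response,
#  - simulate_outcome(prompt, response) rolls out plausible downstream
#    consequences of the user acting on that response, and
#  - rate_with_hindsight(prompt, response, outcome) asks the (simulated)
#    evaluator to score the response *after* seeing the outcome.
# These callables are hypothetical placeholders, not the authors' API.

def collect_hindsight_feedback(prompts, policy, simulate_outcome, rate_with_hindsight):
    """Gather (prompt, response, reward) triples for downstream RL fine-tuning."""
    feedback = []
    for prompt in prompts:
        response = policy(prompt)
        # Step 1: simulate the consequences of the user following the response.
        outcome = simulate_outcome(prompt, response)
        # Step 2: elicit feedback conditioned on the simulated hindsight.
        reward = rate_with_hindsight(prompt, response, outcome)
        feedback.append((prompt, response, reward))
    return feedback
```

The key design choice is that the reward signal never depends on the evaluator's pre-outcome beliefs, only on what actually (or plausibly, in simulation) happens after the interaction.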
Results
We demonstrate that immediate feedback produces a significant misalignment between satisfaction ratings and the user's true utility. Our proposed hindsight simulation effectively mitigates this misalignment.
The RLHF model (trained with immediate feedback) deceives the user by falsely claiming that Options A and C meet the customer's 8K resolution requirement, though neither does. In contrast, the RLHS model truthfully states that none of the options include 8K resolution.
Human Study Results
RLHS significantly outperformed RLHF, achieving higher long-term satisfaction scores, higher true utility, and lower regret rates.
Models trained with RLHS are also more truthful, showing a strong correlation between their high immediate user satisfaction (subjective) and high true utility (objective).
Applications of Hindsight
BibTeX
@article{liang2025rlhs,
title={RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation},
author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
journal={arXiv preprint arXiv:2501.08617},
year={2025}
}