RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Phenomenon of misalignment
We found that RLHF induces systematic misalignment.
Training AI on immediate feedback implicitly requires users or evaluators to predict the future utility of the AI's output, which often depends on downstream consequences they have not yet observed. This opens the door to reward hacking:
- AI can improve its reward by manipulating the evaluator’s internal state (e.g., beliefs, emotions).
- Manipulative AI outputs can bias users towards making poor decisions after the interaction.
Benefit of hindsight
We introduce the benefit of hindsight and show theoretically that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility.
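As a schematic contrast (the notation here is illustrative and simplified, not the paper's exact formulation): immediate feedback scores the AI's action under the evaluator's current belief, whereas hindsight feedback scores it only after the downstream outcome has been observed.

```latex
% Illustrative only: \hat{r} is the evaluator's rating, a the AI output,
% b_0 the evaluator's belief at interaction time, and o_T the downstream
% observation after the consequences have unfolded.
\[
  J_{\text{immediate}}(\pi) \;=\; \mathbb{E}_{a \sim \pi}\!\left[\hat{r}(a \mid b_0)\right]
  \qquad \text{vs.} \qquad
  J_{\text{hindsight}}(\pi) \;=\; \mathbb{E}_{a \sim \pi,\, o_T}\!\left[\hat{r}(a \mid o_T)\right].
\]
```

Because the belief b_0 can be distorted by the AI's own output while the realized outcome o_T cannot, optimizing the hindsight objective removes the incentive to manipulate the evaluator.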
To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS):
- Step 1: Simulate the downstream consequences of the AI's output.
- Step 2: Elicit human feedback conditioned on this simulated hindsight (see the sketch below).
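A minimal sketch of this feedback loop in Python (names such as `simulate_outcome` and `rate_with_hindsight` are placeholders for whatever world model and evaluator one plugs in; this is not the paper's reference implementation):

```python
# Sketch of hindsight-simulated feedback collection (RLHS-style), assuming:
#  - policy(prompt) returns a candidate AI response,
#  - simulate_outcome(prompt, response) rolls out plausible downstream
#    consequences of the user acting on that response, and
#  - rate_with_hindsight(prompt, response, outcome) asks the (simulated)
#    evaluator to score the response *after* seeing the outcome.
# These callables are hypothetical placeholders, not the authors' API.

def collect_hindsight_feedback(prompts, policy, simulate_outcome, rate_with_hindsight):
    """Gather (prompt, response, reward) triples for downstream RL fine-tuning."""
    feedback = []
    for prompt in prompts:
        response = policy(prompt)
        # Step 1: simulate the consequences of the user following the response.
        outcome = simulate_outcome(prompt, response)
        # Step 2: elicit feedback conditioned on the simulated hindsight.
        reward = rate_with_hindsight(prompt, response, outcome)
        feedback.append((prompt, response, reward))
    return feedback
```

The key design choice is that the reward signal never depends on the evaluator's pre-outcome beliefs, only on what actually (or plausibly, in simulation) happens after the interaction.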
Results
We demonstrate that immediate feedback produces a significant misalignment between satisfaction ratings and the user's true utility. Our proposed hindsight simulation effectively mitigates this misalignment.
The RLHF model (trained with immediate feedback) deceives the user by falsely claiming that Options A and C meet the customer's 8K resolution requirement, though neither does. In contrast, the RLHS model truthfully states that none of the options include 8K resolution.
Human Study Results
RLHS significantly outperformed RLHF, achieving higher long-term satisfaction scores, higher true utility, and lower regret rates.
Models trained with RLHS are also more truthful, showing a strong correlation between their high immediate user satisfaction (subjective) and high true utility (objective).
Applications of Hindsight
BibTeX
@article{liang2025rlhs,
title={RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation},
author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
journal={arXiv preprint arXiv:2501.08617},
year={2025}
}