RLVF: Learning from Verbal Feedback without Overgeneralization
Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S. Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
Abstract
Large language models (LLMs) are increasingly deployed across many industries and user groups, necessitating the ability to align them with specific use cases and user preferences. Standard adaptation methods, such as reinforcement learning from human feedback, require extensive manual annotations. Alternatively, prompting-based approaches to incorporating verbal feedback are efficient but struggle to appropriately incorporate nuanced, context-dependent user preferences, often overgeneralizing the feedback to contexts where it should not apply. We study whether it is possible to adapt language models using verbal feedback without such overgeneralization. To this end, we propose Contextualized Critiques with Constrained Preference Optimization (C3PO): we first introduce a scheme for synthetically generating preference data that is both relevant and irrelevant to the provided feedback, then fine-tune the language model on the synthetic preference data while minimizing the divergence from the original model on out-of-scope prompts. Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors in irrelevant contexts. Across many examples of human- and GPT-4-generated feedback, C3PO adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
Method overview
Dataset generation
- Three sub-datasets: D_in-scope to demonstrate the desired change of behavior, D_out-of-scope to maintain behavior outside the scope of the feedback, and D_near-scope to refine the model's understanding of when to apply the feedback
- For generation, GPT-4 is first used to generate K categories of prompts the feedback could apply to, and then specific prompts that the feedback applies to (in-scope) or that are superficially related to the feedback but where the model's behavior should not change (near-scope). For out-of-scope, a fixed set of feedback-independent prompts is used (see the sketch after this list).
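The prompt-generation step can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `query_llm` is a hypothetical callable that sends a single prompt to GPT-4 (or any LLM) and returns its text response, and the exact prompt templates are assumptions based on the description above.

```python
from typing import Callable, List, Tuple


def generate_scoped_prompts(
    feedback: str,
    query_llm: Callable[[str], str],
    k_categories: int = 5,
    prompts_per_category: int = 4,
) -> Tuple[List[str], List[str]]:
    """Generate in-scope and near-scope prompts for one piece of verbal feedback."""

    def to_lines(text: str) -> List[str]:
        return [line.strip() for line in text.splitlines() if line.strip()]

    # Step 1: ask the LLM for K categories of prompts the feedback could apply to.
    categories = to_lines(query_llm(
        f"List {k_categories} categories of user prompts to which the following "
        f"feedback could apply, one per line:\n{feedback}"
    ))

    in_scope, near_scope = [], []
    for category in categories:
        # Step 2a: prompts where the feedback should change the model's behavior.
        in_scope += to_lines(query_llm(
            f"Write {prompts_per_category} user prompts in the category '{category}' "
            f"to which this feedback applies, one per line:\n{feedback}"
        ))
        # Step 2b: prompts that look related but where behavior should NOT change.
        near_scope += to_lines(query_llm(
            f"Write {prompts_per_category} user prompts in the category '{category}' "
            f"that are superficially related to this feedback but where it should not "
            f"be applied, one per line:\n{feedback}"
        ))

    # Out-of-scope prompts are taken from a fixed, feedback-independent prompt set.
    return in_scope, near_scope
```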
Model training
C3PO encourages feedback adherence on relevant prompts by fine-tuning with DPO on the generated in-scope data, while minimizing overgeneralization through SCD losses on the generated out-of-scope and near-scope data, which regularize the model's behavior towards the original model on feedback-irrelevant prompts.
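As a concrete illustration, below is a minimal PyTorch sketch of such a combined objective (not the authors' code). It assumes the in-scope term is a standard DPO loss over (chosen, rejected) completion pairs and that the out-of-scope/near-scope terms maximize the likelihood of the original model's own completions, one way to regularize behavior toward the original model; the function names, λ weights, and the use of precomputed per-sequence log-probabilities are illustrative assumptions.

```python
# Minimal sketch of a C3PO-style training objective (illustrative, not the authors' code).
# All inputs are summed per-sequence log-probabilities with shape (batch,).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss on in-scope preference pairs."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()


def c3po_style_loss(in_scope, out_of_scope_logp, near_scope_logp,
                    beta=0.1, lam_out=1.0, lam_near=1.0):
    """Combine DPO on in-scope data with likelihood regularizers on
    out-of-scope and near-scope completions sampled from the original model.

    in_scope: dict of per-sequence log-probs for chosen/rejected completions
              under the fine-tuned policy and the frozen reference model.
    *_logp:   policy log-probs of the original model's completions.
    """
    l_dpo = dpo_loss(in_scope["policy_chosen"], in_scope["policy_rejected"],
                     in_scope["ref_chosen"], in_scope["ref_rejected"], beta)
    l_out = -out_of_scope_logp.mean()   # keep behavior on unrelated prompts
    l_near = -near_scope_logp.mean()    # keep behavior on superficially similar prompts
    return l_dpo + lam_out * l_out + lam_near * l_near
```

In practice, the log-probabilities would be computed by the fine-tuned policy and a frozen copy of the original model over the three generated sub-datasets.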
Results
BibTeX
@misc{stephan2024rlvf,
      title={RLVF: Learning from Verbal Feedback without Overgeneralization},
      author={Moritz Stephan and Alexander Khazatsky and Eric Mitchell and Annie S Chen and Sheryl Hsu and Archit Sharma and Chelsea Finn},
      year={2024},
      eprint={2402.10893},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}