Each example in the JSON file looks like this (example from WMT):
```json
{
    "query": "这一切,身在海外的华人华侨感受更为深刻。",
    "retrieval": "身在上海,是一种亲历才懂的情感。",
    "query_response_k": "All this, the overseas Chinese living overseas feel more deeply.",
    "query_response_j": "All of this, the overseas Chinese feel even more deeply.",
    "retrieval_response_k": "Being in Shanghai is a kind of emotion that you know.",
    "retrieval_response_j": "Being in Shanghai is a kind of emotion that can only be understood through experience."
}
```
`query` and `retrieval` are the inputs to the two related instructions. `*_response_k` is the human-preferred response for `query` and `retrieval`, respectively. `*_response_j` is a less-preferred response that is NOT used in our reward consistency metrics.
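To make the intended usage concrete, below is a minimal sketch of a consistency check over such examples. It assumes your reward model exposes a scalar scoring function `reward(instruction, response) -> float` and that each released file is a JSON list of objects like the one above; all names here are illustrative, not part of the released code.

```python
import json

def is_consistent(example: dict, reward) -> bool:
    """Return True if the reward model ranks each instruction's own preferred
    response above the other instruction's preferred response.

    `reward(instruction, response) -> float` is a hypothetical scoring
    function standing in for your trained reward model.
    """
    q, r = example["query"], example["retrieval"]
    q_resp = example["query_response_k"]
    r_resp = example["retrieval_response_k"]
    # Only the preferred (*_response_k) responses are used; *_response_j
    # plays no role in the consistency check, per the description above.
    return (reward(q, q_resp) > reward(q, r_resp)
            and reward(r, r_resp) > reward(r, q_resp))

def consistency_rate(path: str, reward) -> float:
    """Fraction of examples in a JSON file that pass the check.

    Assumes the file is a single JSON list of example objects.
    """
    with open(path, encoding="utf-8") as f:
        examples = json.load(f)
    return sum(is_consistent(ex, reward) for ex in examples) / len(examples)
```

In this sketch, an example counts as consistent when the matched (instruction, response) pair outscores the mismatched one on both sides; `reward` would wrap your own model's scoring call.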
Work in progress: we are still cleaning and organizing the code for release. Please reach out to lshen30[at]jhu.edu and sihaoc[at]cis.upenn.edu with questions.
## Citation
```bibtex
@article{shen2023trickle,
    title={The Trickle-down Impact of Reward (In-)consistency on RLHF},
    author={Lingfeng Shen and Sihao Chen and Linfeng Song and Lifeng Jin and Baolin Peng and Haitao Mi and Daniel Khashabi and Dong Yu},
    year={2023},
    journal={arXiv preprint arXiv:2309.16155},
    url={https://arxiv.org/pdf/2309.16155.pdf}
}
```