❌ Traditional reward models = Slow 🚶
🔹They score entire responses post-generation 📜
🔹LLMs must generate fully before evaluation ⏳
✅ GenARM = Fast 🏎️
🔹 Predicts next-token rewards on the fly ⚡
🔹 Guides LLMs token by token—drastically improving efficiency! 💡
GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-Time Alignment
Yuancheng Xu1, Udari Madhushani Sehwag2, Alec Koppel2, Sicheng Zhu1, Bang An1, Furong Huang1, Sumitra Ganesh2
ICLR, 2025
Abstract
Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences.
Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated
training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining.
However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive
text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment
approach that leverages the Autoregressive Reward Model--a novel reward parametrization designed to predict next-token rewards
for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution
achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time
alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with
smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference
dimensions and catering to diverse user preferences without retraining.
TL;DR: GenARM uses an autoregressive reward model to efficiently guide a base LLM for test-time alignment, outperforming prior methods and enabling weak-to-strong guidance and multi-objective alignment.
Why Do We Need an Autoregressive Reward Model?
What’s an Autoregressive Reward Model?
Unlike conventional trajectory-level reward models, GenARM parametrizes rewards at the token level:
🔹 Rewards decompose naturally as a sum of per-token log probabilities (see the formula sketch below) 🔄
🔹 Each token selection is guided dynamically 🎯
Figure: Parametrization of the Autoregressive Reward Model.
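In symbols, the parametrization above can be written as a sum of per-token log-probabilities under the autoregressive reward model (a minimal LaTeX sketch of the idea described on this page; the symbol \pi_r for the reward model is my notation, and any scaling constants used in the paper are omitted):

\[
  r(x, y) \;=\; \sum_{t=1}^{|y|} \log \pi_r\!\left(y_t \mid x,\, y_{<t}\right)
\]

Because the trajectory reward is a sum over tokens, the model can produce a next-token reward \log \pi_r(y_t \mid x, y_{<t}) at every decoding step, which is exactly what token-by-token guidance needs.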
How Does GenARM Work?
Training Phase:
✅ Learns next-token rewards from trajectory-level preference data 📊
✅ Ensures that preferred responses accumulate higher total rewards (objective sketched below) 💯
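As a rough sketch, this training step can be read as a Bradley-Terry-style preference objective applied to the summed token-level rewards (my notation: y^w is the preferred and y^l the dispreferred response in a preference pair, \sigma the sigmoid; the paper's exact loss and scaling may differ):

\[
  \mathcal{L}(\pi_r) \;=\; -\,\mathbb{E}_{(x,\, y^w,\, y^l)}
  \Big[ \log \sigma\big( r(x, y^w) - r(x, y^l) \big) \Big],
  \qquad r(x, y) = \sum_{t} \log \pi_r\!\left(y_t \mid x,\, y_{<t}\right)
\]

Minimizing this loss pushes the summed log-probabilities of preferred responses above those of dispreferred ones, which is the "higher total rewards" condition stated above.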
🚀 Inference Phase:
✅ Combines LLM logits + next-token rewards to dynamically guide generation (see the decoding sketch below) 🔄
💡 No model retraining. Just plug, play, and align!
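A minimal Python sketch of one guided decoding step, assuming you already have next-token logits from the frozen base LLM and next-token log-probabilities from the autoregressive RM (the tensor names, the beta hyperparameter, and this softmax formulation are illustrative assumptions, not the authors' exact implementation):

import torch

def guided_next_token_distribution(base_logits: torch.Tensor,
                                    reward_log_probs: torch.Tensor,
                                    beta: float = 1.0) -> torch.Tensor:
    """Combine frozen-LLM logits with next-token rewards from the autoregressive RM.

    base_logits:      [vocab_size] next-token logits of the frozen base LLM.
    reward_log_probs: [vocab_size] next-token log-probabilities of the autoregressive
                      reward model, read as next-token rewards.
    beta:             guidance strength (here, larger beta means weaker reward guidance).
    """
    # Shift the base logits by the reward-model scores; this corresponds to sampling
    # from a distribution proportional to pi_base * pi_r^(1/beta).
    guided_logits = base_logits + (1.0 / beta) * reward_log_probs
    # Normalize into a proper next-token distribution.
    return torch.softmax(guided_logits, dim=-1)

# Toy usage with a vocabulary of 5 tokens:
base_logits = torch.randn(5)
reward_log_probs = torch.log_softmax(torch.randn(5), dim=-1)
probs = guided_next_token_distribution(base_logits, reward_log_probs, beta=1.0)
next_token = torch.multinomial(probs, num_samples=1)

The base LLM is never updated: the reward model only nudges each next-token distribution, which is why no retraining is needed.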
How Well Does It Perform?
🔥 Fastest test-time alignment method—significantly outperforms baselines!
🔥 Achieves 90% of fine-tuned performance—without retraining!
🔥 Weak-to-strong guidance: Uses a 7B RM to align a 70B LLM, saving HUGE compute costs!
💡 More power, less compute! 🏆
BibTeX
@inproceedings{xu2025genarm,
  title={GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-Time Alignment},
  author={Xu, Yuancheng and Sehwag, Udari Madhushani and Koppel, Alec and Zhu, Sicheng and An, Bang and Huang, Furong and Ganesh, Sumitra},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}