RENT: Reinforcement Learning via Entropy Minimization
RENT is an unsupervised method for training reasoning LLMs by minimizing entropy. We demonstrate on a variety of datasets and models that RENT improves model performance without using any ground truth labels!
RENT is introduced in our paper "Maximizing Confidence Alone Improves Reasoning" (arXiv:2505.22660).
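To build intuition for the reward, here is a minimal sketch of an entropy-based reward in the spirit of RENT. It is an illustrative assumption rather than the exact implementation from the paper: the function name is made up, and it scores a response by the negative mean per-token entropy of the model's token distributions, so more confident (lower-entropy) generations receive higher reward.

```python
import torch
import torch.nn.functional as F

def entropy_reward(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: negative mean per-token entropy over the response.

    logits: (batch, seq_len, vocab) model logits at each generated position.
    response_mask: (batch, seq_len) with 1 for response tokens, 0 for prompt/padding.
    """
    mask = response_mask.to(logits.dtype)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of the next-token distribution at each position.
    token_entropy = -(probs * log_probs).sum(dim=-1)            # (batch, seq_len)
    # Average entropy over the response tokens only.
    mean_entropy = (token_entropy * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    # Reward is negative entropy: minimizing entropy = maximizing confidence.
    return -mean_entropy
```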
Adjust the configuration in ppo_trainer.yaml to match your desired training setup (number of GPUs, batch size, etc.). To override these defaults for a specific run, see "Creating Custom Configurations" below.
python -m verl.trainer.main_ppo exps="[grpo, entropy, format, sampleval, aime]" base_model=Qwen/Qwen2.5-7B-Instruct
Running on Custom Datasets
See verl's documentation on preparing data and implementing custom reward functions; a minimal sketch of such a reward function follows.
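As a rough starting point, a custom reward in verl is a compute_score-style function. The signature below follows our understanding of verl's custom_reward_function interface, and the scoring heuristic is purely a placeholder:

```python
# Illustrative sketch of a verl-style custom reward function.
# The signature mirrors verl's custom_reward_function interface as we
# understand it; the scoring logic is a placeholder assumption.
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Example heuristic: reward responses that produce a boxed final answer.
    # Note that an unsupervised reward like RENT's would ignore ground_truth.
    return 1.0 if "\\boxed{" in solution_str else 0.0
```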
Creating Custom Configurations
We use an extensible config setup that lets you override the default configuration for specific tasks or jobs.
To define a custom configuration, create a new yaml file in verl/trainer/config/exps. NOTE: the file MUST begin with # @package _global_ in order to override other configs.
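For example, a hypothetical verl/trainer/config/exps/my_task.yaml might look like the following; the keys shown are illustrative overrides, not a complete or required set:

```yaml
# @package _global_
# Hypothetical exps config; the keys below are illustrative overrides.
data:
  train_batch_size: 64
trainer:
  n_gpus_per_node: 4
```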
To use different configuration files, simply add them to the exps="[...]" argument of verl.trainer.main_ppo. Note: configurations are applied in left-to-right order, so configs to the right override configs to the left, as in the example below.
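For example, appending a hypothetical my_task config at the end ensures its settings take precedence over the earlier configs:

python -m verl.trainer.main_ppo exps="[grpo, entropy, format, sampleval, my_task]" base_model=Qwen/Qwen2.5-7B-Instruct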
Citation
@article{prabhudesai2025rent,
  title={Maximizing Confidence Alone Improves Reasoning},
  author={Prabhudesai, Mihir and Chen, Lili and Ippoliti, Alex and Fragkiadaki, Katerina and Liu, Hao and Pathak, Deepak},
  journal={arXiv preprint arXiv:2505.22660},
  year={2025}
}