RENT: Reinforcement Learning via Entropy Minimization
RENT is an unsupervised method for training reasoning LLMs by minimizing entropy. We demonstrate on a variety of datasets and models that RENT improves model performance without using any ground truth labels!
RENT is introduced in our paper "Maximizing Confidence Alone Improves Reasoning" (arXiv:2505.22660).
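To build intuition for the reward, here is a minimal sketch of an entropy-based reward in the spirit of RENT. It is an illustrative assumption rather than the exact implementation from the paper: the function name is made up, and it scores a response by the negative mean per-token entropy of the model's token distributions, so more confident (lower-entropy) generations receive higher reward.

```python
import torch
import torch.nn.functional as F

def entropy_reward(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: negative mean per-token entropy over the response.

    logits: (batch, seq_len, vocab) model logits at each generated position.
    response_mask: (batch, seq_len) with 1 for response tokens, 0 for prompt/padding.
    """
    mask = response_mask.to(logits.dtype)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of the next-token distribution at each position.
    token_entropy = -(probs * log_probs).sum(dim=-1)            # (batch, seq_len)
    # Average entropy over the response tokens only.
    mean_entropy = (token_entropy * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    # Reward is negative entropy: minimizing entropy = maximizing confidence.
    return -mean_entropy
```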
Adjust the configuration in ppo_trainer.yaml to match your desired training setup (number of GPUs, batch size, etc.). To override these defaults for a specific run, see "Creating Custom Configurations" below.
python -m verl.trainer.main_ppo exps="[grpo, entropy, format, sampleval, aime]" base_model=Qwen/Qwen2.5-7B-Instruct
Running on Custom Datasets
See verl's documentation on preparing data and implementing custom reward functions; a minimal sketch of such a reward function follows.
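As a rough starting point, a custom reward in verl is a compute_score-style function. The signature below follows our understanding of verl's custom_reward_function interface, and the scoring heuristic is purely a placeholder:

```python
# Illustrative sketch of a verl-style custom reward function.
# The signature mirrors verl's custom_reward_function interface as we
# understand it; the scoring logic is a placeholder assumption.
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Example heuristic: reward responses that produce a boxed final answer.
    # Note that an unsupervised reward like RENT's would ignore ground_truth.
    return 1.0 if "\\boxed{" in solution_str else 0.0
```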
Creating Custom Configurations
We use an extensible config setup that lets you override the default configuration for specific tasks or jobs.
To define a custom configuration, create a new yaml file in verl/trainer/config/exps. NOTE: the file MUST begin with # @package _global_ in order to override other configs.
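For example, a hypothetical verl/trainer/config/exps/my_task.yaml might look like the following; the keys shown are illustrative overrides, not a complete or required set:

```yaml
# @package _global_
# Hypothetical exps config; the keys below are illustrative overrides.
data:
  train_batch_size: 64
trainer:
  n_gpus_per_node: 4
```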
To use different configuration files, simply add them to the exps="[...]" argument of verl.trainer.main_ppo. Note: configurations are applied in left-to-right order, so configs to the right override configs to the left, as in the example below.
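For example, appending a hypothetical my_task config at the end ensures its settings take precedence over the earlier configs:

python -m verl.trainer.main_ppo exps="[grpo, entropy, format, sampleval, my_task]" base_model=Qwen/Qwen2.5-7B-Instruct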
Citation
@article{prabhudesai2025rent,
  title={Maximizing Confidence Alone Improves Reasoning},
  author={Prabhudesai, Mihir and Chen, Lili and Ippoliti, Alex and Fragkiadaki, Katerina and Liu, Hao and Pathak, Deepak},
  journal={arXiv preprint arXiv:2505.22660},
  year={2025}
}