We're excited to present our latest research contribution: Exploratory Annealed Decoding (EAD) for Verifiable Reinforcement Learning!
Our codebase is built on the verl RL framework and the vLLM inference engine. We have done our best to preserve the commit history so everyone can easily find what we updated. We are working to integrate our method into both upstreams. Stay tuned!
If you use this code as part of any published research, please acknowledge the following paper:
@article{yang2025ead,
title={Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning},
author={Yang, Chenghao and Gui, Lin and Yang, Chenxiao and Veitch, Victor and Zhang, Lizhu and Zhao, Zhuokai},
journal={arXiv preprint arXiv:2510.05251},
year={2025}
}

EAD is a simple yet effective exploration strategy for Reinforcement Learning with Verifiable Rewards (RLVR) that addresses a fundamental challenge: achieving effective exploration while preserving sample quality and ensuring training stability.
Core Insight: Exploration is not equally valuable at every step. Early tokens shape a sequence's semantic direction, making early exploration crucial for discovering diverse valid solutions. Later tokens fill in details where excessive exploration can harm coherence.
Our Strategy: Explore at the beginning, exploit at the end
EAD implements an intuitive temperature annealing schedule that:
- Starts with high temperature (τ > 1) to encourage diverse exploration of solution paths
- Gradually cools to lower temperatures to ensure coherent, high-quality completions
- Maintains proximity to the target policy for stable off-policy learning
EAD uses a dynamic temperature schedule that starts high and gradually decreases:
$$\tau_t = \tau_\mathrm{min} + (\tau_\mathrm{max} - \tau_\mathrm{min})\, e^{-t/d}$$

Where:
- $\tau_t$ is the temperature at token position $t$
- $\tau_\mathrm{max} > 1$ is the maximum temperature for exploration
- $\tau_\mathrm{min}$ is the minimum temperature for exploitation
- $d$ is the decay rate controlling annealing speed
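As a concrete illustration, below is a minimal Python sketch of the schedule together with a toy decoding loop, assuming the exponential form above. The function names, default values (`tau_max=1.5`, `tau_min=0.7`, `d=100`), and the `step_logits` callable are illustrative assumptions, not the repository's API; see `recipe/ead/` for the actual implementation.

```python
import math
import torch

def annealed_temperature(t: int, tau_max: float = 1.5,
                         tau_min: float = 0.7, d: float = 100.0) -> float:
    """Temperature at token position t: starts near tau_max, decays toward
    tau_min; a larger d means slower cooling (more tokens explored)."""
    return tau_min + (tau_max - tau_min) * math.exp(-t / d)

def sample_with_ead(step_logits, max_new_tokens: int = 512, **schedule_kwargs):
    """Token-by-token sampling with a per-position temperature.
    `step_logits(tokens)` is a hypothetical callable returning next-token
    logits (a 1-D tensor over the vocabulary) given the tokens so far.
    EOS handling is omitted for brevity."""
    tokens = []
    for t in range(max_new_tokens):
        logits = step_logits(tokens)                        # shape: [vocab_size]
        tau_t = annealed_temperature(t, **schedule_kwargs)  # anneal per position
        probs = torch.softmax(logits / tau_t, dim=-1)       # temperature-scaled softmax
        next_token = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_token)
    return tokens
```

Note that most serving APIs expose a single per-request temperature, so a per-position schedule like this has to hook into the decoding loop itself, which is presumably why the rollout path is modified in this repository rather than changed via sampling config alone.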
The decay rate is made global-step-aware so that the annealing window adapts to the increasing response lengths observed over training.
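The exact step-aware rule is not spelled out here; purely as an illustration, one way to make $d$ track response length is to tie it to a running average of rollout lengths at each global step. The names and the `length_fraction` heuristic below are assumptions, not the paper's formula.

```python
def step_aware_decay_rate(avg_response_len: float,
                          length_fraction: float = 0.25,
                          d_min: float = 16.0) -> float:
    """Hypothetical rule: anneal over roughly the first `length_fraction`
    of a typical response at the current global step, with a floor of d_min."""
    return max(d_min, length_fraction * avg_response_len)

# Illustrative usage: recompute d once per global step from the latest rollouts.
# d = step_aware_decay_rate(sum(map(len, rollouts)) / len(rollouts))
```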
Key features:
- Plug-and-Play Enhancement: Improves sample efficiency over fixed-temperature sampling
- Broad Compatibility: Works with various RLVR algorithms (GRPO, DAPO, EntropyMech)
- Sample Efficient: Achieves strong results with fewer rollouts
- Inference-Time Benefits: Also improves generation quality at test time
- Mitigates Entropy Collapse: Helps escape local optima during training plateaus
Note: For the best viewing experience, please visit our research website to see all figures in high resolution.
Figure 1: Annealing Schedule
- Shows how different decay rates d affect the temperature schedule
- A larger d slows the cooling, front-loading exploration over more tokens
Figure 2: Performance Results
- Pass@16 and Worst@16 evaluation in RL training
- EAD significantly improves exploration of high-quality samples
Figure 3: Entropy Dynamics
- EAD mitigates entropy collapse by maintaining exploration throughout training
- Helps escape local optima during plateau stages
Figure 4: Algorithm Compatibility
- EAD works with various RL algorithms (GRPO, EntropyMech)
- Consistently outperforms fixed-temperature sampling
Check out the implementation in recipe/ead/ and explore the research website for detailed results and visualizations.
Quick Start with EAD:
# Follow Minimal-RL to prepare the dataset
# Then run EAD with annealed sampling
cd recipe/ead
bash run_annealed_sampling.sh