RAGEN
Training Agents by Reinforcing Reasoning
LLM Agents + Multi-turn Reinforcement Learning: training LLM reasoning agents in interactive, stochastic environments.
Announcing VAGEN for VLM Agents
Comparison between RAGEN and existing LLM training methods.
Zihan Wang*1, Kangrui Wang*1, Qineng Wang*1, Pingyue Zhang*1, Linjie Li*2
Zhengyuan Yang4, Xing Jin6, Kefan Yu1, Minh Nhat Nguyen7, Licheng Liu1, Eli Gottlieb1,
Yiping Lu1, Kyunghyun Cho5, Jiajun Wu3, Li Fei-Fei3, Lijuan Wang4, Yejin Choi3, Manling Li1
* Equal Contribution
1 Northwestern University
2 University of Washington
3 Stanford University
4 Microsoft
5 New York University
6 University of British Columbia
7 Singapore Management University
StarPO (State-Thinking-Action-Reward Policy Optimization)
The StarPO (State-Thinking-Action-Reward Policy Optimization) framework with two interleaved stages: the rollout stage and the update stage.
MDP Formulation
We formulate agent-environment interactions as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics.
At time \(t\), state \(s_t\) transitions to the next state through action \(a_t\) following a transition function \(P(s_{t+1} | s_t, a_t)\). The policy \(\pi(a_t | s_t)\) generates actions given the trajectory history. The objective is to maximize expected cumulative rewards \(\mathbb{E}_\pi[\sum_t \gamma^t r_t]\) across multiple interaction turns.
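To make this concrete, here is a minimal Python sketch of the interaction loop, assuming a hypothetical text-based environment and policy interface (reset, step, and the callable policy are illustrative, not the RAGEN API): states and actions are token strings, and the return is the discounted sum of per-turn rewards.

from dataclasses import dataclass, field

@dataclass
class Turn:
    state: str      # s_t: textual observation given to the LLM
    action: str     # a_t: text generated by the policy pi(a_t | s_t)
    reward: float   # r_t: scalar feedback from the environment

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)

    def discounted_return(self, gamma: float = 0.99) -> float:
        # The objective E_pi[ sum_t gamma^t r_t ] is estimated over many such trajectories.
        return sum((gamma ** t) * turn.reward for t, turn in enumerate(self.turns))

def rollout(env, policy, max_turns: int = 10) -> Trajectory:
    """Run one episode: the policy maps the textual state s_t to a textual action a_t."""
    traj = Trajectory()
    state = env.reset()                              # initial state s_0 as text
    for _ in range(max_turns):
        action = policy(state)                       # sample a_t ~ pi(. | s_t)
        next_state, reward, done = env.step(action)  # s_{t+1} ~ P(. | s_t, a_t)
        traj.turns.append(Turn(state, action, reward))
        state = next_state
        if done:
            break
    return traj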
StarPO: Reinforcing Reasoning via Trajectory-Level Optimization
StarPO is a general RL framework for optimizing entire multi-turn interaction trajectories for LLM agents. The algorithm alternates between two phases:
Rollout Stage: Reasoning-Interaction Trajectories
Given an initial Sokoban puzzle state, the LLM generates multiple solving trajectories. At each step, the model receives the puzzle state and generates a reasoning-guided action to push boxes to goal positions:
<think>I need to push the box ($) to the goal (.) which is directly to the right.</think><ans>Right</ans>
The environment receives the action and returns the next state, with the box pushed onto the goal:
#_@*#
#####
Box pushed to goal position. State updated.
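A hedged sketch of how such a reasoning-guided step could be parsed before being passed to the environment; the tag names follow the example above, while the regular expressions and helper function are illustrative assumptions rather than RAGEN's exact parser:

import re

# Illustrative parser for the <think>...</think><ans>...</ans> response format.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANS_RE = re.compile(r"<ans>(.*?)</ans>", re.DOTALL)

def parse_step(llm_output: str):
    """Split a model response into its reasoning trace and the action to execute."""
    think = THINK_RE.search(llm_output)
    ans = ANS_RE.search(llm_output)
    reasoning = think.group(1).strip() if think else ""
    action = ans.group(1).strip() if ans else ""
    return reasoning, action

# Applied to the Sokoban step shown above:
reasoning, action = parse_step(
    "<think>I need to push the box ($) to the goal (.) which is "
    "directly to the right.</think><ans> Right </ans>"
)
assert action == "Right"  # the environment then executes this action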
Update Stage: Multi-turn Trajectory Optimization
After generating trajectories, we train the LLM to maximize expected reward. Instead of step-by-step optimization, StarPO optimizes entire trajectories using importance sampling. This approach enables long-horizon reasoning while maintaining computational efficiency.
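For illustration, a minimal PyTorch sketch of what a trajectory-level, importance-weighted update can look like: a PPO-style clipped surrogate computed over all action tokens of each trajectory at once. The tensor layout, masking convention, and function name are assumptions for exposition, not the exact StarPO loss.

import torch

def trajectory_ppo_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        mask: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped importance-weighted loss over all tokens of a multi-turn trajectory.

    logp_new, logp_old: [batch, seq_len] per-token log-probs under the current
                        policy and the rollout (behavior) policy.
    advantages:         [batch, seq_len] per-token advantage estimates.
    mask:               [batch, seq_len] 1 for action tokens, 0 for environment
                        and state tokens that should receive no gradient.
    """
    ratio = torch.exp(logp_new - logp_old)                          # importance weights
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = torch.min(unclipped, clipped)                       # clipped surrogate
    # Average over the action tokens of each trajectory, then over the batch.
    per_traj = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_traj.mean()                                         # minimize the negative objective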
StarPO supports multiple optimization strategies, including PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization).
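Their commonly used forms, written here in standard single-turn notation (see the paper for the exact trajectory-level variants StarPO uses), are as follows. PPO maximizes a clipped surrogate over the importance ratio, with advantages \(\hat{A}_t\) estimated by a learned critic:
\[ J_{\mathrm{PPO}}(\theta) = \mathbb{E}\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. \]
GRPO drops the critic and instead normalizes returns within a group of \(G\) rollouts of the same prompt:
\[ \hat{A}_i = \frac{R_i - \mathrm{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{R_j\}_{j=1}^{G}\big)}. \]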
Rollout and update stages interleave in StarPO, enabling both online and offline learning.
Findings
Finding 1: Single-turn RL may not transfer directly to multi-turn agent RL
Vanilla adaptations of single-turn methods such as PPO and GRPO achieve early gains in agent settings but often collapse. The critic in PPO may delay instability but does not prevent reasoning degradation, highlighting the need for specialized stabilization in agent training.
Finding 2: Model collapse in agent RL is reflected as "Echo Trap" over training
We find that early in training agents respond with diverse symbolic reasoning, but they collapse into deterministic, repetitive templates as training proceeds. Models converge to fixed phrasing, indicating that RL may reinforce superficial patterns instead of general reasoning, forming an "Echo Trap" that hinders long-term generalization.
Finding 3: Collapse follows similar dynamics and can be anticipated by indicators
Reward standard deviation and entropy often fluctuate before performance degrades, while gradient norm spikes typically mark the point of irreversible collapse. These metrics provide early indicators and motivate the need for stabilization strategies.
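A minimal sketch of how these signals could be logged per training batch; the quantity names and the entropy proxy are assumptions for illustration, not RAGEN's exact diagnostics:

import numpy as np

def collapse_indicators(rewards, sampled_token_logprobs, grad_norm):
    """Early-warning signals for one batch of rollouts.

    rewards:                per-trajectory returns in the batch.
    sampled_token_logprobs: log-probs of the action tokens actually sampled.
    grad_norm:              global gradient norm of the latest update step.
    """
    return {
        # Shrinking reward std means rollouts are becoming indistinguishable.
        "reward_std": float(np.std(rewards)),
        # Mean negative log-prob of sampled tokens is a simple entropy estimate;
        # a steady drop signals the policy is collapsing onto fixed templates.
        "entropy_proxy": float(-np.mean(sampled_token_logprobs)),
        # Sudden spikes here tend to mark the point of irreversible collapse.
        "grad_norm": float(grad_norm),
    }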
Finding 4: Filtering low-variance trajectories improves stability and efficiency
Training on high-variance prompts delays or eliminates collapse in multi-turn RL. StarPO-S improves performance and reduces update steps by discarding low-information rollouts, especially under PPO. This aligns with active learning principles, where uncertain examples offer the most informative learning signals.
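A minimal sketch of the variance-based filtering idea; the grouping by prompt and the retention ratio are illustrative assumptions, and the exact criterion used by StarPO-S may differ:

import numpy as np

def filter_by_reward_variance(groups, keep_ratio=0.25):
    """Keep only the prompts whose rollout groups have the highest reward variance.

    groups: dict mapping prompt_id -> list of rollout returns for that prompt.
    """
    scored = sorted(groups.items(), key=lambda kv: np.std(kv[1]), reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return [prompt_id for prompt_id, _ in scored[:n_keep]]

# Example: prompts whose rollouts all receive the same reward carry little signal.
groups = {"p1": [0.0, 0.0, 0.0, 0.0], "p2": [1.0, 0.0, 1.0, 0.0], "p3": [1.0, 1.0, 1.0, 1.0]}
print(filter_by_reward_variance(groups, keep_ratio=0.34))  # -> ['p2']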
Finding 5: Task diversity, action budget, and rollout frequency affect data quality
Diverse task instances enable better policy contrast and generalization across environments. Moderate action budgets provide enough planning space and avoid the noise introduced by overly long sequences. Up-to-date rollouts ensure optimization targets remain aligned with current policy behavior.
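These three knobs might surface as rollout configuration; a hypothetical example, with field names and values chosen purely for illustration rather than taken from RAGEN's actual config schema:

# Hypothetical rollout configuration illustrating the three knobs above.
rollout_config = {
    "num_task_instances": 512,      # task diversity: distinct initial states sampled per batch
    "max_actions_per_turn": 5,      # action budget: enough planning room without noisy, overlong sequences
    "rollout_refresh_interval": 1,  # rollout frequency: re-sample with the current policy every update
}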
Finding 6: Reasoning fails to emerge without meticulous reward design
RAGEN Trajectory Examples
Explore agent trajectories across different tasks. View state transitions, LLM-generated actions, and the decision-making process.
Citation
If you find RAGEN useful in your research, we would appreciate it if you consider citing our work:
@misc{ragen,
title={RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning},
author={Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Linjie Li and Zhengyuan Yang and Xing Jin and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
year={2025},
eprint={2504.20073},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.20073},
}