AgentFlow
In-the-Flow Agentic System Optimization
Zhuofeng Li*, Haoxiang Zhang*, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou†, Pan Lu†
* Equal Contribution † Co-senior authors
Introduction
Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction.
We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages.
Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
One case study example. The untrained system initially fails with repetitive errors (left), while AgentFlow trained with Flow-GRPO explores a new solution pathway at turn 4 after two failed attempts (right).
AgentFlow: An In-the-Flow Agentic System
(a) Overview of AgentFlow, a trainable agentic system for in-the-flow planning and tool use. Four modules—planner, executor, verifier, and generator—interact via evolving memory $M$ and toolset $K$, given query $q$. The planner policy is optimized on-policy inside the system's multi-turn loop for adaptive reasoning. (b) A single state transition: $a^t$, $e^t$, and $v^t$ update memory from $M^t$ to $M^{t+1}$.
AgentFlow is a general-purpose tool-integrated agentic framework for solving complex reasoning tasks through fine-grained planning and effective tool use. It comprises four specialized modules—Planner $\mathcal{P}$, Executor $\mathcal{E}$, Verifier $\mathcal{V}$, and Generator $\mathcal{G}$—coordinated by a shared memory $M$ and a toolset $K$. We formalize AgentFlow's problem-solving process as a multi-turn Markov Decision Process (MDP): given query $q$ and toolset $K$, the planner $\mathcal{P}$ (a trainable policy $\pi_\theta$) produces an action $a^t \sim \pi_\theta(a^t \mid q, K, M^t)$ that formulates a sub-goal, selects a tool $k \in K$, and retrieves relevant context from memory $M^t$. The executor $\mathcal{E}$ invokes tools according to $a^t$, yielding execution results $e^t \sim \mathcal{E}(e^t \mid a^t, K)$. The verifier $\mathcal{V}$ evaluates $e^t$, producing a binary verification signal $v^t \sim \mathcal{V}(v^t \mid q, e^t, M^t)$. The memory is then updated deterministically, $M^{t+1} = f_{\text{mem}}(M^t, a^t, e^t, v^t)$, and the process repeats until $v^t = 1$ (termination) or a maximum turn budget is reached. Upon termination at turn $T$, the generator $\mathcal{G}$ produces the final solution $o \sim \mathcal{G}(o \mid q, M^T)$, and the trajectory $\tau = \{(a^t, e^t, v^t)\}_{t=1}^T$ records the planning, execution, and verification steps. The joint generative process factorizes as: $$p(\tau, o \mid q, K) = \mathcal{G}(o \mid q, M^T)\,\prod_{t=1}^{T} \pi_\theta(a^t \mid q, K, M^t)\,\mathcal{E}(e^t \mid a^t, K)\,\mathcal{V}(v^t \mid q, e^t, M^t).$$
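To make the loop above concrete, here is a minimal Python sketch of one rollout; the `planner`, `executor`, `verifier`, and `generator` callables and their signatures are illustrative stand-ins for the modules described in the text, not the released implementation.

```python
# Minimal sketch of the AgentFlow multi-turn loop described above.
# The module callables are assumed to wrap LLM calls whose interfaces mirror the notation in the text.

def agentflow_rollout(query, toolset, planner, executor, verifier, generator, max_turns=10):
    memory = []                                            # evolving memory M^t (list of turn records)
    trajectory = []                                        # tau = {(a^t, e^t, v^t)}
    for t in range(max_turns):
        action = planner(query, toolset, memory)           # a^t ~ pi_theta(. | q, K, M^t)
        result = executor(action, toolset)                 # e^t ~ E(. | a^t, K)
        verified = verifier(query, result, memory)         # v^t in {0, 1}
        trajectory.append((action, result, verified))
        memory = memory + [{"action": action,              # deterministic memory update:
                            "result": result,              # M^{t+1} = f_mem(M^t, a^t, e^t, v^t)
                            "verified": verified}]
        if verified:                                       # v^t = 1 terminates the loop
            break
    answer = generator(query, memory)                      # o ~ G(. | q, M^T)
    return answer, trajectory
```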
Flow-based Group Refined Policy Optimization
Optimization of AgentFlow. Given a query $q$, memory $M$, and toolset $K$, the policy generates actions for sub-goals and tool selection. It is trained via Flow-GRPO — a reinforcement learning method enabling multi-turn, stable optimization under collaborative dynamics.
Training Objective
We optimize the planner policy $\pi_\theta$ online within the AgentFlow system. For each query $(q,y^*)$, we sample $G$ on-policy trajectories $\{\tau_i\}_{i=1}^G$ where $\tau_i = \{a_i^1, \ldots, a_i^{T_i}, o_i\}$. The planner maximizes: $$\mathcal{J}(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)], \quad \theta^\star=\arg\max_\theta \mathcal{J}(\theta).$$
We use a final-outcome reward: every action receives the same trajectory-level signal based on solution correctness: $$r = R(a^t) = \bar{R}(o, q, y^*), \quad \forall t = 1,\dots,T,$$ where $\bar{R}(o, q, y^*) \in \{0, 1\}$ is determined by an LLM-as-judge. This broadcasts the global success signal to all intermediate decisions.
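As a concrete illustration of this reward broadcast (a minimal sketch; `llm_judge` is a hypothetical stand-in for the LLM-as-judge, not a released API):

```python
def broadcast_outcome_reward(final_answer, query, gold_answer, num_turns, llm_judge):
    """Score the final solution once with the judge, then copy that score to every turn."""
    outcome = float(llm_judge(final_answer, query, gold_answer))  # R_bar(o, q, y*) in {0, 1}
    return [outcome] * num_turns                                  # r = R(a^t) for all t = 1..T
```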
Flow-GRPO Formulation
Let $s_i^t=(q, K, M_i^t)$ be the state at turn $t$ of rollout $i$, and $a_i^t$ the planner's action (a token sequence of length $|a_i^t|$). The objective is the group-relative clipped surrogate applied at every turn: $$\mathcal{J}_{\text{Flow-GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{T_i}\sum_{t=1}^{T_i} \frac{1}{|a_i^t|}\sum_{j=1}^{|a_i^t|} \min\!\Big( \rho_{i,j}^t A_i^t,\; \mathrm{clip}\big(\rho_{i,j}^t, 1-\epsilon, 1+\epsilon\big) A_i^t \Big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \right],$$ where $\rho_{i,j}^t = \frac{\pi_\theta(a_{i,j}^t \mid s_i^t, a_{i,<j}^t)}{\pi_{\theta_{\mathrm{old}}}(a_{i,j}^t \mid s_i^t, a_{i,<j}^t)}$ is the token-level importance ratio, $\epsilon$ is the clipping range, and $\beta$ weights the KL penalty toward a reference policy $\pi_{\mathrm{ref}}$.
The advantage is group-normalized to reduce variance: $$A_i^t = \frac{\bar{R}(o_i, q, y^*) - \mathrm{mean}\left( \{ \bar{R}(o_k, q, y^*) \}_{k=1}^{G} \right)}{\mathrm{std}\left( \{ \bar{R}(o_k, q, y^*) \}_{k=1}^{G} \right)}.$$ By broadcasting a single trajectory-level reward to all turns, we decompose multi-turn RL into tractable single-turn policy updates.
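The sketch below shows how the group-normalized advantage and a turn-wise clipped surrogate could be assembled with PyTorch; the tensor shapes, the `clip_eps` value, and the per-token log-probability inputs are assumptions for illustration rather than the paper's released training code.

```python
import torch

def flow_grpo_advantages(group_rewards, eps=1e-6):
    """Group-normalized advantages: one scalar per rollout, broadcast to all of its turns."""
    r = torch.as_tensor(group_rewards, dtype=torch.float32)   # shape [G], R_bar per rollout
    return (r - r.mean()) / (r.std() + eps)                    # A_i, reused at every turn t

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO/GRPO-style clipped objective for one planner action (a token sequence a^t)."""
    ratio = torch.exp(logp_new - logp_old)                     # token-level importance ratios
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return torch.min(unclipped, clipped).mean()                # average over tokens of a^t
```

In a full Flow-GRPO update, these per-action surrogates would be averaged over turns and rollouts, with the KL penalty toward a reference policy added as in the objective above.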
Featured Tools
AgentFlow leverages a diverse set of specialized tools to accomplish complex reasoning tasks
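One way to picture the toolset $K$ is as a registry mapping tool names to callables that the executor can invoke; the tool names below follow those mentioned in the analysis (Google Search, Wikipedia search, web search, a Python coder), while the stub bodies and signatures are purely illustrative.

```python
# Hypothetical toolset registry: each tool maps a planner sub-goal to a textual result.
# The stub bodies only return placeholders; real tools would call search or execution backends.

def google_search(sub_goal: str) -> str:
    return f"[google_search results for: {sub_goal}]"

def wikipedia_search(sub_goal: str) -> str:
    return f"[wikipedia_search results for: {sub_goal}]"

def web_search(sub_goal: str) -> str:
    return f"[web_search results for: {sub_goal}]"

def python_coder(sub_goal: str) -> str:
    return f"[python_coder output for: {sub_goal}]"

TOOLSET = {
    "google_search": google_search,
    "wikipedia_search": wikipedia_search,
    "web_search": web_search,
    "python_coder": python_coder,
}
```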
Case Study Visualization
Experimental Results
Main Results
To comprehensively evaluate the tool-use capabilities of AgentFlow, we conduct experiments on four types of reasoning tasks: (1) Knowledge-intensive search, including Bamboogle, 2Wiki, HotpotQA, and Musique; (2) Agentic reasoning, such as GAIA (where we adopt the textual split); (3) Logic-dense mathematical reasoning, including AIME 2024, AMC 23, and Game of 24; and (4) Scientific reasoning, including GPQA and MedQA.
Accuracy comparison on search-intensive and agentic tasks. 7B-Base refers to Qwen-2.5-7B-Base and 7B-Inst refers to Qwen-2.5-7B-Instruct. AutoGen and our AgentFlow method are agentic systems, which use Qwen-2.5-7B-Instruct for the LLM-powered agents and tools for a fair comparison. The Δ columns show the gains of AgentFlow over each baseline.
Baselines: We compare against four categories of baselines: (1) Open-source LLMs: Qwen-2.5 (7B, 14B, 32B) and Llama-3.3-70B; (2) Proprietary LLMs: GPT-4o-mini and GPT-4o; (3) Tool-integrated reasoning LLMs: Supervised Fine-Tuning (SFT), Iter-RetGen, Search-R1, ZeroSearch, ReSearch, StepSearch, and VerlTool; (4) Training-free agentic system: AutoGen.
Accuracy comparison on mathematical and scientific reasoning tasks. The Δ columns show the gains of AgentFlow over each baseline.
Baselines: We compare against five categories of baselines: (1) Open-source LLMs: Qwen-2.5 (7B, 14B), Llama-3.3-70B, and Llama-3.1-405B; (2) Proprietary LLMs: GPT-4o-mini and GPT-4o; (3) Reasoning LLMs: Supervised Fine-Tuning (SFT), SimpleRL-reason, Open-Reasoner-Zero, General-Reasoner, and Luffy; (4) Tool-integrated reasoning LLMs: TIR and ToRL; (5) Training-free agentic system: AutoGen.
In-Depth Analysis
We conduct comprehensive analyses to understand the effectiveness of Flow-GRPO and the behavior of AgentFlow across various dimensions.
Impact of Planner Training Strategies. Experiments demonstrate that training the planner with the online reinforcement learning method, Flow-GRPO, yields a significant 17.2% performance improvement, whereas traditional offline Supervised Fine-Tuning (SFT) results in a catastrophic 19.0% performance collapse.
Optimized and Adaptive Tool Selection. After optimization with Flow-GRPO, the planner learns to select the most appropriate tools for different tasks, such as increasing the use of Google Search for the broad-knowledge 2Wiki task while shifting to the more specialized Wikipedia and Web Search for the domain-specific MedQA task.
Enhanced Tool-Calling Reliability. The Flow-GRPO training process enhances tool-calling reliability, as evidenced by a consistent decrease in the tool-calling error rate across all tasks, with a reduction of up to 28.4% on the GAIA task.
Superior Training Efficiency and Stability. Analysis of training dynamics reveals that Flow-GRPO not only continuously increases rewards (accuracy) while shortening response length but also achieves more stable and sustained performance growth compared to traditional monolithic methods like ToRL.
Consistent Gains Across Model Scales. Flow-GRPO's online fine-tuning delivers consistent and effective performance gains on AgentFlow with both 3B and 7B backbone models.
Performance Scaling with Inference Turns. During the inference phase, increasing the maximum allowed interaction turns from 3 to 10 enables AgentFlow to conduct deeper reasoning, leading to continuous improvements in final performance across all tasks.
Adaptability to Upgraded Tool Engines. The trained AgentFlow system demonstrates strong adaptability, as its overall performance significantly improves when its internal tool engines are upgraded from Qwen-2.5-7B-Instruct to the more powerful GPT-4o.
BibTeX
@article{li2025flow,
title={In-the-Flow Agentic System Optimization for Effective Planning and Tool Use},
author={Li, Zhuofeng and Zhang, Haoxiang and Han, Seungju and Liu, Sheng and Xie, Jianwen and Zhang, Yu and Choi, Yejin and Zou, James and Lu, Pan},
journal={arXiv preprint arXiv:2510.05592},
year={2025}
}