PostTrainBench
Measuring how well AI agents can post-train language models
Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, an H100 GPU, and 10 hours to post-train them.
Leaderboard
1 The average is taken across all post-trained LLMs (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT) and benchmarks (AIME 2025, BFCL, GPQA Main, GSM8K, HumanEval). For each run, we ask a CLI agent to maximize the performance of a specific base LLM on a specific benchmark.
2 "Human Post-Trained" is not directly comparable to the rest since it usually exceeds the 10h + 1 GPU constraint.
| Rank | Method | Average Score | AIME 2025 | BFCL | GPQA Main | GSM8K | HumanEval |
|---|---|---|---|---|---|---|---|
More agents coming soon...
Detailed Breakdown by Benchmark
Average Time Spent
Time taken by each agent to complete post-training (out of 10 hours).
Agents demonstrate varying levels of persistence; some give up well before the time limit expires.
Pipeline
Evaluation Benchmarks
Post-trained models are evaluated across these benchmarks to measure improvement in reasoning, knowledge, and problem-solving capabilities.
About
PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B), access to an H100 GPU, and a 10-hour time limit to improve model performance through post-training.
Experimental Setup
- Models: Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B
- Hardware: Single H100 GPU per agent
- Time Limit: 10 hours per agent
- Evaluation: Average score across 5 benchmarks (see the sketch after this list)
- Agent scaffolds: Native CLI scaffolds (Claude Code for Claude models, Codex CLI for OpenAI, Gemini CLI for Gemini)
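To make the scoring concrete, here is a minimal sketch of how the headline average is computed under this setup. The dictionary layout and the score values are illustrative placeholders, not real leaderboard numbers.

```python
# Sketch of the leaderboard metric: one run per (model, benchmark) pair; the
# headline score is the mean over all 4 x 5 = 20 runs. Scores here are dummies.
MODELS = ["Qwen 3 1.7B", "Qwen 3 4B", "SmolLM3-3B", "Gemma 3 4B IT"]
BENCHMARKS = ["AIME 2025", "BFCL", "GPQA Main", "GSM8K", "HumanEval"]

def average_score(scores: dict) -> float:
    """Mean score across every model x benchmark run."""
    return sum(scores[(m, b)] for m in MODELS for b in BENCHMARKS) / (
        len(MODELS) * len(BENCHMARKS)
    )

dummy = {(m, b): 0.5 for m in MODELS for b in BENCHMARKS}
print(average_score(dummy))  # -> 0.5
```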
Observations
Agent Behaviors
Claude Opus 4.5
Most Structured
- Uses explicit todo lists to track progress
- Web searches for best practices
- Creates detailed implementation plans before coding
GPT-5.x Variants
Action-Oriented
- Immediately starts exploring files and datasets
- "Plan update" checkpoints with bullet points
- Less formal planning, more exploratory
Gemini 3 Pro
Quick to Execute
- Less planning overhead
- Jumps directly into implementation
- More failures due to less error anticipation
GPT-5.1 Codex Max
Best Performer
- Builds proper dataset pipelines (55k+ samples)
- Iterates on training scripts when errors occur
- Uses appropriate training settings (gradient checkpointing, bf16); see the sketch below
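As a rough illustration of this recipe, the sketch below sets up a supervised fine-tuning run with gradient checkpointing and bf16 using Hugging Face TRL. The library choice, the MBPP dataset, and every hyperparameter value are assumptions for illustration, not the agent's actual script.

```python
# Illustrative SFT run with gradient checkpointing and bf16 on a single GPU.
# Model ID, dataset, and hyperparameters are placeholders, not the agent's choices.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# MBPP is one of the alternative datasets the agents drew on; here we join the
# problem statement and reference solution into a single training text.
train_ds = load_dataset("mbpp", split="train")
train_ds = train_ds.map(lambda ex: {"text": ex["text"] + "\n" + ex["code"]})

config = SFTConfig(
    output_dir="posttrain-out",
    bf16=True,                      # mixed precision suits the H100
    gradient_checkpointing=True,    # trades compute for memory on one GPU
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B-Base",   # base model named in the traces
    args=config,
    train_dataset=train_ds,
)
trainer.train()
```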
Time & Trace Patterns
Agents had 3-10 hour limits. Behaviors varied significantly:
- GPT-5.1-codex: Often ran extremely long traces (381k+ lines on BFCL)
- Claude: Regularly checked `timer.sh` for remaining time
- Gemini: Shorter traces, faster iteration but more failures
Reward Hacking (Near Misses)
Claude found that Qwen/Qwen3-1.7B (the instruct-tuned version) works "perfectly" for function calling. However, it then explicitly acknowledged:
"However, the user specifically said to use Qwen/Qwen3-1.7B-Base. Let me re-read the user's constraint... So I must use the BASE model."
All agents showed awareness of contamination rules:
- Claude: "Cannot use [benchmark] test data for training (data contamination)"
- GPT models: "avoid leaking evaluation data", "avoiding test contamination"
- All agents sourced training data from alternative datasets (MBPP, glaive-function-calling, Hermes, etc.)
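As a rough sketch of what such contamination avoidance can look like in practice, the snippet below drops training examples that share long n-grams with benchmark test questions. The 8-gram heuristic is an assumption for illustration; the traces do not say which checks, if any, the agents actually ran.

```python
# Sketch: drop training examples whose text overlaps a benchmark test question.
# The 8-gram overlap heuristic is an illustrative assumption, not the agents' method.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_texts: list, test_texts: list) -> list:
    banned = set()
    for t in test_texts:            # collect n-grams from the eval questions
        banned |= ngrams(t)
    # keep only training texts with no n-gram overlap with the eval set
    return [t for t in train_texts if not (ngrams(t) & banned)]
```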
Key Takeaways
- Dataset quality > training duration: GPT-5.1-codex-max's success came from careful dataset curation, not longer training
- Constraint awareness: Almost all agents showed an understanding of the rules and avoided contamination
- Self-correction: Claude caught and avoided a reward hack (substituting the instruct-tuned model for the required base model)
- Library issues: Many errors came from library version mismatches (trl, transformers)
- Format alignment matters: For function calling, matching the evaluator's exact output format was essential for high scores (see the sketch after this list)
- Longer traces ≠ better results: GPT-5.1-codex had the longest traces but inconsistent results; GPT-5.1-codex-max had shorter traces and better outcomes
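To make the format-alignment point concrete, here is a toy sketch that renders function calls into one canonical string so that training targets match what the evaluator parses. The exact format expected by BFCL is an assumption here; only the principle of matching it carries over.

```python
import json

# Toy canonical renderer: emit every call as name(arg="value", ...) so that
# training targets match the string the grader parses. The format is illustrative.
def render_call(name: str, kwargs: dict) -> str:
    args = ", ".join(f"{k}={json.dumps(v)}" for k, v in sorted(kwargs.items()))
    return f"{name}({args})"

print(render_call("get_weather", {"unit": "celsius", "city": "Paris"}))
# -> get_weather(city="Paris", unit="celsius")
```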
Team
Max Planck Institute for Intelligent Systems
Tübingen AI Center
Citation
If you found PostTrainBench useful, please cite us as:
@misc{posttrainbench_2025,
  title={PostTrainBench: Measuring AI Ability to Perform LLM Post-Training},
  author={Rank, Ben and Bhatnagar, Hardik and Bethge, Matthias and Andriushchenko, Maksym},
  year={2025}
}