PostTrainBench
Measuring how well AI agents can post-train language models
Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, an H100 GPU, and 10 hours to post-train them.
Leaderboard
1 The average is taken across all post-trained LLMs (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT) and benchmarks (AIME 2025, BFCL, GPQA Main, GSM8K, HumanEval). For each run, we ask a CLI agent to maximize the performance of a specific base LLM on a specific benchmark.
2 "Human Post-Trained" is not directly comparable to the rest since it usually exceeds the 10h + 1 GPU constraint.
| Rank | Method | Average Score | AIME 2025 | BFCL | GPQA Main | GSM8K | HumanEval |
|---|---|---|---|---|---|---|---|
More agents coming soon...
Detailed Breakdown by Benchmark
Average Time Spent
Time taken by each agent to complete post-training (out of 10 hours).
Agents demonstrate varying levels of persistence; some give up well before the time limit expires.
Pipeline
Evaluation Benchmarks
Post-trained models are evaluated across these benchmarks to measure improvement in reasoning, knowledge, and problem-solving capabilities.
About
PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B), access to an H100 GPU, and a 10-hour time limit to improve model performance through post-training.
Experimental Setup
- Models: Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B
- Hardware: Single H100 GPU per agent
- Time Limit: 10 hours per agent
- Evaluation: Average score across 5 benchmarks (see the sketch after this list)
- Agent scaffolds: Native CLI scaffolds (Claude Code for Claude models, Codex CLI for OpenAI, Gemini CLI for Gemini)
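To make the scoring concrete, here is a minimal sketch of how the headline average is computed under this setup. The dictionary layout and the score values are illustrative placeholders, not real leaderboard numbers.

```python
# Sketch of the leaderboard metric: one run per (model, benchmark) pair; the
# headline score is the mean over all 4 x 5 = 20 runs. Scores here are dummies.
MODELS = ["Qwen 3 1.7B", "Qwen 3 4B", "SmolLM3-3B", "Gemma 3 4B IT"]
BENCHMARKS = ["AIME 2025", "BFCL", "GPQA Main", "GSM8K", "HumanEval"]

def average_score(scores: dict) -> float:
    """Mean score across every model x benchmark run."""
    return sum(scores[(m, b)] for m in MODELS for b in BENCHMARKS) / (
        len(MODELS) * len(BENCHMARKS)
    )

dummy = {(m, b): 0.5 for m in MODELS for b in BENCHMARKS}
print(average_score(dummy))  # -> 0.5
```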
Observations
Agent Behaviors
Claude Opus 4.5
Most Structured
- Uses explicit todo lists to track progress
- Web searches for best practices
- Creates detailed implementation plans before coding
GPT-5.x Variants
Action-Oriented
- Immediately starts exploring files and datasets
- "Plan update" checkpoints with bullet points
- Less formal planning, more exploratory
Gemini 3 Pro
Quick to Execute
- Less planning overhead
- Jumps directly into implementation
- More failures due to less error anticipation
GPT-5.1 Codex Max
Best Performer
- Builds proper dataset pipelines (55k+ samples)
- Iterates on training scripts when errors occur
- Uses appropriate training settings (gradient checkpointing, bf16); see the sketch below
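As a rough illustration of this recipe, the sketch below sets up a supervised fine-tuning run with gradient checkpointing and bf16 using Hugging Face TRL. The library choice, the MBPP dataset, and every hyperparameter value are assumptions for illustration, not the agent's actual script.

```python
# Illustrative SFT run with gradient checkpointing and bf16 on a single GPU.
# Model ID, dataset, and hyperparameters are placeholders, not the agent's choices.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# MBPP is one of the alternative datasets the agents drew on; here we join the
# problem statement and reference solution into a single training text.
train_ds = load_dataset("mbpp", split="train")
train_ds = train_ds.map(lambda ex: {"text": ex["text"] + "\n" + ex["code"]})

config = SFTConfig(
    output_dir="posttrain-out",
    bf16=True,                      # mixed precision suits the H100
    gradient_checkpointing=True,    # trades compute for memory on one GPU
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B-Base",   # base model named in the traces
    args=config,
    train_dataset=train_ds,
)
trainer.train()
```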
Time & Trace Patterns
Agents had 3-10 hour limits. Behaviors varied significantly:
- GPT-5.1-codex: Often ran extremely long traces (381k+ lines on BFCL)
- Claude: Regularly checked `timer.sh` for remaining time
- Gemini: Shorter traces, faster iteration but more failures
Reward Hacking (Near Misses)
Claude found that Qwen/Qwen3-1.7B (the instruct-tuned version) works "perfectly" for function calling. However, it then explicitly acknowledged:
"However, the user specifically said to use Qwen/Qwen3-1.7B-Base. Let me re-read the user's constraint... So I must use the BASE model."
All agents showed awareness of contamination rules:
- Claude: "Cannot use [benchmark] test data for training (data contamination)"
- GPT models: "avoid leaking evaluation data", "avoiding test contamination"
- All agents sourced training data from alternative datasets (MBPP, glaive-function-calling, Hermes, etc.)
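As a rough sketch of what such contamination avoidance can look like in practice, the snippet below drops training examples that share long n-grams with benchmark test questions. The 8-gram heuristic is an assumption for illustration; the traces do not say which checks, if any, the agents actually ran.

```python
# Sketch: drop training examples whose text overlaps a benchmark test question.
# The 8-gram overlap heuristic is an illustrative assumption, not the agents' method.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_texts: list, test_texts: list) -> list:
    banned = set()
    for t in test_texts:            # collect n-grams from the eval questions
        banned |= ngrams(t)
    # keep only training texts with no n-gram overlap with the eval set
    return [t for t in train_texts if not (ngrams(t) & banned)]
```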
Key Takeaways
- Dataset quality > training duration: GPT-5.1-codex-max's success came from careful dataset curation, not longer training
- Constraint awareness: Almost all agents showed an understanding of the rules and avoided contamination
- Self-correction: Claude caught and avoided a reward hack (substituting the instruct-tuned model for the required base model)
- Library issues: Many errors came from library version mismatches (trl, transformers)
- Format alignment matters: For function calling, matching the evaluator's exact output format was essential for high scores (see the sketch after this list)
- Longer traces ≠ better results: GPT-5.1-codex had the longest traces but inconsistent results; GPT-5.1-codex-max had shorter traces and better outcomes
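To make the format-alignment point concrete, here is a toy sketch that renders function calls into one canonical string so that training targets match what the evaluator parses. The exact format expected by BFCL is an assumption here; only the principle of matching it carries over.

```python
import json

# Toy canonical renderer: emit every call as name(arg="value", ...) so that
# training targets match the string the grader parses. The format is illustrative.
def render_call(name: str, kwargs: dict) -> str:
    args = ", ".join(f"{k}={json.dumps(v)}" for k, v in sorted(kwargs.items()))
    return f"{name}({args})"

print(render_call("get_weather", {"unit": "celsius", "city": "Paris"}))
# -> get_weather(city="Paris", unit="celsius")
```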
Team
Max Planck Institute for Intelligent Systems
Tübingen AI Center
Citation
If you found PostTrainBench useful, please cite us as:
@misc{posttrainbench_2025,
  title={PostTrainBench: Measuring AI Ability to Perform LLM Post-Training},
  author={Rank, Ben and Bhatnagar, Hardik and Bethge, Matthias and Andriushchenko, Maksym},
  year={2025}
}