Playground for Compositional Machine Design
Agentic Design Of Compositional Machines
1The Chinese University of Hong Kong (Shenzhen)
2The Chinese University of Hong Kong
* Equal advising. † Corresponding Author.
The figure illustrates the task of compositional machine design in our BesiegeField environment. It shows a high-level sketch of the agentic workflow (with Gemini 2.5 Pro), the resulting machines, and their simulated performance. The design objective is to build a machine that throws boulders as far as possible.
The Challenging Task of Compositional Machine Design
We demonstrate the motion mechanism of a human-designed trebuchet, a powerful type of medieval catapult. Its components work in close coordination, enabling the machine to launch the projectile much farther than the machines built by LLMs. We also show that omitting even a single part can cause the entire mechanism to fail, highlighting the inherent difficulty of compositional machine design.
Agentic Machine Construction
Agentic Machine Construction - Reasoning
Example chain-of-thought (CoT) reasoning from inspector agents (with Gemini 2.5 Pro). Blue text highlights the moderate capability of LLMs in spatial reasoning and imagined physical simulation.
Gallery of Tasks
Moving on rough terrains
Throwing stones far away
Throwing stones through a ring
Delivering boulders
Moving on a curved track
Picking objects from the bottom of a well
Performance Leaderboard
| Model | Single-agent Mean | Single-agent Max | Single-agent Std | Iterative Editing Mean | Iterative Editing Max | Iterative Editing Std | Hierarchical Design Mean | Hierarchical Design Max | Hierarchical Design Std |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 2.30 | 9.00 | 3.86 | 4.67 | 21.95 | 8.68 | 9.83 | 18.19 | 8.35 |
| OpenAI o3 | 2.87 | 5.22 | 1.96 | 9.14 | 14.01 | 3.71 | 2.00 | 11.11 | 3.98 |
| Qwen3-Coder-480B-A35B | 1.75 | 9.24 | 3.17 | 5.10 | 12.02 | 5.54 | 3.90 | 6.52 | 2.54 |
| Doubao Seed 1.6-250615 | 3.18 | 8.20 | 2.99 | 4.82 | 9.10 | 3.41 | 1.73 | 4.76 | 2.39 |
| Claude Opus 4-20250514 | 1.19 | 4.82 | 2.21 | 1.18 | 4.91 | 2.18 | 2.27 | 9.32 | 4.22 |
| DeepSeek-V3 | 3.50 | 4.86 | 2.17 | 3.07 | 5.24 | 2.55 | 2.41 | 4.93 | 2.58 |
| Kimi K2-0711-preview | 2.57 | 9.05 | 3.72 | 2.82 | 11.39 | 5.23 | 5.39 | 12.02 | 5.16 |
| Llama 4 Scout 17B 16E | 3.18 | 5.64 | 1.95 | 1.28 | 5.94 | 2.41 | 3.59 | 11.83 | 4.15 |

| Model | Single-agent Mean | Single-agent Max | Single-agent Std | Iterative Editing Mean | Iterative Editing Max | Iterative Editing Std | Hierarchical Design Mean | Hierarchical Design Max | Hierarchical Design Std |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 33.96 | 40.85 | 6.73 | 34.34 | 41.66 | 13.96 | 29.96 | 41.52 | 7.78 |
| OpenAI o3 | 15.28 | 32.08 | 8.97 | 14.34 | 35.08 | 11.79 | 28.39 | 36.18 | 11.01 |
| Qwen3-Coder-480B-A35B | 8.87 | 11.50 | 4.46 | 15.24 | 28.95 | 13.12 | 12.59 | 34.05 | 10.78 |
| Doubao Seed 1.6-250615 | 3.51 | 9.40 | 4.85 | 8.11 | 10.04 | 3.58 | 18.75 | 26.02 | 4.38 |
| Claude Opus 4-20250514 | 9.83 | 12.98 | 1.28 | 8.07 | 28.04 | 12.48 | 14.56 | 38.67 | 20.69 |
| DeepSeek-V3 | 9.06 | 10.53 | 3.68 | 8.23 | 18.84 | 7.12 | 17.92 | 31.94 | 12.85 |
| Kimi K2-0711-preview | 1.75 | 8.09 | 2.80 | 14.36 | 28.34 | 9.47 | 1.94 | 14.99 | 5.48 |
| Llama 4 Scout 17B 16E | 0.02 | 0.03 | 0.01 | 3.04 | 12.76 | 5.23 | 1.55 | 2.00 | 0.32 |
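Each leaderboard cell reports the Mean, Max, and Std of one model's scores under one design strategy. The page does not say how many runs each cell aggregates or which standard-deviation convention is used. The minimal sketch below therefore assumes a plain list of per-run scores and the population standard deviation (`statistics.pstdev`). The `summarize` helper is hypothetical, and the scores in the usage line are toy values, not leaderboard results.

```python
import statistics

def summarize(scores):
    """Return (mean, max, std) for a list of per-run scores.

    Hypothetical helper. Assumption: Std is the population standard
    deviation; the page does not state whether sample or population
    std is reported.
    """
    if not scores:
        raise ValueError("need at least one score")
    return (
        statistics.fmean(scores),  # arithmetic mean over runs
        max(scores),               # best single run
        statistics.pstdev(scores), # population standard deviation
    )

# Toy scores for illustration only (not taken from the leaderboard).
mean, best, std = summarize([0.0, 2.0, 4.0])
print(f"Mean={mean:.2f} Max={best:.2f} Std={std:.2f}")
```

Under this reading, a cell whose Std exceeds its Mean (for example, Gemini 2.5 Pro single-agent in the first table: 2.30 vs 3.86) indicates highly uneven runs, with a few strong designs and many weak ones.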
LLM-Agent-Generated Machines
Machines generated by agentic systems built on different LLMs.
Results from RL-finetuned LLMs
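Reinforcement learning with verifiable rewards (RLVR) improves Qwen2.5-14B-Instruct on compositional machine design, most clearly in maximum score. With Cold-Start + RL, the best catapult score rises from 2.41 to 7.14, and the best car score rises from 19.10 to 45.72.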
| Model | Catapult Validity Ratio | Catapult Mean Score | Catapult Max Score | Car Validity Ratio | Car Mean Score | Car Max Score |
|---|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 11/50 | 0.06 | 2.41 | 46/50 | 4.97 | 19.10 |
| Qwen2.5-14B-Instruct + Cold-Start | 9/50 | 0.11 | 5.54 | 40/50 | 4.67 | 20.23 |
| Qwen2.5-14B-Instruct + RL | 12/50 | 0.13 | 5.92 | 41/50 | 3.72 | 24.08 |
| Qwen2.5-14B-Instruct + Cold-Start + RL | 11/50 | 0.14 | 7.14 | 42/50 | 5.05 | 45.72 |
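The sketch below makes the comparison in the table concrete. It copies the Max Score column and computes each finetuned variant's ratio to the base Qwen2.5-14B-Instruct model. The `MAX_SCORE` dictionary and the `ratio_to_base` helper are illustrative and hypothetical, not part of the released code.

```python
# Max Score values copied from the RL table above: (Catapult, Car).
MAX_SCORE = {
    "Qwen2.5-14B-Instruct": (2.41, 19.10),
    "+ Cold-Start": (5.54, 20.23),
    "+ RL": (5.92, 24.08),
    "+ Cold-Start + RL": (7.14, 45.72),
}

def ratio_to_base(table, base="Qwen2.5-14B-Instruct"):
    """Hypothetical helper: divide each variant's per-task score by the base model's."""
    base_scores = table[base]
    return {
        name: tuple(s / b for s, b in zip(scores, base_scores))
        for name, scores in table.items()
        if name != base
    }

for name, (catapult, car) in ratio_to_base(MAX_SCORE).items():
    print(f"{name:20s} catapult x{catapult:.2f}  car x{car:.2f}")
```

With Cold-Start + RL, the maximum score grows roughly 2.96x on the catapult task and 2.39x on the car task. The car Mean Score changes far less (4.97 to 5.05), so the gains come mainly from the best designs.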
BibTeX
@article{zhang2025besiegefield,
title={Agentic Design of Compositional Machines},
author={Zhang, Wenqian and Liu, Weiyang and Liu, Zhen},
journal={arXiv preprint arXiv:2510.14980},
year={2025},
}