SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Haowen Liu*1
Shaoxiong Yao*2
Haonan Chen3
Jiawei Gao3
Jiayuan Mao4,5 Jia-Bin Huang1 Yilun Du3
1UMD 2UIUC 3Harvard 4Amazon FAR 5UPenn
(* indicates equal contribution)
TL;DR: We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips Vision-Language Models (VLMs) with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training.
From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning in a physically grounded way.
Method Overview
Pipeline
Method pipeline. Our method begins by instantiating a physics simulator from the real-world scene. A VLM-based action sampler and optimizer then iteratively refine the action sequence toward task success, using simulated rollouts as context. The final optimized actions are executed in the real world.
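To make the loop concrete, here is a minimal Python sketch of the sample-simulate-refine cycle described above. All names here (vlm_propose, vlm_refine, Simulator, the scoring rule) are illustrative stand-ins under assumed interfaces, not the actual SIMPACT implementation.

"""Minimal sketch of a simulation-in-the-loop planning cycle.
Hypothetical stand-ins throughout: a real system would call a VLM
and step a physics engine instead of these placeholders."""

from dataclasses import dataclass


@dataclass
class Rollout:
    actions: list    # candidate action sequence
    score: float     # task-success score from the simulated rollout
    summary: str     # feedback summary fed back to the VLM as context


class Simulator:
    """Placeholder for a physics simulator instantiated from the scene."""

    def rollout(self, actions: list) -> Rollout:
        # A real implementation would step a mesh- or particle-based
        # simulation and evaluate task success; this is a dummy objective.
        score = -sum(abs(a) for a in actions)
        return Rollout(actions, score, summary=f"score={score:.3f}")


def vlm_propose(task: str, n: int) -> list:
    """Stand-in for prompting the VLM to sample n candidate action sequences."""
    return [[0.1 * i, -0.05 * i] for i in range(1, n + 1)]


def vlm_refine(task: str, rollouts: list) -> list:
    """Stand-in for prompting the VLM with rollout feedback to refine actions."""
    best = max(rollouts, key=lambda r: r.score)
    # A real VLM would reason over rendered rollouts; here we just
    # perturb the best candidate to mimic refinement.
    return [[a * 0.9 for a in best.actions] for _ in rollouts]


def plan(task: str, sim: Simulator, n_candidates: int = 4, n_iters: int = 3) -> list:
    candidates = vlm_propose(task, n_candidates)
    best = None
    for _ in range(n_iters):
        rollouts = [sim.rollout(a) for a in candidates]  # simulate each candidate
        top = max(rollouts, key=lambda r: r.score)
        if best is None or top.score > best.score:
            best = top
        candidates = vlm_refine(task, rollouts)          # iterate with feedback
    return best.actions                                  # executed on the real robot


if __name__ == "__main__":
    actions = plan("push the box without toppling it", Simulator())
    print("optimized actions:", actions)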
Simulation construction from a single RGB-D image. Given an RGB-D image and a language task description, our pipeline automatically generates either a mesh-based simulation (top) for rigid objects or a particle-based simulation (bottom) for deformables. In both cases, we prompt the VLM to infer the relevant physical parameters required for simulation.
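Below is a hedged sketch of how this construction step might be wired up. The keyword-based rigid/deformable routing and the parameter set (mass, friction, stiffness) are assumptions for illustration only; the paper's actual prompts, routing, and parameters may differ.

"""Hypothetical sketch of the simulation-construction step:
route the scene to a mesh- or particle-based simulation and
query a VLM for the physical parameters the simulator needs."""

import json
from dataclasses import dataclass


@dataclass
class SceneSim:
    kind: str      # "mesh" (rigid objects) or "particle" (deformables)
    params: dict   # VLM-inferred physical parameters


def query_vlm(prompt: str) -> str:
    """Stand-in for a VLM call; returns physical parameters as JSON."""
    return json.dumps({"mass": 0.3, "friction": 0.5, "stiffness": 200.0})


def build_simulation(rgbd, task: str) -> SceneSim:
    # Assumed heuristic: deformable keywords in the task description
    # route to a particle-based sim; everything else gets a mesh-based
    # rigid-body sim. (rgbd would seed object geometry in a real system.)
    deformable = any(w in task.lower() for w in ("rope", "dough", "play-doh", "cloth"))
    kind = "particle" if deformable else "mesh"

    prompt = (
        f"Task: {task}\n"
        f"Estimate physical parameters (mass, friction, stiffness) "
        f"for a {kind}-based simulation as JSON."
    )
    params = json.loads(query_vlm(prompt))
    return SceneSim(kind=kind, params=params)


if __name__ == "__main__":
    sim = build_simulation(rgbd=None, task="Tie the rope into a knot")
    print(sim)  # SceneSim(kind='particle', params={...})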
Results
Non-toppling Push
Bowl Stacking
Pivoting
Rope Manipulation
Play-Doh Manipulation
Baseline Comparison
Non-toppling Push
Bowl Stacking
Pivoting
Rope Manipulation
Play-Doh Manipulation
Robustness Against Variations
Variation: Different reference objects
Variation: Additional distractors
Variation: Different rope materials & thickness
Variation: Different Play-Doh shape & color
Citation
@article{simpact2025,
title={SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models},
author={Liu, Haowen and Yao, Shaoxiong and Chen, Haonan and Gao, Jiawei and Mao, Jiayuan and Huang, Jia-Bin and Du, Yilun},
journal={arXiv preprint arXiv:2512.05955},
year={2025}
}