SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Haowen Liu*1
Shaoxiong Yao*2
Haonan Chen3
Jiawei Gao3
Jiayuan Mao4,5 Jia-Bin Huang1 Yilun Du3
1UMD 2UIUC 3Harvard 4Amazon FAR 5UPenn
(* indicates equal contribution)
TL;DR: We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips Vision-Language Models (VLMs) with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training.
From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning in a physically grounded way.
Method Overview
Pipeline
Method pipeline. Our method begins by instantiating a physics simulator from the real-world scene. A VLM-based action sampler and optimizer then iteratively refine the action sequence toward task success, using simulated rollouts as context. The final optimized actions are executed in the real world.
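To make the loop concrete, here is a minimal Python sketch of the sample-simulate-refine cycle described above. All names here (vlm_propose, vlm_refine, Simulator, the scoring rule) are illustrative stand-ins under assumed interfaces, not the actual SIMPACT implementation.

"""Minimal sketch of a simulation-in-the-loop planning cycle.
Hypothetical stand-ins throughout: a real system would call a VLM
and step a physics engine instead of these placeholders."""

from dataclasses import dataclass


@dataclass
class Rollout:
    actions: list    # candidate action sequence
    score: float     # task-success score from the simulated rollout
    summary: str     # feedback summary fed back to the VLM as context


class Simulator:
    """Placeholder for a physics simulator instantiated from the scene."""

    def rollout(self, actions: list) -> Rollout:
        # A real implementation would step a mesh- or particle-based
        # simulation and evaluate task success; this is a dummy objective.
        score = -sum(abs(a) for a in actions)
        return Rollout(actions, score, summary=f"score={score:.3f}")


def vlm_propose(task: str, n: int) -> list:
    """Stand-in for prompting the VLM to sample n candidate action sequences."""
    return [[0.1 * i, -0.05 * i] for i in range(1, n + 1)]


def vlm_refine(task: str, rollouts: list) -> list:
    """Stand-in for prompting the VLM with rollout feedback to refine actions."""
    best = max(rollouts, key=lambda r: r.score)
    # A real VLM would reason over rendered rollouts; here we just
    # perturb the best candidate to mimic refinement.
    return [[a * 0.9 for a in best.actions] for _ in rollouts]


def plan(task: str, sim: Simulator, n_candidates: int = 4, n_iters: int = 3) -> list:
    candidates = vlm_propose(task, n_candidates)
    best = None
    for _ in range(n_iters):
        rollouts = [sim.rollout(a) for a in candidates]  # simulate each candidate
        top = max(rollouts, key=lambda r: r.score)
        if best is None or top.score > best.score:
            best = top
        candidates = vlm_refine(task, rollouts)          # iterate with feedback
    return best.actions                                  # executed on the real robot


if __name__ == "__main__":
    actions = plan("push the box without toppling it", Simulator())
    print("optimized actions:", actions)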
Simulation construction from a single RGB-D image. Given an RGB-D image and a language task description, our pipeline automatically generates either a mesh-based simulation (top) for rigid objects or a particle-based simulation (bottom) for deformables. In both cases, we prompt the VLM to infer the relevant physical parameters required for simulation.
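Below is a hedged sketch of how this construction step might be wired up. The keyword-based rigid/deformable routing and the parameter set (mass, friction, stiffness) are assumptions for illustration only; the paper's actual prompts, routing, and parameters may differ.

"""Hypothetical sketch of the simulation-construction step:
route the scene to a mesh- or particle-based simulation and
query a VLM for the physical parameters the simulator needs."""

import json
from dataclasses import dataclass


@dataclass
class SceneSim:
    kind: str      # "mesh" (rigid objects) or "particle" (deformables)
    params: dict   # VLM-inferred physical parameters


def query_vlm(prompt: str) -> str:
    """Stand-in for a VLM call; returns physical parameters as JSON."""
    return json.dumps({"mass": 0.3, "friction": 0.5, "stiffness": 200.0})


def build_simulation(rgbd, task: str) -> SceneSim:
    # Assumed heuristic: deformable keywords in the task description
    # route to a particle-based sim; everything else gets a mesh-based
    # rigid-body sim. (rgbd would seed object geometry in a real system.)
    deformable = any(w in task.lower() for w in ("rope", "dough", "play-doh", "cloth"))
    kind = "particle" if deformable else "mesh"

    prompt = (
        f"Task: {task}\n"
        f"Estimate physical parameters (mass, friction, stiffness) "
        f"for a {kind}-based simulation as JSON."
    )
    params = json.loads(query_vlm(prompt))
    return SceneSim(kind=kind, params=params)


if __name__ == "__main__":
    sim = build_simulation(rgbd=None, task="Tie the rope into a knot")
    print(sim)  # SceneSim(kind='particle', params={...})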
Results
Non-toppling Push
Bowl Stacking
Pivoting
Rope Manipulation
Play-Doh Manipulation
Baseline Comparison
Non-toppling Push
Bowl Stacking
Pivoting
Rope Manipulation
Play-Doh Manipulation
Robustness Against Variations
Variation: Different reference objects
Variation: Additional distractors
Variation: Different rope materials & thickness
Variation: Different Play-Doh shape & color
Citation
@article{simpact2025,
title={SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models},
author={Liu, Haowen and Yao, Shaoxiong and Chen, Haonan and Gao, Jiawei and Mao, Jiayuan and Huang, Jia-Bin and Du, Yilun},
journal={arXiv preprint arXiv:2512.05955},
year={2025}
}