This survey reviews state-of-the-art 3D and 4D world models: systems that learn, predict, and simulate the geometry and dynamics of real environments from multi-modal signals.
We unify terminology, scope, and evaluations, and organize the space into three complementary paradigms by representation:
VideoGen: learn generative or predictive models from sequential video streams with geometric and temporal constraints. VideoGen focuses on long-horizon consistency, controllability, and scene-level generation, enabling agents to imagine or forecast plausible video rollouts (a sketch of this rollout loop follows the three paradigm descriptions).
OccGen: model 3D/4D occupancy grids that encode geometry and semantics in voxel space. OccGen provides a physics-consistent scaffold for robust perception, forecasting, and simulation, bridging low-level sensor data and high-level reasoning.
LiDARGen: leverage point cloud sequences from LiDAR sensors to generate or predict geometry-grounded scenes. LiDARGen emphasizes high-fidelity 3D structure, robustness to environment changes, and applications in safety-critical domains such as autonomous driving.
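All three paradigms share the same high-level mechanic: a learned one-step model is unrolled over time to produce a long-horizon rollout, whether the states are video frames, voxel grids, or point clouds. The sketch below is a hedged illustration of that loop; `step_model`, `history`, and `actions` are hypothetical names, not the API of any surveyed method.

```python
# Hedged sketch of the autoregressive rollout loop shared by the three paradigms.
# `step_model` is a hypothetical learned predictor: it maps the states seen so far
# plus one conditioning action to the next state (a frame, voxel grid, or point cloud).

def rollout(step_model, history, actions):
    """Unroll `step_model`, producing one future state per action."""
    states = list(history)   # past observations, oldest first
    futures = []
    for action in actions:
        nxt = step_model(states, action)  # predict the next state from everything seen so far
        futures.append(nxt)
        states.append(nxt)                # feed the prediction back in (autoregression)
    return futures
```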
If you find this work helpful for your research, please consider citing our papers:
```bibtex
@article{survey_3d_4d_world_models,
  title   = {{3D} and {4D} World Modeling: A Survey},
  author  = {Lingdong Kong and Wesley Yang and Jianbiao Mei and Youquan Liu and Ao Liang and Dekai Zhu and Dongyue Lu and Wei Yin and Xiaotao Hu and Mingkai Jia and Junyuan Deng and Kaiwen Zhang and Yang Wu and Tianyi Yan and Shenyuan Gao and Song Wang and Linfeng Li and Liang Pan and Yong Liu and Jianke Zhu and Wei Tsang Ooi and Steven C. H. Hoi and Ziwei Liu},
  journal = {arXiv preprint arXiv:2509.07996},
  year    = {2025}
}

@article{worldlens,
  title   = {{WorldLens}: Full-Spectrum Evaluations of Driving World Models in Real World},
  author  = {Ao Liang and Lingdong Kong and Tianyi Yan and Hongsi Liu and Wesley Yang and Ziqi Huang and Wei Yin and Jialong Zuo and Yixuan Hu and Dekai Zhu and Dongyue Lu and Youquan Liu and Guangfeng Jiang and Linfeng Li and Xiangtai Li and Long Zhuo and Lai Xing Ng and Benoit R. Cottereau and Changxin Gao and Liang Pan and Wei Tsang Ooi and Ziwei Liu},
  journal = {arXiv preprint arXiv:2512.10958},
  year    = {2025}
}
```
World modeling has become a cornerstone of modern AI, enabling agents to understand, represent, and predict dynamic environments. While prior research has focused primarily on 2D images and videos, the rapid emergence of native 3D and 4D representations (e.g., RGB-D, occupancy grids, LiDAR point clouds) calls for a dedicated study.
What Are Native 3D Representations?
Unlike 2D projections, native 3D/4D signals directly encode metric geometry, visibility, and motion in the physical coordinates where agents act. Examples include (a toy data-layout sketch follows this list):
RGB-D imagery (2D images with depth channels)
Occupancy grids (voxelized maps of free vs. occupied space)
LiDAR point clouds (3D coordinates from active sensing)
Neural fields (e.g., NeRF, Gaussian Splatting)
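To make the list above concrete, here is a toy sketch of how these signals are commonly laid out as arrays. All shapes and variable names are illustrative assumptions, not conventions from any particular dataset.

```python
# Toy, assumed array layouts for the native 3D/4D signals listed above.
import numpy as np

H, W = 480, 640                                   # image resolution (assumed)
rgbd = np.zeros((H, W, 4), dtype=np.float32)      # RGB-D: 3 color channels + metric depth

X, Y, Z = 200, 200, 16                            # voxel grid resolution (assumed)
occupancy = np.zeros((X, Y, Z), dtype=np.uint8)   # occupancy grid: 0 = free, 1 = occupied
                                                  # (semantic variants store a class id per voxel)

N = 120_000                                       # points in one LiDAR sweep (assumed)
lidar = np.zeros((N, 4), dtype=np.float32)        # point cloud: x, y, z, intensity per point

# Neural fields (NeRF, Gaussian Splatting) instead encode the scene implicitly in
# network weights or a set of 3D Gaussians rather than a dense array.
# A 4D signal is a time-indexed sequence of such snapshots, e.g. shape (T, X, Y, Z).
```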
What Are World Models in 3D and 4D?
A 3D/4D world model is an internal representation that allows an agent to imagine, forecast, and interact with its environment in 3D space.
Generative World Models: synthesize plausible 3D/4D worlds under conditions (e.g., text prompts, trajectories).
Predictive World Models: anticipate the future evolution of 3D/4D scenes given past observations and actions.
Together, these models provide the foundation for simulation, planning, and embodied intelligence in complex environments; a minimal interface sketch of the two families follows.
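As a rough illustration of the distinction, the sketch below is a hedged assumption; the class names, method signatures, and condition keys are illustrative, not a standard API from the surveyed literature.

```python
# Assumed interface sketch contrasting generative and predictive 3D/4D world models.
from typing import Any, Mapping, Protocol, Sequence
import numpy as np

class GenerativeWorldModel(Protocol):
    def generate(self, condition: Mapping[str, Any]) -> Sequence[np.ndarray]:
        """Synthesize a plausible 3D/4D scene sequence from conditions,
        e.g. {"text": "rainy intersection at dusk", "trajectory": ego_poses}."""
        ...

class PredictiveWorldModel(Protocol):
    def predict(
        self,
        observations: Sequence[np.ndarray],   # past scene states (frames, grids, point clouds)
        actions: Sequence[np.ndarray],        # planned agent actions / ego poses
        horizon: int,                         # number of future steps to forecast
    ) -> Sequence[np.ndarray]:
        """Anticipate the future evolution of the scene for `horizon` steps."""
        ...
```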