4D-LRM:
Large Space-Time Reconstruction Model From and To Any View at Any Time
Ziqiao Ma1,2 Xuweiyi Chen1,4 Shoubin Yu1,3 Sai Bi1 Kai Zhang1 Ziwen Chen1,5 Sihan Xu2 Jianing Yang1,2
Zexiang Xu1 Kalyan Sunkavalli1 Mohit Bansal3 Joyce Chai2 Hao Tan1
1Adobe Research 2University of Michigan 3UNC Chapel Hill 4University of Virginia 5Oregon State University
NeurIPS 2025
Abstract
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass in under 1.5 seconds on a single A100 GPU.
Key Insights
- 4D-LRM adopts a clean and minimal Transformer design to reconstruct dynamic objects from sparse, posed views across arbitrary times and viewpoints.
- 4D-LRM unifies space and time by predicting 4D Gaussian primitives directly from multi-view tokens (see the sketch after this list).
- 4D-LRM scales effectively with data and model size with strong generalization and efficient inference.
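The bullets above summarize the whole design; below is a minimal PyTorch-style sketch of that token-to-Gaussian pipeline. The patch size, the assumed per-pixel channel layout (RGB + Plücker rays + timestamp), and all module names are illustrative assumptions, not the released 4D-LRM code.

import torch
import torch.nn as nn

class SpaceTimeLRMSketch(nn.Module):
    """Patchify posed, timestamped frames, mix tokens with a plain Transformer,
    and decode every token back into per-pixel 4D Gaussian parameters."""

    def __init__(self, patch=8, dim=768, layers=12, heads=12, gauss_dim=20):
        super().__init__()
        in_ch = 3 + 6 + 1  # RGB + Plücker ray + normalized timestamp (assumed layout)
        self.patch, self.gauss_dim = patch, gauss_dim
        self.tokenize = nn.Linear(in_ch * patch * patch, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, layers)
        self.to_gaussians = nn.Linear(dim, gauss_dim * patch * patch)

    def forward(self, frames):
        # frames: (B, V, 10, H, W) -- V posed input views at arbitrary timestamps
        B, V, C, H, W = frames.shape
        p = self.patch
        patches = frames.unfold(3, p, p).unfold(4, p, p)           # (B, V, C, H/p, W/p, p, p)
        tokens = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, -1, C * p * p)
        feats = self.backbone(self.tokenize(tokens))               # joint space-time attention
        return self.to_gaussians(feats).reshape(B, -1, self.gauss_dim)

A single forward pass over all posed input frames yields one 4D Gaussian per input pixel, which can then be rasterized at any query view and time.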
Method
4D Point Cloud Visualization
4D-LRM Results (Consistent4D)
Video gallery: each Consistent4D example is rendered from front, back, left, right, and turntable views.
4D-LRM Results (Objaverse Test)
Video gallery: each Objaverse test example is rendered from front, back, left, right, and turntable views.
Scaling Behaviors
Training-Time Scaling
- 4D-LRM-Base: Transformer with a hidden dimension of 768, 12 layers, and 12 attention heads, trained with 12 random input views and 12 random target views. No free Gaussians.
- #Target x 2: Trained with 12 random input views and 24 random target views.
- w/ HexPlane: Instead of the unified space-time representation, an alternative 4DGS representation with a decomposed neural voxel encoding inspired by HexPlane.
- w/ Temp Align: Similar to the idea of pixel-aligned Gaussians, we force μ_t to the input frame time, reducing the parameterization to dim_4DGS = 19 (see the sketch after this list).
- w/ Free GS: Trained with N = 1024 free Gaussian tokens from scratch.
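To connect the "w/ Temp Align" variant to the parameterization, the sketch below shows one plausible way to split a 20-channel per-pixel prediction into 4D Gaussian parameters and how the temporal mean and scale modulate opacity at a query time. The channel ordering and activations are assumptions for illustration, not the paper's exact definition.

import torch

def split_4dgs(params):
    """Split a (N, 20) per-pixel prediction into 4D Gaussian parameters
    (assumed ordering and activations)."""
    mu_xyz, mu_t = params[:, 0:3], params[:, 3:4]     # 4D mean: spatial center + temporal center
    scale = params[:, 4:8].exp()                       # 4D scale: 3 spatial + 1 temporal
    rot_l, rot_r = params[:, 8:12], params[:, 12:16]   # two quaternions parameterizing a 4D rotation
    opacity = params[:, 16:17].sigmoid()
    rgb = params[:, 17:20].sigmoid()
    return mu_xyz, mu_t, scale, (rot_l, rot_r), opacity, rgb

def opacity_at_time(opacity, mu_t, sigma_t, t):
    # Marginal temporal Gaussian: a primitive is most visible near its temporal
    # center mu_t and fades with temporal scale sigma_t, so any continuous query
    # time t can be rendered -- the "infinite frame rate" claim in the abstract.
    return opacity * torch.exp(-0.5 * ((t - mu_t) / sigma_t) ** 2)

Under the "w/ Temp Align" variant, μ_t is not predicted but fixed to the timestamp of the frame that produced the pixel, which removes one channel and gives dim_4DGS = 19.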
Inference-Time Scaling
BibTeX
@article{ma20254dlrm,
  title={4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time},
  author={Ziqiao Ma and Xuweiyi Chen and Shoubin Yu and Sai Bi and Kai Zhang and Ziwen Chen and Sihan Xu and Jianing Yang and Zexiang Xu and Kalyan Sunkavalli and Mohit Bansal and Joyce Chai and Hao Tan},
  year={2025},
  journal={arXiv:2506.18890},
}