DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou,
Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, Hao Zhao
ICCV 2025
DiST-4D is the first framework to achieve feed-forward dynamic 4D driving scene generation with both temporal extrapolation and spatial novel view synthesis.
DiST-4D is a disentangled spatiotemporal diffusion framework for 4D driving scene generation, leveraging metric depth as the core geometric representation to enable both temporal extrapolation and spatial novel view synthesis (NVS). (Top: Temporal Generation) DiST-T employs a diffusion model to predict future multi-camera RGB-D sequences from historical multi-camera images and control signals. The generated RGB-D sequences are then aggregated into point clouds, allowing for bullet time rendering. (Bottom: Spatial Generation) To enable spatial NVS, DiST-S leverages the predicted RGB-D sequences to generate novel viewpoints by first projecting them into sparse conditions and then refining them into dense RGB-D outputs.
- [2025/7]: Code and pre-trained weights are released.
- [2025/6]: Paper is accepted to ICCV 2025.
- [2025/3]: Check out our other latest works on generative world models: UniScene, MuDG, HERMES.
- [2025/3]: Paper is on arXiv.
- [2025/3]: Demo is released on the Project Page.
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
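For intuition, the role of metric depth can be summarized by standard depth-based warping between cameras, written here as a rough schematic (the notation is ours, not the paper's: $K_s, K_t$ are pinhole intrinsics, $T_{t \leftarrow s}$ the relative camera pose, $\tilde{p}_s$ a homogeneous pixel, and $D_s$ the predicted metric depth):

$$
\tilde{p}_t \;\sim\; K_t \, T_{t \leftarrow s} \left( D_s(p_s)\, K_s^{-1}\, \tilde{p}_s \right)
$$

Because $D_s$ is metric, predictions from different cameras and timestamps share a common scale, which is what allows DiST-T's RGB-D outputs to be aggregated into point clouds and reprojected as the sparse conditions that DiST-S refines; the forward-backward rendering constraint applies this warping from an observed view to a pseudo novel viewpoint and back, so that DiST-S can be trained on existing viewpoints only.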
- Download the nuScenes dataset: download all splits of Trainval in the Full dataset (v1.0) following the official nuScenes instructions and put them in ./data/nuscenes.
- Download the advanced_12Hz_trainval metadata from MagicDrive and organize the data as follows:

```text
./data/nuscenes
├── samples
├── sweeps
├── ...
├── v1.0-trainval
└── advanced_12Hz_trainval
```
- Download the metadata (preprocessing pkl files) following MagicDriveDiT and put them in ./data/nuscenes_mmdet3d-12Hz/.
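If the raw nuScenes download already lives elsewhere on disk, a symlink is a simple way to match the expected layout (a minimal sketch; the source path is a placeholder):

```bash
# sketch: link an existing nuScenes copy into the expected location (source path is a placeholder)
mkdir -p ./data
ln -s /path/to/nuscenes ./data/nuscenes
ls ./data/nuscenes   # should show samples/, sweeps/, v1.0-trainval/, advanced_12Hz_trainval/
```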
- Conda environments (see the setup sketch below):
  - pre_py39: depth_process/requirement/pre_py39.txt, which is modified from DepthLab.
  - mvs_py38: depth_process/requirement/mvs_py38.txt, which is used for semantic segmentation in visual reconstruction.
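A minimal way to create these environments from the requirement files (a sketch; the Python versions are assumptions inferred from the environment names):

```bash
# sketch: create the preprocessing environments (Python versions inferred from the names)
conda create -n pre_py39 python=3.9 -y
conda run -n pre_py39 pip install -r depth_process/requirement/pre_py39.txt

conda create -n mvs_py38 python=3.8 -y
conda run -n mvs_py38 pip install -r depth_process/requirement/mvs_py38.txt
```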
- Download the DepthLab checkpoints and place them in depth_process/DepthLab/checkpoints.
- Prepare the SegFormer checkpoint in depth_process/visualrecon/SemSeg/SegFormer and the OneFormer checkpoint in depth_process/visualrecon/SemSeg/OneFormer.
- Preprocess one scene in nuScenes (for example, indice = 1):

```bash
# MVS part
cd depth_process
./visualrecon/mvs_pipe_nus.sh $indice
# DepthLab depth refinement
./DepthLab/scripts/infer_nus_video_mp.sh $indice
```

Note: to preprocess all scenes, you need to run the above scripts for indice = 0 to 849, which is time-consuming. Since the MVS code already uses multi-processing internally, we recommend looping over the scenes instead, e.g. with depth_process/visualrecon/batch_mvs_nus.sh (see the sketch below).
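A minimal loop over all scene indices, in the spirit of depth_process/visualrecon/batch_mvs_nus.sh (a sketch; running both steps in one sequential loop is an assumption, and the provided batch script may parallelize the work differently):

```bash
# sketch: preprocess all nuScenes scenes (indices 0..849) one after another
cd depth_process
for indice in $(seq 0 849); do
  ./visualrecon/mvs_pipe_nus.sh "$indice"             # MVS part
  ./DepthLab/scripts/infer_nus_video_mp.sh "$indice"  # DepthLab depth refinement
done
```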
- Semantic segmentation with OneFormer from scene-0 to scene-849, using 8 GPUs and mp=1:

```bash
cd depth_process/visualrecon
./Batch_SemSeg_nus_oneformer.sh 0 849 8 1
# We recommend using multiple processes or nodes for this task, e.g.:
#./Batch_SemSeg_nus_oneformer.sh 0 99 8 1
#./Batch_SemSeg_nus_oneformer.sh 100 199 8 1
# ...
```

The processed depth and semantics are stored in depth_process/nus_Rdepth (a scene takes up about 3 GB).
We have also provided some preprocessed scenes from the validation set to facilitate testing. Please download these scenes from Hugging Face.
- Conda environment:
  - distt: DiST_T/requirement/distt.txt, which is modified from MagicDriveDiT.
- Prepare the CogVideoX VAE and T5 encoder following MagicDriveDiT and put them in depth_process/DepthLab/checkpoints.
- Training and inference code:

```bash
# Train
cd DiST_T
./train_dist_mm_424_onlyRGB.sh

# Infer
cd DiST_T
./infer_dist_dataset_fullval.sh
```

- Download the pretrained DiST-T ckpt and put it in DiST_T/ckpt/outputs_424_onlyRGB/.
- Run the inference code:

```bash
./infer_dist_dataset_fullval.sh
```

- Conda environment:
  - dists: DiST_S/requirement/dists.txt, which is modified from FreeVS.
- Data processing:

```bash
cd DiST_S

# Train set: forward or backward projection (currently only +2 frames)
./data_process/batch_reproj_nus_train_2frame.sh $start_idx $end_idx
# For example:
#./data_process/batch_reproj_nus_train_2frame.sh 0 700

# Val set: lateral camera move (+1m, +2m, +4m)
./data_process/batch_reproj_nus_val_movecam.sh $start_idx $end_idx
# For example:
#./data_process/batch_reproj_nus_val_movecam.sh 0 150
```

Note: we recommend using multiple processes for the data processing; a simple way to split the work is sketched below. The processed conditions are stored in DiST_S/reproj_nus.
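For example, the train-set projection can be split across a couple of background workers (a sketch; the split points are arbitrary and assume the script takes inclusive start/end indices as in the examples above):

```bash
# sketch: split the train-set projection across two background workers
cd DiST_S
./data_process/batch_reproj_nus_train_2frame.sh 0 349 &
./data_process/batch_reproj_nus_train_2frame.sh 350 700 &
wait   # block until both workers finish
```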
- Download the SVD checkpoint from https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt and put it in DiST_S/pretrained (see the sketch below).
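One way to fetch the weights is the huggingface_hub CLI (a sketch; the target subdirectory name under DiST_S/pretrained is an assumption, so adjust it to whatever the DiST-S configs expect):

```bash
# sketch: download SVD-XT into DiST_S/pretrained (subfolder name is assumed)
pip install -U "huggingface_hub[cli]"
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt \
  --local-dir DiST_S/pretrained/stable-video-diffusion-img2vid-xt
```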
- Train stage 1:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_train_nus_768.sh
```

- SCC data processing:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_infer_nus_cycle_reproj.sh
```

- Train stage 2 with the SCC data:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_train_nus_modVis.sh
```

- Final inference after SCC training:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_infer_nus_mp_768_modVis.sh
```

- Run the data processing for the val set:

```bash
# lateral camera move (+1m, +2m, +4m)
./data_process/batch_reproj_nus_val_movecam.sh $start_idx $end_idx
```

- Download the pretrained DiST-S ckpt and put it in DiST_S/ckpt/mm_multi_cam_hybridD_valid_mask_768_mod_cycle_p.
- Run the inference code:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_infer_nus_mp_768_modVis.sh
```

We would like to thank the contributors of the repositories this project builds on (e.g., MagicDrive, MagicDriveDiT, DepthLab, FreeVS, SegFormer, OneFormer, and Stable Video Diffusion) for their valuable contributions to the community.
If you find our paper and code useful for your research, please consider citing:
@article{guo2024dist4d,
title={DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation},
author={Jiazhe Guo and Yikang Ding and Xiwu Chen and Shuo Chen and Bohan Li and Yingshuang Zou and Xiaoyang Lyu and Feiyang Tan and Xiaojuan Qi and Zhiheng Li and Hao Zhao},
journal={arXiv preprint arXiv:2503.15208},
year={2025}
}