DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou,
Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, Hao Zhao
ICCV 2025
DiST-4D is the first framework to achieve feed-forward dynamic 4D driving scene generation with both temporal extrapolation and spatial novel view synthesis.
DiST-4D is a disentangled spatiotemporal diffusion framework for 4D driving scene generation, leveraging metric depth as the core geometric representation to enable both temporal extrapolation and spatial novel view synthesis (NVS). (Top: Temporal Generation) DiST-T employs a diffusion model to predict future multi-camera RGB-D sequences from historical multi-camera images and control signals. The generated RGB-D sequences are then aggregated into point clouds, allowing for bullet time rendering. (Bottom: Spatial Generation) To enable spatial NVS, DiST-S leverages the predicted RGB-D sequences to generate novel viewpoints by first projecting them into sparse conditions and then refining them into dense RGB-D outputs.
- [2025/7]: Code and pre-trained weights are released.
- [2025/6]: Paper is accepted to ICCV 2025.
- [2025/3]: Check out our other latest works on generative world models: UniScene, MuDG, HERMES.
- [2025/3]: Paper is on arXiv.
- [2025/3]: Demo is released on the Project Page.
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
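For intuition, the role of metric depth can be summarized by standard depth-based warping between cameras, written here as a rough schematic (the notation is ours, not the paper's: $K_s, K_t$ are pinhole intrinsics, $T_{t \leftarrow s}$ the relative camera pose, $\tilde{p}_s$ a homogeneous pixel, and $D_s$ the predicted metric depth):

$$
\tilde{p}_t \;\sim\; K_t \, T_{t \leftarrow s} \left( D_s(p_s)\, K_s^{-1}\, \tilde{p}_s \right)
$$

Because $D_s$ is metric, predictions from different cameras and timestamps share a common scale, which is what allows DiST-T's RGB-D outputs to be aggregated into point clouds and reprojected as the sparse conditions that DiST-S refines; the forward-backward rendering constraint applies this warping from an observed view to a pseudo novel viewpoint and back, so that DiST-S can be trained on existing viewpoints only.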
- Download the nuScenes dataset: download all splits of Trainval in the Full dataset (v1.0) following the official nuScenes instructions and put them in ./data/nuscenes.
- Download the advanced_12Hz_trainval metadata from MagicDrive and organize the data as follows:

```text
./data/nuscenes
├── samples
├── sweeps
├── ...
├── v1.0-trainval
└── advanced_12Hz_trainval
```
- Download the metadata (preprocessing pkl files) following MagicDriveDiT and put them in ./data/nuscenes_mmdet3d-12Hz/.
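If the raw nuScenes download already lives elsewhere on disk, a symlink is a simple way to match the expected layout (a minimal sketch; the source path is a placeholder):

```bash
# sketch: link an existing nuScenes copy into the expected location (source path is a placeholder)
mkdir -p ./data
ln -s /path/to/nuscenes ./data/nuscenes
ls ./data/nuscenes   # should show samples/, sweeps/, v1.0-trainval/, advanced_12Hz_trainval/
```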
- Conda environments (see the setup sketch below):
  - pre_py39: depth_process/requirement/pre_py39.txt, which is modified from DepthLab.
  - mvs_py38: depth_process/requirement/mvs_py38.txt, which is used for semantic segmentation in visual reconstruction.
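A minimal way to create these environments from the requirement files (a sketch; the Python versions are assumptions inferred from the environment names):

```bash
# sketch: create the preprocessing environments (Python versions inferred from the names)
conda create -n pre_py39 python=3.9 -y
conda run -n pre_py39 pip install -r depth_process/requirement/pre_py39.txt

conda create -n mvs_py38 python=3.8 -y
conda run -n mvs_py38 pip install -r depth_process/requirement/mvs_py38.txt
```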
- Download the DepthLab checkpoints and place them in depth_process/DepthLab/checkpoints.
- Prepare the SegFormer checkpoint in depth_process/visualrecon/SemSeg/SegFormer and the OneFormer checkpoint in depth_process/visualrecon/SemSeg/OneFormer.
- Preprocess one scene in nuScenes (for example, indice = 1):

```bash
# MVS part
cd depth_process
./visualrecon/mvs_pipe_nus.sh $indice
# DepthLab depth refinement
./DepthLab/scripts/infer_nus_video_mp.sh $indice
```

Note: to preprocess all scenes, you need to run the above scripts for indice = 0 to 849, which is time-consuming. Since the MVS code already uses multi-processing internally, we recommend looping over the scenes instead, e.g. with depth_process/visualrecon/batch_mvs_nus.sh (see the sketch below).
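A minimal loop over all scene indices, in the spirit of depth_process/visualrecon/batch_mvs_nus.sh (a sketch; running both steps in one sequential loop is an assumption, and the provided batch script may parallelize the work differently):

```bash
# sketch: preprocess all nuScenes scenes (indices 0..849) one after another
cd depth_process
for indice in $(seq 0 849); do
  ./visualrecon/mvs_pipe_nus.sh "$indice"             # MVS part
  ./DepthLab/scripts/infer_nus_video_mp.sh "$indice"  # DepthLab depth refinement
done
```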
- Semantic segmentation with OneFormer from scene-0 to scene-849, using 8 GPUs and mp=1:

```bash
cd depth_process/visualrecon
./Batch_SemSeg_nus_oneformer.sh 0 849 8 1
# We recommend using multiple processes or nodes for this task, e.g.:
#./Batch_SemSeg_nus_oneformer.sh 0 99 8 1
#./Batch_SemSeg_nus_oneformer.sh 100 199 8 1
# ...
```

The processed depth and semantics are stored in depth_process/nus_Rdepth (a scene takes up about 3 GB).
We have also provided some preprocessed scenes from the validation set to facilitate testing. Please download these scenes from Hugging Face.
- Conda environment:
  - distt: DiST_T/requirement/distt.txt, which is modified from MagicDriveDiT.
- Prepare the CogVideoX VAE and T5 encoder following MagicDriveDiT and put them in depth_process/DepthLab/checkpoints.
- Training and inference code:

```bash
# Train
cd DiST_T
./train_dist_mm_424_onlyRGB.sh

# Infer
cd DiST_T
./infer_dist_dataset_fullval.sh
```

- Download the pretrained DiST-T ckpt and put it in DiST_T/ckpt/outputs_424_onlyRGB/.
- Run the inference code:

```bash
./infer_dist_dataset_fullval.sh
```

- Conda environment:
  - dists: DiST_S/requirement/dists.txt, which is modified from FreeVS.
- Data processing:

```bash
cd DiST_S

# Train set: forward or backward projection (currently only +2 frames)
./data_process/batch_reproj_nus_train_2frame.sh $start_idx $end_idx
# For example:
#./data_process/batch_reproj_nus_train_2frame.sh 0 700

# Val set: lateral camera move (+1m, +2m, +4m)
./data_process/batch_reproj_nus_val_movecam.sh $start_idx $end_idx
# For example:
#./data_process/batch_reproj_nus_val_movecam.sh 0 150
```

Note: we recommend using multiple processes for the data processing; a simple way to split the work is sketched below. The processed conditions are stored in DiST_S/reproj_nus.
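For example, the train-set projection can be split across a couple of background workers (a sketch; the split points are arbitrary and assume the script takes inclusive start/end indices as in the examples above):

```bash
# sketch: split the train-set projection across two background workers
cd DiST_S
./data_process/batch_reproj_nus_train_2frame.sh 0 349 &
./data_process/batch_reproj_nus_train_2frame.sh 350 700 &
wait   # block until both workers finish
```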
- Download the SVD checkpoint from https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt and put it in DiST_S/pretrained (see the sketch below).
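One way to fetch the weights is the huggingface_hub CLI (a sketch; the target subdirectory name under DiST_S/pretrained is an assumption, so adjust it to whatever the DiST-S configs expect):

```bash
# sketch: download SVD-XT into DiST_S/pretrained (subfolder name is assumed)
pip install -U "huggingface_hub[cli]"
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt \
  --local-dir DiST_S/pretrained/stable-video-diffusion-img2vid-xt
```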
- Train stage 1:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_train_nus_768.sh
```

- SCC data processing:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_infer_nus_cycle_reproj.sh
```

- Train stage 2 with the SCC data:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_train_nus_modVis.sh
```

- Final inference after SCC training:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_infer_nus_mp_768_modVis.sh
```

- Run the data processing for the val set:

```bash
# lateral camera move (+1m, +2m, +4m)
./data_process/batch_reproj_nus_val_movecam.sh $start_idx $end_idx
```

- Download the pretrained DiST-S ckpt and put it in DiST_S/ckpt/mm_multi_cam_hybridD_valid_mask_768_mod_cycle_p.
- Run the inference code:

```bash
cd DiST_S/diffusers/
./examples/scripts/mm_infer_nus_mp_768_modVis.sh
```

We would like to thank the contributors of the repositories this project builds on (e.g., MagicDrive, MagicDriveDiT, DepthLab, FreeVS, SegFormer, OneFormer, and Stable Video Diffusion) for their valuable contributions to the community.
If you find our paper and code useful for your research, please consider citing:
@article{guo2024dist4d,
title={DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation},
author={Jiazhe Guo and Yikang Ding and Xiwu Chen and Shuo Chen and Bohan Li and Yingshuang Zou and Xiaoyang Lyu and Feiyang Tan and Xiaojuan Qi and Zhiheng Li and Hao Zhao},
journal={arXiv preprint arXiv:2503.15208},
year={2025}
}