VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
This paper presents a novel method for building scalable 3D generative models from pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data: unlike images, text, or videos, 3D data are not readily accessible and are difficult to acquire, resulting in a significant disparity in scale compared to other modalities. To address this issue, we propose using a video diffusion model, trained on extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset and use it to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view samples, can generate a 3D asset from a single image in seconds and outperforms current state-of-the-art feed-forward 3D generative models, with users preferring our results over 90% of the time.
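To make the "single image in, 3D asset out in seconds" claim concrete, the sketch below contrasts feed-forward inference with optimization-based methods that fit each asset for minutes. The `model` object, the 224x224 preprocessing resolution, and the shape of the returned representation are illustrative assumptions here, not the released VFusion3D interface.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# The 224x224 input resolution is an assumption for illustration,
# not the model's documented preprocessing.
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def image_to_3d(model: torch.nn.Module, image_path: str) -> torch.Tensor:
    """One feed-forward pass from a single RGB image to a 3D representation."""
    image = Image.open(image_path).convert("RGB")
    pixels = preprocess(image).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        # A single forward pass, no per-asset optimization loop;
        # this is what makes generation take seconds rather than minutes.
        representation = model(pixels)
    return representation
```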
Overall pipeline
The pipeline of VFusion3D. We first use a small amount of 3D data to fine-tune a video diffusion model, transforming it into a multi-view video generator that serves as a data engine. By generating a large amount of synthetic multi-view data, we train VFusion3D to produce a 3D representation and render novel views.
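The two stages of this pipeline can be summarized in Python. In this sketch, `multiview_generator`, `renderer`, and the plain MSE reconstruction loss are hypothetical stand-ins for the fine-tuned video diffusion model, the model's rendering head, and the actual training objective, which the page does not specify.

```python
import torch
import torch.nn.functional as F

def build_synthetic_dataset(multiview_generator, conditions, views_per_asset=16):
    """Stage 1: the fine-tuned video diffusion model acts as a data engine.

    Each generated "video" is treated as an orbit around one object, so its
    frames form a posed multi-view image set.
    """
    dataset = []
    for condition in conditions:
        frames, cameras = multiview_generator(condition, num_frames=views_per_asset)
        dataset.append((frames, cameras))
    return dataset

def train_step(vfusion3d, renderer, frames, cameras, optimizer):
    """Stage 2: train the feed-forward model on the synthetic multi-view data."""
    rep = vfusion3d(frames[:1])          # condition on a single input view
    renders = renderer(rep, cameras)     # render every synthetic viewpoint
    loss = F.mse_loss(renders, frames)   # supervise renders against the data engine
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```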
Results
Generated Images (Text-Image-3D)
Single Image 3D Reconstruction
User Study
Scaling!
Scaling with the number of synthetic data
The left and right figures display the LPIPS and CLIP image similarity scores, respectively, as a function of dataset size. Generation quality consistently improves as the dataset size increases.
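For reference, both metrics in these plots can be computed with public packages. Below is a minimal sketch using the `lpips` package and OpenAI's `clip`; the AlexNet and ViT-B/32 backbones are assumed choices for illustration, not the paper's confirmed evaluation configuration.

```python
import torch
import lpips
import clip
from PIL import Image

lpips_fn = lpips.LPIPS(net="alex")                       # lower is better
clip_model, clip_preprocess = clip.load("ViT-B/32", device="cpu")

def lpips_distance(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    """Perceptual distance between two (1, 3, H, W) tensors scaled to [-1, 1]."""
    with torch.no_grad():
        return lpips_fn(img_a, img_b).item()

def clip_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity of CLIP image embeddings; higher is better."""
    with torch.no_grad():
        feats = [
            clip_model.encode_image(clip_preprocess(Image.open(p)).unsqueeze(0))
            for p in (path_a, path_b)
        ]
    a, b = (f / f.norm(dim=-1, keepdim=True) for f in feats)
    return (a @ b.T).item()
```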
Scaling with other factors
Han et al. VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models. arXiv preprint.
Acknowledgements
