Abstract
Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information.
Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data.
However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions,
(2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency.
We address these challenges by proposing LiftImage3D, a framework that unlocks LVDMs' generative priors while ensuring 3D consistency.
Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with small, controllable motions.
We then use a robust neural matching model, MASt3R, to calibrate the camera poses of the generated frames and produce the corresponding point clouds.
Finally, we propose a distortion-aware 3D Gaussian splatting representation, which can learn independent distortions between frames and output undistorted canonical Gaussians.
Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on two challenging datasets,
i.e. LLFF and DL3DV, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.
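The decomposition idea behind the articulated trajectory strategy can be illustrated with a small sketch. It is not the authors' implementation: the function name `split_trajectory`, its parameters, and the restriction to a single yaw angle are assumptions made for illustration. It only shows how one large camera rotation can be split into consecutive clips that each cover a small motion, with every clip starting where the previous one ended.

```python
import math


def split_trajectory(total_angle_deg, max_clip_deg, frames_per_clip):
    """Split one large yaw rotation into clips of small rotations.

    Hypothetical helper for illustration only.
    Returns a list of clips; each clip is a list of absolute yaw
    angles (in degrees), one per generated frame.
    """
    if max_clip_deg <= 0 or frames_per_clip < 1:
        raise ValueError("max_clip_deg must be > 0 and frames_per_clip >= 1")

    # Number of clips needed so that no clip exceeds max_clip_deg.
    n_clips = max(1, math.ceil(abs(total_angle_deg) / max_clip_deg))
    clip_span = total_angle_deg / n_clips   # signed rotation covered by one clip
    step = clip_span / frames_per_clip      # signed rotation between frames

    clips = []
    for c in range(n_clips):
        start = c * clip_span               # last angle of the previous clip
        clips.append([start + step * (i + 1) for i in range(frames_per_clip)])
    return clips


if __name__ == "__main__":
    for clip in split_trajectory(30.0, 10.0, 5):
        print(clip)
```

With a 30 degree target and a 10 degree limit per clip, the sketch yields three clips of five frames each, and the last frame of the last clip sits at the full 30 degrees. In a real pipeline the start of each clip would be a generated frame rather than an angle.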
Framework
The overall pipeline of LiftImage3D. First, we extend the LVDM to generate diverse video clips from a single image using an
articulated camera trajectory strategy. Then all generated frames are matched using the robust neural matching module and registered
into a point cloud. After that, we initialize Gaussians from the registered point clouds and construct a distortion field to model the independent
distortion of each video frame relative to the canonical 3DGS.
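The relationship between the canonical Gaussians and the per-frame distortions can be pictured with a toy sketch. It rests on strong assumptions: the class `ToyDistortionModel` is hypothetical, it stores a plain per-frame offset table instead of a learned distortion field, and it covers only Gaussian centers, not covariance, opacity, or color. The point it illustrates is that each frame sees the canonical centers plus its own distortion, while the final output is the canonical set with no distortion applied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ToyDistortionModel:
    """Canonical Gaussian centers plus independent per-frame offsets.

    Hypothetical stand-in for a distortion field, for illustration only.
    """
    canonical: List[Vec3]
    offsets: Dict[int, List[Vec3]] = field(default_factory=dict)

    def centers_for_frame(self, frame_id: int) -> List[Vec3]:
        """Centers as seen by one frame: canonical plus that frame's offsets."""
        per_point = self.offsets.get(frame_id)
        if per_point is None:
            # No distortion stored for this frame: fall back to canonical.
            return list(self.canonical)
        if len(per_point) != len(self.canonical):
            raise ValueError("one offset per canonical center is required")
        return [
            (x + dx, y + dy, z + dz)
            for (x, y, z), (dx, dy, dz) in zip(self.canonical, per_point)
        ]

    def canonical_centers(self) -> List[Vec3]:
        """Undistorted output: the canonical centers, offsets ignored."""
        return list(self.canonical)
```

During fitting, each frame would be compared against its own distorted copy, so that frame-specific errors are absorbed by the offsets rather than by the shared canonical centers.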
Single Image to 3D Scene That Can Be Dragged Freely
Citation
@misc{chen2024liftimage3d,
title={LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors},
author={Yabo Chen and Chen Yang and Jiemin Fang and Xiaopeng Zhang and Lingxi Xie and Wei Shen and Wenrui Dai and Hongkai Xiong and Qi Tian},
year={2024},
eprint={2412.09597},
archivePrefix={arXiv},
primaryClass={cs.CV}
}