MultiDiff: Consistent Novel View Synthesis from a Single Image

CVPR 2024

Norman Müller¹, Katja Schwarz¹, Barbara Roessle², Lorenzo Porzi¹, Samuel Rota Bulò¹, Matthias Nießner², Peter Kontschieder¹

¹Meta Reality Labs, ²Technical University of Munich

MultiDiff enables camera-motion control for scene-level novel view synthesis. Given a single RGB image and a camera trajectory of choice, the model generates 3D-consistent views extrapolating from the input image.
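A camera trajectory like the one mentioned above can be written as a sequence of 4×4 camera-to-world poses. The sketch below is illustrative only: `make_trajectory`, its parameters, and the axis convention (+z forward, y vertical) are assumptions made for this example and are not taken from the MultiDiff code or paper.

```python
import numpy as np


def yaw_rotation(angle_rad: float) -> np.ndarray:
    """Rotation about the camera's vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def make_trajectory(num_frames: int, forward_step: float, yaw_step_deg: float) -> np.ndarray:
    """Return (num_frames, 4, 4) camera-to-world poses (hypothetical helper).

    Frame 0 is the identity, i.e. the reference view. Each later frame moves
    `forward_step` along the current viewing direction (+z) and turns by
    `yaw_step_deg` degrees.
    """
    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    # Per-frame motion expressed in the local camera frame.
    step = np.eye(4)
    step[:3, :3] = yaw_rotation(np.deg2rad(yaw_step_deg))
    step[:3, 3] = [0.0, 0.0, forward_step]
    for i in range(1, num_frames):
        # Right-multiplication composes the step in the current camera frame.
        poses[i] = poses[i - 1] @ step
    return poses


trajectory = make_trajectory(num_frames=8, forward_step=0.1, yaw_step_deg=5.0)
```

Each pose in `trajectory` would then describe one target view relative to the input image.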

Abstract

We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation, which are prone to drift and error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results, even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel, structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.

Method

MultiDiff leverages strong depth and video diffusion priors to enable consistent novel view synthesis of scenes from a single RGB image using a novel correspondence attention layer.

Novel-view rendering results following the ground-truth (GT) camera trajectory.

By warping the initial noise into the target novel views according to the estimated depth, we structure the noise so that it carries additional information about the 3D scene. Just like Neo in "The Matrix", the model can decode this abstract noise pattern into more consistent views.
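The idea of depth-based noise warping can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with intrinsics `K` shared by both views, nearest-pixel forward splatting, and fresh Gaussian noise for target pixels that no reference pixel reaches. The function name `warp_noise` and all of its parameters are hypothetical.

```python
import numpy as np


def warp_noise(noise, depth, K, T_ref_to_tgt, rng):
    """Forward-warp a reference noise map into a target view (sketch).

    noise:        (H, W, C) Gaussian noise of the reference view
    depth:        (H, W) positive depth per reference pixel
    K:            (3, 3) pinhole intrinsics, assumed shared by both views
    T_ref_to_tgt: (4, 4) rigid transform from reference to target camera frame
    Returns the warped noise (H, W, C) and a boolean mask of filled pixels.
    """
    H, W, C = noise.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project reference pixels to 3D, then move them into the target frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts = T_ref_to_tgt[:3, :3] @ pts + T_ref_to_tgt[:3, 3:4]

    # Project into the target image and round to the nearest pixel.
    z = pts[2]
    front = z > 1e-6
    safe_z = np.where(front, z, 1.0)
    proj = K @ pts
    ut = np.rint(proj[0] / safe_z).astype(int)
    vt = np.rint(proj[1] / safe_z).astype(int)
    valid = front & (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)

    # Resolve collisions: sort nearest first, keep the first hit per target pixel.
    idx = np.flatnonzero(valid)
    idx = idx[np.argsort(z[idx], kind="stable")]
    flat_tgt = vt[idx] * W + ut[idx]
    tgt_pixels, first = np.unique(flat_tgt, return_index=True)

    # Unreached pixels keep independent Gaussian noise.
    out = rng.standard_normal((H * W, C))
    out[tgt_pixels] = noise.reshape(-1, C)[idx[first]]
    hit = np.zeros(H * W, dtype=bool)
    hit[tgt_pixels] = True
    return out.reshape(H, W, C), hit.reshape(H, W)
```

Because every target pixel receives either one copied reference sample or one fresh sample, the warped map stays per-pixel Gaussian in this sketch, while pixels that see the same surface point share the same noise value across views.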

By masking areas in the input image, MultiDiff naturally enables consistent editing without the need for finetuning.

BibTeX

@InProceedings{Muller_2024_CVPR,
    author    = {M\"uller, Norman and Schwarz, Katja and R\"ossle, Barbara and Porzi, Lorenzo and Bul\`o, Samuel Rota and Nie{\ss}ner, Matthias and Kontschieder, Peter},
    title     = {MultiDiff: Consistent Novel View Synthesis from a Single Image},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {10258--10268}
}
