Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
TL;DR
- A new single-view scene reconstruction method that infers faithful scene and object geometry from partial visual observations.
- A VL modulation module that enriches per-point features with fine-grained semantics from visual and text features.
- A VL spatial attention mechanism that aggregates point representations across the scene, yielding accurate predictions that are aware of the neighboring 3D semantic context.
Overview
Given an input image \(\mathbf{I}_{0}\), we use two image encoders to obtain the features \(F_{\text{app}}\) and \(F_{\text{vis}}\), and fuse these into a feature map \(F_{\text{fused}}\). We further extract category-level text features and a segmentation map \(S\). For a given 3D point set \(\mathbf{X}\), we query the extracted features by projecting the points onto the image plane, yielding point-wise visual and text features. Next, the VL modulation layers endow each point representation with fine-grained semantic information. Finally, the VL spatial attention aggregates these point representations across the 3D scene, yielding density predictions that are aware of the 3D semantic context.
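To make this pipeline concrete, here is a minimal PyTorch sketch of the per-point computation: \(\mathbf{X}\) is projected onto the image plane to sample \(F_{\text{fused}}\), a FiLM-style layer stands in for the VL modulation, and plain multi-head self-attention stands in for the VL spatial attention. All module names, feature dimensions, and toy tensors are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def query_point_features(feat_map, points, K):
    """Project 3D points into the image and bilinearly sample a feature map.
    feat_map: (1, C, H, W) fused visual features F_fused.
    points:   (N, 3) 3D points X in camera coordinates (z > 0).
    K:        (3, 3) camera intrinsics.
    Returns:  (N, C) point-wise features (zeros for points projecting off-image).
    """
    uvw = points @ K.T                                  # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)        # pixel coordinates
    H, W = feat_map.shape[-2:]
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat_map, grid.view(1, -1, 1, 2), align_corners=True)
    return sampled.view(feat_map.shape[1], -1).T        # (N, C)

class VLModulation(nn.Module):
    """FiLM-style conditioning: semantic features predict a per-channel scale
    and shift that modulate each point feature (an assumed realization of the
    VL modulation)."""
    def __init__(self, dim_point, dim_sem):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim_sem, 2 * dim_point)

    def forward(self, point_feat, sem_feat):
        scale, shift = self.to_scale_shift(sem_feat).chunk(2, dim=-1)
        return point_feat * (1 + scale) + shift

class VLSpatialAttention(nn.Module):
    """Self-attention over the point set so each density prediction can
    attend to its 3D semantic neighborhood."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_density = nn.Linear(dim, 1)

    def forward(self, point_feat):                      # (N, dim)
        x = point_feat.unsqueeze(0)                     # (1, N, dim)
        x, _ = self.attn(x, x, x)
        return self.to_density(x).squeeze(0)            # (N, 1) density logits

# Toy usage with random tensors standing in for the encoder outputs.
N, C, D = 1024, 64, 32
feat_map = torch.randn(1, C, 48, 160)                   # F_fused (hypothetical size)
K = torch.tensor([[100.0, 0.0, 80.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
points = torch.rand(N, 3) * torch.tensor([4.0, 2.0, 10.0]) - torch.tensor([2.0, 1.0, -0.5])
text_feat = torch.randn(N, D)                           # per-point text features via S

pt = query_point_features(feat_map, points, K)          # point-wise visual features
pt = VLModulation(C, D)(pt, text_feat)                  # inject fine-grained semantics
sigma = VLSpatialAttention(C)(pt)                       # context-aware density

Note that full self-attention over all \(N\) points scales quadratically; the actual method may restrict attention to local 3D neighborhoods, which this sketch does not attempt to reproduce.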
Visual Comparisons
[Interactive comparison sliders: Ours vs. BTS [Wimbauer, 2023] · Ours vs. PixelNeRF [Yu, 2021] · Ours vs. MonoDepth2 [Godard, 2019]]
Scene Reconstruction
Whereas previous methods often produce corrupted and trailing shapes, our method recovers faithful scene geometry, especially in occluded areas.
Object Reconstruction
Our method produces more faithful object geometries across various semantic categories.
BibTeX
@inproceedings{li2024know,
  title={Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning},
  author={Li, Rui and Fischer, Tobias and Segu, Mattia and Pollefeys, Marc and Van Gool, Luc and Tombari, Federico},
  booktitle={CVPR},
  year={2024}
}