Unsupervised Learning of 3D Object Categories from Videos in the Wild
Overview
We present a novel deep architecture, built around a Warp-Conditioned Ray Embedding (WCR), that reconstructs and renders new views (right) of object categories from one or a few input images (middle). The model is learned automatically from videos of the objects (left) and works on difficult real data where competing architectures fail to produce good results.
Method
Our method takes an image as input and produces per-pixel features using a U-Net. We then shoot rays from a target view and retrieve the corresponding per-pixel features from one or more source images. The spatial feature vectors are aggregated into a single feature vector, combined with harmonic embeddings of the sample locations, and passed to an MLP that yields per-location colors and opacities. Finally, differentiable raymarching composites these samples into the rendered image.
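Two of these building blocks can be sketched concisely: the harmonic (positional) embedding of sample locations and the emission-absorption compositing used by the differentiable raymarcher. The NumPy snippet below is a minimal illustration of both under common NeRF-style conventions; the function names, shapes, and number of frequencies are our assumptions, not the paper's actual implementation.

```python
import numpy as np

def harmonic_embedding(x, n_freqs=6):
    """Map D-dim coordinates to sin/cos harmonics at frequencies 2^k.

    x: array of shape (..., D). Returns shape (..., 2 * D * n_freqs).
    (NeRF-style positional encoding; n_freqs=6 is an illustrative choice.)
    """
    freqs = 2.0 ** np.arange(n_freqs)            # (F,)
    angles = x[..., None] * freqs                # (..., D, F)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*x.shape[:-1], -1)        # flatten (D, 2F) -> (2*D*F,)

def raymarch(colors, opacities, deltas):
    """Emission-absorption compositing of N samples along one ray.

    colors: (N, 3), opacities: (N,), deltas: (N,) inter-sample distances.
    Returns the composited RGB color of shape (3,).
    """
    alpha = 1.0 - np.exp(-opacities * deltas)    # per-sample opacity in [0, 1]
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans                      # (N,) compositing weights
    return (weights[:, None] * colors).sum(axis=0)
```

In the full pipeline, the MLP would consume the aggregated source-image feature together with the harmonic embedding of each ray sample and emit the `colors` and `opacities` that `raymarch` then composites; all operations are differentiable, so gradients flow from the rendered pixels back to the feature extractor.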
Dataset
To study the learning of 3D object categories in the wild, we crowd-sourced a large collection of object videos via Amazon Mechanical Turk (AMT).
Results
Single image reconstruction
| Source image | Mesh | Voxel | Voxel+MLP | MLP | Ours |
|---|---|---|---|---|---|
(Qualitative comparison: novel views rendered from a single source image by each method.)
Multi-view reconstruction
Donut: reconstruction quality as a function of the number of source images used for reconstruction.

Hydrant: reconstruction quality as a function of the number of source images used for reconstruction.
BibTeX
@inproceedings{henzler2021unsupervised,
  author    = {Henzler, Philipp and Reizenstein, Jeremy and Labatut, Patrick and Shapovalov, Roman and Ritschel, Tobias and Vedaldi, Andrea and Novotny, David},
  title     = {Unsupervised Learning of 3D Object Categories from Videos in the Wild},
  booktitle = {CVPR},
  year      = {2021},
}