Omnidata: (Steerable Datasets)
A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans
Ainaz Eftekhar*,
Alexander Sax*,
Roman Bachmann,
Jitendra Malik,
Amir Zamir
ICCV 2021
Why make a 3D scan → 2D multiview pipeline?
i. A means to train state-of-the-art models
Depth estimation on OASIS images. We trained the MiDaS DPT-Hybrid architecture (Ranftl et al. 2021), but only on a starter dataset generated by our pipeline (Omni).
Surface normals extracted from depth predictions. The high-resolution meshes in the starter dataset also seem to produce networks that make more precise shape predictions, as shown by the surface normal vectors extracted from the depth predictions in the bottom row.
Surface normal prediction on OASIS images. Neither model saw OASIS images during training.
Surface normal prediction on OASIS images. The Omnidata-trained model outperformed the baseline model trained on OASIS data itself.
For a complete list of labels and how they are produced, see the annotator GitHub. To try out the pretrained models, upload an image to a live demo or download the PyTorch weights. Or train your own model using the starter data and dataloaders.
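For illustration, here is a minimal sketch of running one of the released monocular models on a single image once the weights are downloaded. The model constructor, checkpoint filename, and preprocessing below are placeholders (assumptions), not the tooling repo's actual loading code; see the repo for the real entry points.

    # Illustrative only: build_dpt_hybrid_model(), the checkpoint name, and the
    # preprocessing are placeholders, not the released API.
    import torch
    from PIL import Image
    from torchvision import transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = build_dpt_hybrid_model()                      # hypothetical constructor for a DPT-Hybrid-style network
    state = torch.load("omnidata_normal_dpt_hybrid.pth",  # placeholder checkpoint filename
                       map_location=device)
    model.load_state_dict(state)
    model.to(device).eval()

    preprocess = transforms.Compose([
        transforms.Resize(384),
        transforms.CenterCrop(384),
        transforms.ToTensor(),
    ])

    img = preprocess(Image.open("example.png").convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        normals = model(img)   # e.g. a (1, 3, 384, 384) surface-normal prediction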
ii. Avenue to a dataset 'design guide'
By capturing as much information as possible and then parametrically resampling that data into 2D images, we can probe the effects of different sampling distributions and data domains. For example, previous research has identified various types of selection bias, such as photographer's bias and viewpoint bias. The choice of sensor (e.g. RGB vs. LIDAR) also affects what information is available to the model.
These choices have real impact. For example, selecting different aperture sizes (fields of view) changes the makeup of images (below and left), in effect making a dataset more or less object-centric. Play with some of these effects in our dataset design demo, or make one yourself with one of the one-line examples in our annotator quickstart.
Dataset field-of-view influences image content.
e.g. FoV is correlated with object-level focus.
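As a concrete, hypothetical sketch of this kind of parametric resampling: given per-view camera metadata emitted alongside each image, one could filter the generated views into a narrower-FoV, more object-centric subset. The metadata location ("point_info/*.json") and the "field_of_view_rads" key below are assumptions about the released format; check the data docs for the actual schema.

    # Hypothetical resampling sketch: keep only narrow-FoV (more object-centric) views.
    # The metadata layout ("point_info/*.json") and key name ("field_of_view_rads")
    # are assumptions, not a documented schema.
    import json
    import math
    from pathlib import Path

    def narrow_fov_views(dataset_dir, max_fov_deg=45.0):
        """Yield (view name, FoV in degrees) for views below the FoV threshold."""
        for meta_path in Path(dataset_dir, "point_info").glob("*.json"):
            with open(meta_path) as f:
                info = json.load(f)
            fov_deg = math.degrees(info["field_of_view_rads"])  # assumed key name
            if fov_deg <= max_fov_deg:
                yield meta_path.stem, fov_deg

    # Example: build a more object-centric subset by keeping views with FoV <= 45 degrees.
    subset = list(narrow_fov_views("./omnidata_starter_dataset/replica"))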
iii. Matched-pair analysis
ImageNet features are not always best (in vision, in robotics), but determining why ImageNet classification pretraining is often better than NYU depth-estimation pretraining is difficult because ImageNet has no depth labels. So the effect of the pretraining data cannot be separated from that of the pretraining task.
With the Omnidata pipeline, we can annotate the same dataset for both tasks in order to determine whether it is the classification task or the image distribution that is doing the heavy lifting. Similarly, the pipeline gives us full control over dataset generation, so we can determine the impact of single- vs. multi-view training and of using correspondences, geometry, camera pose, etc.
Cross-task comparisons are complicated by confounding factors. Comparing two pretrained models' utility for transfer learning is difficult when the two models were trained on disjoint datasets with different parameters: domains, numbers of images, sensor types, resolutions, etc.
iv. Large datasets, even for non-recognition (spatial) tasks
13 of 21 mid-level cues from the Annotator. Each label/cue is produced for each RGB view/point combination, and there are guaranteed to be 'k' views of each point.
For complete information about that starter dataset (including tools to download it), see the data docs.
Annotator Overview:
The annotator takes in one of the following inputs and generates a static vision dataset of multiple mid-level cues (21 in the first release).
Annotator: inputs and outputs.
The annotator generates images and videos of aligned mid-level cues, given an untextured mesh, a texture or aligned RGB images, and an optional pre-generated camera pose file. A 3D pointcloud can be used as well: simply mesh the pointcloud with a standard mesher such as COLMAP (result shown above).
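As a rough illustration of that meshing step, here is a short sketch using Open3D's Poisson surface reconstruction as one standard mesher (COLMAP, mentioned above, is another); the filenames are placeholders.

    # Sketch: turn a raw pointcloud into a mesh the annotator can consume.
    # Open3D's Poisson reconstruction is used purely as one standard option;
    # filenames are placeholders.
    import open3d as o3d

    pcd = o3d.io.read_point_cloud("scan_pointcloud.ply")
    pcd.estimate_normals()  # Poisson reconstruction requires oriented normals

    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
    o3d.io.write_triangle_mesh("scan_mesh.ply", mesh)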
Static views of camera/point combinations. Multi-view constraints guarantee at least k views of each point.
Videos of interpolated trajectories. The annotator can also generate videos by interpolating between cameras.
All mid-level cues are available for each frame. The following figure shows a few of these cues on a building from the Replica dataset.
5 of 21 outputs (video sampling).
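To give a sense of the camera interpolation behind the video sampling above, here is a generic sketch (not the annotator's actual implementation) that linearly interpolates positions and spherically interpolates orientations between two keyframe cameras:

    # Generic camera-trajectory interpolation (not the annotator's code):
    # lerp positions and slerp orientations between two keyframe cameras.
    import numpy as np
    from scipy.spatial.transform import Rotation, Slerp

    def interpolate_cameras(pos_a, quat_a, pos_b, quat_b, n_frames=30):
        """Return n_frames (position, quaternion) pairs from camera A to camera B."""
        times = np.linspace(0.0, 1.0, n_frames)
        slerp = Slerp([0.0, 1.0], Rotation.from_quat([quat_a, quat_b]))
        positions = (1 - times)[:, None] * np.asarray(pos_a, dtype=float) \
                    + times[:, None] * np.asarray(pos_b, dtype=float)
        return list(zip(positions, slerp(times).as_quat()))

    # Example: 30 in-between frames for a short fly-through segment.
    frames = interpolate_cameras([0, 0, 1.5], [0, 0, 0, 1],
                                 [2, 1, 1.5], [0, 0.7071, 0, 0.7071])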
How does the pipeline do this? It creates cameras, points, and views in 4 stages (below). For more information, check out the paper or annotator repo.
Omnidata Ecosystem:
Annotator
The annotator github contains examples, documentation, a Dockerized runnable container, and the raw code.
↪ Annotator GitHub
Starter Data
The omnitools CLI contains parallelized scripts to download and manipulate some or all of the 14-million-image starter dataset. These scripts can also be reused to manipulate data generated by the annotator.
↪ Starter Data
Tooling
The tooling repo contains many of the tools that we found useful during the project: PyTorch dataloaders for annotator-produced data, data transformations, training pipelines, and our reimplementation of MiDaS.
↪ Tooling Github
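For a flavor of how annotator output might be consumed, here is a minimal sketch of a PyTorch dataset over two cues. It is not the tooling repo's actual dataloader; the folder layout and matching-filename assumption are placeholders, so check the data docs for the real format.

    # Minimal illustrative PyTorch Dataset over annotator output; the real
    # dataloaders in the tooling repo handle many more cues and formats.
    # Assumed layout: <root>/rgb/*.png and <root>/normal/*.png with matching names.
    from pathlib import Path
    from PIL import Image
    from torch.utils.data import Dataset, DataLoader
    from torchvision import transforms

    class RGBNormalDataset(Dataset):
        def __init__(self, root, size=384):
            self.rgb_paths = sorted(Path(root, "rgb").glob("*.png"))
            self.normal_dir = Path(root, "normal")
            self.to_tensor = transforms.Compose(
                [transforms.Resize((size, size)), transforms.ToTensor()])

        def __len__(self):
            return len(self.rgb_paths)

        def __getitem__(self, idx):
            rgb_path = self.rgb_paths[idx]
            normal_path = self.normal_dir / rgb_path.name  # assumed matching filenames
            rgb = self.to_tensor(Image.open(rgb_path).convert("RGB"))
            normal = self.to_tensor(Image.open(normal_path).convert("RGB"))
            return rgb, normal

    # Example usage: loader = DataLoader(RGBNormalDataset("./replica_out"), batch_size=8)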