Zero-Shot Text-Guided Object Generation with Dream Fields
CVPR 2022 and AI4CC 2022 (Best Poster)
Ajay Jain
UC Berkeley, Google Research
Ben Mildenhall
Google Research
Jonathan T. Barron
Google Research
Pieter Abbeel
UC Berkeley
Ben Poole
Google Research
Abstract
We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions. Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision. Due to the scarcity of diverse, captioned 3D data, prior methods generate objects from only a handful of categories, such as those in ShapeNet. Instead, we guide generation with image-text models pre-trained on large datasets of captioned images from the web. Our method optimizes a Neural Radiance Field from many camera views so that rendered images score highly with a target caption according to a pre-trained CLIP model. To improve fidelity and visual quality, we introduce simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and new MLP architectures. In experiments, Dream Fields produce realistic, multi-view consistent object geometry and color from a variety of natural language captions.
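To make the objective concrete, here is a minimal JAX sketch of the optimization loop, not the released code. The renderer and the CLIP image encoder are toy stand-ins (the real method volume-renders a NeRF from randomly sampled camera poses and embeds views with a pre-trained CLIP model), and the τ and λ values are illustrative; only the loss structure follows the paper: maximize image-text similarity while encouraging mean ray transmittance up to a target τ.

```python
# Minimal sketch of the Dream Fields objective (stand-in renderer/encoder).
import jax
import jax.numpy as jnp

def render_view(params, rng):
    # Stand-in renderer: a random "camera" just perturbs learned pixels.
    # The real method volume-renders a NeRF from a sampled pose and also
    # returns per-ray transmittance (how visible the background is).
    noise = 0.01 * jax.random.normal(rng, params["pixels"].shape)
    image = jax.nn.sigmoid(params["pixels"] + noise)
    transmittance = jax.nn.sigmoid(params["trans"])
    return image, transmittance

def embed_image(image):
    # Stand-in for CLIP's image tower: a fixed random projection.
    w = jax.random.normal(jax.random.PRNGKey(0), (image.size, 512))
    return image.reshape(-1) @ w

def dream_fields_loss(params, rng, text_emb, tau=0.88, lam=0.5):
    image, transmittance = render_view(params, rng)
    img_emb = embed_image(image)
    sim = jnp.dot(img_emb / jnp.linalg.norm(img_emb),
                  text_emb / jnp.linalg.norm(text_emb))
    # Transmittance prior: reward average transmittance up to a target tau
    # (annealed in the paper); this carves away spurious floating density.
    reg = -jnp.minimum(tau, transmittance.mean())
    return -sim + lam * reg

# One optimization step: descend the loss w.r.t. the scene parameters.
params = {"pixels": jnp.zeros((64, 64, 3)), "trans": jnp.zeros((64, 64))}
text_emb = jax.random.normal(jax.random.PRNGKey(1), (512,))
loss, grads = jax.value_and_grad(dream_fields_loss)(
    params, jax.random.PRNGKey(2), text_emb)
params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
```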
Example generated objects
Dream Fields can be trained with diverse captions, either written by artists or drawn from COCO. Descriptions control the style of generated objects, such as color and context.
bouquet of flowers sitting in a clear glass vase.
a sculpture of a rooster.
a robotic dog. a robot in the shape of a dog.
matte painting of a castle made of cheesecake surrounded by a moat made of ice cream; trending on artstation; unreal engine. [ref]
a beautiful epic wonderous fantasy painting of the ocean. [ref]
matte painting of a bonsai tree; trending on artstation.
a cluster of pine trees are in a barren area.
a boat on the water tied down to a stake.
a small green vase displays some small yellow blooms.
a bus covered with assorted colorful graffiti on the side of it.
a pile of crab is seasoned and well cooked.
a tray that has meat and carrots on a table.
a snowboard standing upright in a snow bank.
Compositional generation
The compositional nature of language allows users to combine concepts in novel ways and control generation. A template prompt describing a primary object (an armchair or a teapot) is stylized with 16 materials: avocado, glacier, orchid, pikachu, brain coral, gourd, peach, rubik's cube, doughnut, hibiscus, peacock, sardines, fossil, lotus root, pig, or strawberry. These prompt templates are sourced from DALL-E; the full set of prompts is enumerated in the sketch after the templates.
an armchair in the shape of a ____.
an armchair imitating a ____.
a teapot in the shape of a ____.
a teapot imitating a ____.
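For concreteness, the 4 templates × 16 materials = 64 compositional prompts can be enumerated in a few lines of Python; the templates and materials below are copied from the lists above.

```python
# Enumerate the compositional prompts: 4 templates x 16 materials = 64.
templates = [
    "an armchair in the shape of a {}.",
    "an armchair imitating a {}.",
    "a teapot in the shape of a {}.",
    "a teapot imitating a {}.",
]
materials = [
    "avocado", "glacier", "orchid", "pikachu", "brain coral", "gourd",
    "peach", "rubik's cube", "doughnut", "hibiscus", "peacock", "sardines",
    "fossil", "lotus root", "pig", "strawberry",
]
prompts = [t.format(m) for t in templates for m in materials]
assert len(prompts) == 64
print(prompts[6])  # an armchair in the shape of a peach.
```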
Related publications
Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis
Ajay Jain, Matthew Tancik, Pieter Abbeel. ICCV 2021.
DietNeRF regularizes Neural Radiance Fields with a CLIP-based loss to improve 3D reconstruction. Given only a few images of an object or scene, we reconstruct its 3D structure & render novel views using prior knowledge contained in large image encoders.
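A minimal sketch of this kind of CLIP-based regularizer, with a stand-in image encoder and a hypothetical name (semantic_consistency_loss): embeddings of a rendered novel view should match embeddings of an observed photo of the same scene, even when the camera poses differ.

```python
# Sketch of a DietNeRF-style semantic consistency loss (stand-in encoder).
import jax
import jax.numpy as jnp

def embed(image):
    # Stand-in for CLIP's image tower: a fixed random projection,
    # normalized to unit length like CLIP embeddings.
    w = jax.random.normal(jax.random.PRNGKey(0), (image.size, 512))
    e = image.reshape(-1) @ w
    return e / jnp.linalg.norm(e)

def semantic_consistency_loss(rendered_view, observed_photo):
    # Novel views of the same object should embed near observed photos,
    # providing a training signal at poses with no ground-truth image.
    return -jnp.dot(embed(rendered_view), embed(observed_photo))
```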
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, Pratul Srinivasan. ICCV 2021.
NeRF is aliased, but we can anti-alias it by casting cones and prefiltering the positional encoding function. Dream Fields combine mip-NeRF's integrated positional encoding with Fourier features.
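A short sketch of integrated positional encoding, assuming a diagonal Gaussian over sample positions (mean and per-axis variance along a cast cone); the function name integrated_pos_enc is illustrative. Each frequency band is attenuated by the Gaussian's variance at that scale, which pre-filters the Fourier features as the cone widens.

```python
# Sketch of mip-NeRF's integrated positional encoding (IPE).
import jax.numpy as jnp

def integrated_pos_enc(mean, var, num_freqs=8):
    # mean, var: [..., 3] mean and per-axis variance of a Gaussian over
    # sample positions along a cone. Band 2^j is attenuated by
    # exp(-0.5 * 4^j * var), low-pass filtering high frequencies.
    scales = 2.0 ** jnp.arange(num_freqs)                  # [L]
    sm = mean[..., None, :] * scales[:, None]              # [..., L, 3]
    sv = var[..., None, :] * (scales[:, None] ** 2)
    atten = jnp.exp(-0.5 * sv)
    enc = jnp.concatenate([jnp.sin(sm) * atten, jnp.cos(sm) * atten], -1)
    return enc.reshape(enc.shape[:-2] + (-1,))             # [..., 6L]
```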
Citation
Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, Ben Poole. Zero-Shot Text-Guided Object Generation with Dream Fields. CVPR, 2022.
@article{jain2021dreamfields,
author = {Jain, Ajay and Mildenhall, Ben and Barron, Jonathan T. and Abbeel, Pieter and Poole, Ben},
title = {Zero-Shot Text-Guided Object Generation with Dream Fields},
journal = {CVPR},
year = {2022},
}