Cap3D
View Selection for 3D Captioning via Diffusion Ranking
Tiange Luo, Justin Johnson†, Honglak Lee† (†: equal advising)
Scalable 3D Captioning with Pretrained Models
Tiange Luo*, Chris Rockwell*, Honglak Lee†, Justin Johnson† (*: equal contribution, †: equal advising)
- NeurIPS 2023
- Paper |
- Code |
- Dataset |
- Slides |
- Poster (ICCV Workshop) |
- BibTeX
=Resources
- Data is hosted at [Huggingface], including 1,006,782 descriptive captions for 3D objects in Objaverse and Objaverse-XL, each associated with a point cloud (16,384 colored points) and 20 rendered images with camera details (intrinsics & extrinsics), depth maps, and masks.
- Our code for captioning, rendering, and view selection is released at [Github]
- Our code for finetuning text-to-3D models is released at [Github]
- Some of our fine-tuned model checkpoints can be found at [Huggingface].
- Compositional and general descriptive captions for 3D objects in the ABO dataset are at [Huggingface]
- General descriptive captions for 3D objects in the ShapeNet dataset are at [Huggingface]
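As a minimal sketch of how the released captions can be consumed: the data pairs each object UID with a descriptive caption, which maps naturally to a two-column CSV. The exact repo and file layout on Hugging Face should be checked; the inline sample below is illustrative, not real data.

```python
import csv
import io

# Illustrative two-column (UID, caption) layout; the real files on the
# Hugging Face Hub may use a different name or format.
sample = io.StringIO(
    'abcd1234,"A small wooden chair with four legs."\n'
    'efgh5678,"A red toy car."\n'
)

# Build a UID -> caption lookup table.
captions = {uid: cap for uid, cap in csv.reader(sample)}
print(captions["efgh5678"])  # -> A red toy car.
```

The same dictionary-style lookup applies regardless of whether the captions are fetched as a CSV or loaded through a dataset library.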
=Overview
Our experimental findings indicate that the choice of rendered views significantly impacts the performance of 3D captioning with image-based models such as BLIP2 and GPT4-Vision. Notably, our method, which uses 8 rendered views, achieves higher-quality, more detailed captions with less hallucination than GPT4-Vision with 28 views.
Randomly sampled examples of views selected by DiffuRank. The left column shows the top-6 views as ranked by DiffuRank, while the right column shows the bottom-6. We adopt two different kinds of rendering and observe that DiffuRank selects the views whose rendering best highlights object features.
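The selection step itself can be sketched as a simple top-k ranking: each rendered view receives an alignment score (in DiffuRank this score is derived from a pretrained diffusion model; the scores below are a hypothetical stand-in), and the highest-scoring views are kept for captioning.

```python
def select_views(scores, k=8):
    """Return indices of the k highest-scoring views, best first.

    `scores` is one alignment score per rendered view; in DiffuRank these
    come from a pretrained diffusion model, but any scalar score works here.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical per-view scores for 10 renderings of one object.
scores = [0.91, 0.12, 0.77, 0.45, 0.88, 0.30, 0.64, 0.52, 0.83, 0.20]
print(select_views(scores, k=3))  # -> [0, 4, 8]
```

This is only the final ranking step; the substance of DiffuRank lies in how the per-view scores are computed, which the paper describes.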
=Related Publication
Objaverse: A Universe of Annotated 3D Objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi
ABO: Dataset and Benchmarks for Real-World 3D Object Understanding
Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, Jitendra Malik
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
Objaverse-XL: A Universe of 10M+ 3D Objects
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi