Cap3D
View Selection for 3D Captioning via Diffusion Ranking
Tiange Luo, Justin Johnson†, Honglak Lee† (†: equal advising)
Scalable 3D Captioning with Pretrained Models
Tiange Luo*, Chris Rockwell*, Honglak Lee†, Justin Johnson† (*: equal contribution, †: equal advising)
- NeurIPS 2023
- Paper |
- Code |
- Dataset |
- Slides |
- Poster (ICCV Workshop) |
- BibTeX
=Resources
- Data is hosted at [Huggingface], including 1,006,782 descriptive captions for 3D objects in Objaverse and Objaverse-XL, each associated with a point cloud (16,384 colored points) and 20 rendered images with camera details (intrinsics & extrinsics), depth maps, and masks.
- Our code for captioning, rendering, and view selection is released at [Github]
- Our code for finetuning text-to-3D models is released at [Github]
- Some of our fine-tuned model checkpoints can be found at [Huggingface].
- Compositional and general descriptive captions for 3D objects in the ABO dataset are at [Huggingface]
- General descriptive captions for 3D objects in the ShapeNet dataset are at [Huggingface]
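As a minimal sketch of how the released captions can be consumed: the data pairs each object UID with a descriptive caption, which maps naturally to a two-column CSV. The exact repo and file layout on Hugging Face should be checked; the inline sample below is illustrative, not real data.

```python
import csv
import io

# Illustrative two-column (UID, caption) layout; the real files on the
# Hugging Face Hub may use a different name or format.
sample = io.StringIO(
    'abcd1234,"A small wooden chair with four legs."\n'
    'efgh5678,"A red toy car."\n'
)

# Build a UID -> caption lookup table.
captions = {uid: cap for uid, cap in csv.reader(sample)}
print(captions["efgh5678"])  # -> A red toy car.
```

The same dictionary-style lookup applies regardless of whether the captions are fetched as a CSV or loaded through a dataset library.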
=Overview
Our experimental findings indicate that the choice of rendered views significantly impacts the performance of 3D captioning with image-based models such as BLIP2 and GPT4-Vision. Notably, our method, which uses 8 rendered views, achieves higher-quality, more detailed captions with less hallucination than GPT4-Vision with 28 views.
Randomly sampled examples of views selected by DiffuRank. The left column shows the top-6 views as ranked by DiffuRank, while the right column shows the bottom-6. We adopt two different kinds of rendering and observe that DiffuRank selects the views whose rendering best highlights object features.
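The selection step itself can be sketched as a simple top-k ranking: each rendered view receives an alignment score (in DiffuRank this score is derived from a pretrained diffusion model; the scores below are a hypothetical stand-in), and the highest-scoring views are kept for captioning.

```python
def select_views(scores, k=8):
    """Return indices of the k highest-scoring views, best first.

    `scores` is one alignment score per rendered view; in DiffuRank these
    come from a pretrained diffusion model, but any scalar score works here.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical per-view scores for 10 renderings of one object.
scores = [0.91, 0.12, 0.77, 0.45, 0.88, 0.30, 0.64, 0.52, 0.83, 0.20]
print(select_views(scores, k=3))  # -> [0, 4, 8]
```

This is only the final ranking step; the substance of DiffuRank lies in how the per-view scores are computed, which the paper describes.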
=Related Publication
Objaverse: A Universe of Annotated 3D Objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi
ABO: Dataset and Benchmarks for Real-World 3D Object Understanding
Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, Jitendra Malik
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
Objaverse-XL: A Universe of 10M+ 3D Objects
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi