TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
Bu Jin1,2, Yupeng Zheng1,2*, Pengfei Li3, Weize Li3, Yuhang Zheng4, Sujie Hu3,
Xinyu Liu5, Jinwei Zhu3, Zhijie Yan3, Haiyang Sun2, Kun Zhan2, Peng Jia2,
Xiaoxiao Long6, Yilun Chen3, Hao Zhao3
1 CASIA, 2 Li Auto, 3 AIR, Tsinghua University, 4 Beihang University, 5 HKUST, 6 HKU
(* Indicates Corresponding Author)
Contact: jinbu18@mails.ucas.ac.cn, zhengyupeng2022@ia.ac.cn
Contributions
- We introduce the outdoor 3D dense captioning task: densely detecting and describing 3D objects, using LiDAR point clouds along with a set of panoramic RGB images as inputs (the input/output contract is sketched after this list). Its unique challenges are highlighted in Fig. 1.
- We provide the TOD3Cap dataset, containing 2.3M descriptions of 63.4k instances in outdoor scenes, and adapt existing state-of-the-art approaches to it for benchmarking.
- We show that our method outperforms the baselines adapted from representative indoor methods by a significant margin (+9.6 CIDEr@0.5IoU; the evaluation protocol is sketched below as well).
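To pin down the task interface referenced in the first item, here is a minimal sketch in hypothetical Python types. The field names and the dense_caption stub are illustrative assumptions for exposition, not an API defined by the paper:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class BoxCaption:
    """One output of outdoor 3D dense captioning: a 3D box plus its caption."""
    center: np.ndarray  # (x, y, z) position, e.g. in the LiDAR frame
    size: np.ndarray    # (length, width, height)
    yaw: float          # heading angle in radians
    caption: str        # natural-language description of the object

def dense_caption(
    lidar_points: np.ndarray,            # (N, 3+) LiDAR point cloud
    panoramic_images: List[np.ndarray],  # set of panoramic RGB views
) -> List[BoxCaption]:
    """Detect and describe all objects in the scene (stub signature only)."""
    raise NotImplementedError
```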
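The CIDEr@0.5IoU number follows the m@kIoU protocol common to 3D dense captioning benchmarks: a predicted caption is scored only when its box overlaps a ground-truth box with IoU at or above the threshold, and scores are averaged over all ground-truth objects. A minimal sketch of that aggregation, assuming hypothetical iou_3d and caption_score helpers:

```python
from typing import Callable, List, Tuple

Box = object  # placeholder type for a 3D bounding box

def m_at_k_iou(
    preds: List[Tuple[Box, str]],                # predicted (box, caption) pairs
    gts: List[Tuple[Box, str]],                  # ground-truth (box, caption) pairs
    iou_3d: Callable[[Box, Box], float],         # assumed 3D IoU helper
    caption_score: Callable[[str, str], float],  # e.g. a per-pair CIDEr scorer
    k: float = 0.5,
) -> float:
    """m@kIoU: average caption score over GT objects; a prediction
    contributes only if its box overlaps the GT box with IoU >= k."""
    total = 0.0
    for gt_box, gt_caption in gts:
        # Match each GT object to its best-overlapping prediction.
        best = max(preds, key=lambda p: iou_3d(p[0], gt_box), default=None)
        if best is not None and iou_3d(best[0], gt_box) >= k:
            total += caption_score(best[1], gt_caption)
        # Otherwise this object contributes 0.
    return total / max(len(gts), 1)
```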
Teaser of TOD3Cap

Fig. 1: We introduce the task of 3D dense captioning in outdoor scenes (right). Given point clouds (right middle) and multi-view RGB inputs (right top), we predict box-caption pairs for all objects in a 3D outdoor scene. Several fundamental domain gaps (middle column) separate indoor and outdoor scenes, including Status, Point Cloud, Perspective, and Scene Area, bringing new challenges specific to outdoor scenes. Meanwhile, our outdoor 3D dense captioning (right bottom) covers more comprehensive concepts than its indoor counterpart (left bottom).
Pipeline of TOD3Cap

Fig. 2: Architecture of our proposed TOD3Cap network. First, BEV features are extracted from the 3D LiDAR point cloud and the 2D multi-view images, and a query-based detection head generates a set of 3D object proposals from these BEV features. Second, to capture relational information, we employ a Relation Q-Former in which each object interacts with the other objects and the surrounding environment to obtain context-aware features. Finally, an Adapter converts these features into prompts for the language model, which generates the dense captions. This formulation requires no re-training of the language model.
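To make the flow concrete, here is a minimal PyTorch sketch of the Fig. 2 pipeline. Everything in it is an illustrative assumption rather than the released implementation: the BEV backbone is omitted, and the module names (RelationQFormer, TOD3CapSketch), dimensions, and single-layer attention blocks are placeholders standing in for the actual network and the frozen language model.

```python
import torch
import torch.nn as nn

class RelationQFormer(nn.Module):
    """Hypothetical stand-in: object queries attend to each other
    (object-object relations) and to the BEV map (object-environment
    relations) to produce context-aware features."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, obj_queries: torch.Tensor, bev: torch.Tensor) -> torch.Tensor:
        x, _ = self.self_attn(obj_queries, obj_queries, obj_queries)
        x, _ = self.cross_attn(x, bev, bev)
        return x + self.ffn(x)

class TOD3CapSketch(nn.Module):
    """End-to-end flow of Fig. 2 with placeholder modules."""
    def __init__(self, d_model: int = 256, n_queries: int = 100, lm_dim: int = 4096):
        super().__init__()
        self.det_queries = nn.Embedding(n_queries, d_model)  # query-based proposals
        self.relation_qformer = RelationQFormer(d_model)
        self.adapter = nn.Linear(d_model, lm_dim)  # features -> LM prompt space

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (B, H*W, d_model), fused from LiDAR points and multi-view
        # images by a BEV backbone that is omitted in this sketch.
        q = self.det_queries.weight.unsqueeze(0).expand(bev.size(0), -1, -1)
        obj = self.relation_qformer(q, bev)
        # The resulting prompts condition a frozen language model,
        # so the LM itself needs no re-training.
        return self.adapter(obj)

prompts = TOD3CapSketch()(torch.randn(2, 64 * 64, 256))
print(prompts.shape)  # torch.Size([2, 100, 4096]), one prompt per proposal
```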
Qualitative Results

Fig. 3: Qualitative results of our proposed TOD3Cap network. The top left shows our predicted bounding boxes and corresponding captions in the first row and the ground truth in the second row. The top right shows our predicted bounding boxes in blue and the ground-truth bounding boxes in red. At the bottom, incorrect descriptions are marked in red. The TOD3Cap network produces impressive results apart from a few mistakes.
Acknowledgement
We would like to thank Dave Zhenyu Chen at Technical University of Munich for his valuable proofreading and insightful suggestions. We would also like to thank Lijun Zhou and the student volunteers at Li Auto for their efforts in building the TOD3Cap dataset.
BibTeX
@article{jin2024tod3cap,
  title={TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes},
  author={Jin, Bu and Zheng, Yupeng and Li, Pengfei and Li, Weize and Zheng, Yuhang and Hu, Sujie and Liu, Xinyu and Zhu, Jinwei and Yan, Zhijie and Sun, Haiyang and others},
  journal={arXiv preprint arXiv:2403.19589},
  year={2024}
}