AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
A more efficient multimodal large language model series.
ICLR 2025
Video Detail Caption Leaderboard
🤩 Welcome! Submit your scores now and watch the leaderboard refresh with your achievements!
Please remember to report your frame rate and tokens per frame with each submission.
We present a quantitative comparison between AuroraCap and existing state-of-the-art large multimodal models across the various sections of structured captions in VDC. We use LLaMA-3.1-8B as the LLM evaluation assistant. #F denotes the number of frames sampled from the input video, and TPF the number of visual tokens per frame. The average number of key frames in VDC is 10.
We thank Aria team, VideoChat-Flash team, LLaVAction team, Apollo team, and Cockatiel team for their contributions to the leaderboard.
VDC Example
Abstract
Baseline: Video detailed captioning is a key task which aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement the token merging strategy, reducing the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2).
Benchmark and Metric: However, existing video caption benchmarks only include simple descriptions, consisting of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose a new LLM-assisted metric, VDCscore, for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark better correlates with human judgments of video detailed captioning quality.
AuroraCap: An Efficient and Performant Video Detailed Captioner
Architecture
LLaVA.
Token merging.
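AuroraCap keeps the LLaVA-style architecture and reduces the number of visual tokens via token merging. As a rough, hypothetical illustration of ToMe-style bipartite soft matching (merging the most similar token pairs by cosine similarity), consider the sketch below; it is not AuroraCap's actual implementation, which also handles batching and weighted averaging of merged tokens.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs (ToMe-style bipartite soft matching).

    x: visual tokens of shape (N, C); returns roughly (N - r, C) tokens.
    Simplified sketch: no batching, and collisions (several tokens mapping to
    the same destination) are not specially handled.
    """
    # Split tokens into two alternating sets A and B.
    a, b = x[::2], x[1::2]

    # Cosine similarity between every token in A and every token in B.
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # (|A|, |B|)

    # For each token in A, find its best match in B.
    best_val, best_idx = scores.max(dim=-1)

    # Keep only the r highest-scoring pairs for merging.
    merge_order = best_val.argsort(descending=True)
    merged_a = merge_order[:r]   # indices in A to merge away
    kept_a = merge_order[r:]     # indices in A to keep

    # Average each merged A token into its matched B token.
    b = b.clone()
    dst = best_idx[merged_a]
    b[dst] = (b[dst] + a[merged_a]) / 2

    # Remaining tokens: unmerged A tokens plus (possibly updated) B tokens.
    return torch.cat([a[kept_a], b], dim=0)
```

The fraction of visual tokens kept per frame after merging corresponds to the tokens-per-frame (TPF) values reported in the leaderboard.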
Training Recipe
We use over 20 million high-quality image/video-text pairs to train AuroraCap in three stages. The training datasets are released on HuggingFace.
Pretraining stage. We first align visual features with the word embedding space of LLMs. To achieve this, we freeze the pretrained ViT and LLM, training solely the vision-language connector.
Vision stage. During the vision stage, we unfreeze the pretrained ViT while keeping the LLM frozen, and train on public data from a variety of computer vision tasks for better generalization.
Language stage. Finally, we conduct end-to-end training, in which all components are trainable, using the highest-quality public data during the language stage.
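As a minimal sketch of this recipe, the snippet below toggles which modules are trainable at each stage; the module names (`vit`, `connector`, `llm`) and the assumption that the connector stays trainable in the vision stage are illustrative placeholders, not AuroraCap's actual code.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: str) -> None:
    """Freeze/unfreeze modules following the three-stage recipe (sketch).

    Assumes the model exposes `vit`, `connector`, and `llm` submodules
    (placeholder names for this illustration).
    """
    trainable = {
        "pretraining": {"connector"},                 # align visual features with LLM embeddings
        "vision":      {"vit", "connector"},          # unfreeze ViT, keep LLM frozen
        "language":    {"vit", "connector", "llm"},   # end-to-end training
    }[stage]

    for name in ("vit", "connector", "llm"):
        module = getattr(model, name)
        for p in module.parameters():
            p.requires_grad = name in trainable
```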
VDC: A New Video Detailed Captioning Benchmark
Benchmark Collection and Processing
Video collection and processing.
We build VDC upon Panda-70M.
| Dataset | Theme | # Video | # Clip | # Caption | # Word | # Vocab. | Avg. Length |
|---|---|---|---|---|---|---|---|
| MSVD | Open | 1,970 | 1,970 | 70,028 | 607,339 | 13,010 | 8.67 |
| MSR-VTT | Open | 7,180 | 10,000 | 200,000 | 1,856,523 | 29,316 | 9.28 |
| ActivityNet | Open | 20,000 | 100,000 | 100,000 | 1,340,000 | 15,564 | 13.40 |
| S-MiT | Open | 515,912 | 515,912 | 515,912 | 5,618,064 | 50,570 | 10.89 |
| M-VAD | Movie | 92 | 48,986 | 55,905 | 519,933 | 18,269 | 9.30 |
| MPII-MD | Movie | 94 | 68,337 | 68,375 | 653,467 | 24,549 | 9.56 |
| Youcook2 | Cooking | 2,000 | 15,400 | 15,400 | 121,418 | 2,583 | 7.88 |
| Charades | Human | 9,848 | 10,000 | 27,380 | 607,339 | 13,000 | 22.18 |
| VATEX | Open | 41,300 | 41,300 | 413,000 | 4,994,768 | 44,103 | 12.09 |
| VDC (ours) | Open | 1,027 | 1,027 | 1,027 | 515,441 | 20,419 | 500.91 |
Structured detailed caption construction pipeline. We develop a structured detailed caption construction pipeline to generate detailed descriptions from multiple perspectives, significantly extending caption length and richness compared to previous benchmarks. The structured detailed captions include camera, short, background, main object, and detailed captions.
- Camera caption. Describe the camera work in detail, including shot types, angles, movements, transitions, and any special effects used to enhance the video.
- Short caption. Summarize the video in one detailed sentence, capturing key actions and the overall mood.
- Background caption. Provide a detailed description of the background, including objects, location, weather, time, and any dynamic elements.
- Main object caption. Give a thorough description of the main subject's actions, attributes, interactions, and movements throughout the video frames.
- Detailed caption. Generate a detailed, vivid caption for the video, covering all categories, ensuring it's engaging, informative, and rich enough for AI to recreate the video content.
To generate detailed, fine-grained, and accurate captions, we leverage GPT-4o to produce video descriptions. We design a hierarchical prompt strategy to efficiently obtain accurate structured captions and detailed captions in two conversation rounds: (1) Structured Captions Generation and (2) Detailed Captions Integration.
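A minimal sketch of the two-round conversation with the OpenAI Python client is shown below; the prompt wording, the handling of video frame inputs (omitted here), and the response parsing are illustrative placeholders rather than the exact pipeline used to build VDC.

```python
from openai import OpenAI

client = OpenAI()

def structured_then_detailed(frame_summaries: str) -> tuple[str, str]:
    """Two-round hierarchical prompting: (1) structured captions, (2) detailed caption.

    `frame_summaries` stands in for the actual video/frame inputs, which are
    omitted in this sketch.
    """
    messages = [
        {"role": "system", "content": "You are a careful video captioning assistant."},
        # Round 1: structured captions (camera / short / background / main object).
        {"role": "user", "content": (
            "Given the video content below, write four captions: a camera caption, "
            "a short caption, a background caption, and a main object caption.\n\n"
            + frame_summaries
        )},
    ]
    round1 = client.chat.completions.create(model="gpt-4o", messages=messages)
    structured = round1.choices[0].message.content

    # Round 2: integrate the structured captions into one detailed caption.
    messages += [
        {"role": "assistant", "content": structured},
        {"role": "user", "content": (
            "Integrate the captions above into a single detailed, vivid caption "
            "that covers the camera work, background, and main subject."
        )},
    ]
    round2 = client.chat.completions.create(model="gpt-4o", messages=messages)
    return structured, round2.choices[0].message.content
```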
VDCscore: Evaluating Detailed Captions with LLMs
We introduce VDCscore, a novel quantitative metric that utilizes LLMs to evaluate the similarity between predicted and ground-truth detailed captions through a divide-and-conquer approach. The core idea of VDCscore is to decompose long detailed captions into multiple short question-answer pairs and average the evaluation of each pair to obtain the final score.
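A minimal sketch of this divide-and-conquer loop is shown below; `ask_llm` is a placeholder for the evaluation assistant (the leaderboard uses LLaMA-3.1-8B), and the prompts are illustrative, not the exact VDCscore prompts.

```python
def vdcscore(gt_caption: str, pred_caption: str, ask_llm) -> float:
    """Divide-and-conquer caption evaluation (VDCscore-style sketch).

    1. Decompose the ground-truth caption into short question-answer pairs.
    2. Answer each question using only the predicted caption.
    3. Ask the LLM to judge each answer as correct (1) or incorrect (0).
    The final score is the average over all pairs. `ask_llm(prompt) -> str`
    is a placeholder for the evaluation assistant.
    """
    # Step 1: generate QA pairs from the ground-truth caption.
    qa_text = ask_llm(
        "Decompose the following caption into short question-answer pairs, "
        "one 'Q: ... A: ...' pair per line.\n\n" + gt_caption
    )
    pairs = []
    for line in qa_text.splitlines():
        if "Q:" in line and "A:" in line:
            q, a = line.split("A:", 1)
            pairs.append((q.replace("Q:", "").strip(), a.strip()))

    if not pairs:
        return 0.0

    # Steps 2-3: answer from the predicted caption, then judge correctness.
    correct = 0
    for question, reference in pairs:
        answer = ask_llm(
            f"Answer the question using only this caption:\n{pred_caption}\n\n"
            f"Question: {question}"
        )
        verdict = ask_llm(
            f"Question: {question}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Reply with 1 if the candidate matches the reference, else 0."
        )
        correct += int(verdict.strip().startswith("1"))

    return correct / len(pairs)
```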
Evaluation
Benchmarking video detailed captioning.
AuroraCap achieves superior performance in video detailed captioning while using significantly fewer visual tokens than other models, highlighting its efficiency.
Case Study
We perform an extensive case study of AuroraCap on a variety of videos for video detailed captioning. As the following examples show, AuroraCap provides excellent detailed captions covering camera motion, background, and the main object, with less hallucination.
Our Related Work
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
The first long-form video open-ended benchmark.
BibTeX
@article{auroracap,
  title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
  author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jenq-Neng Hwang and Saining Xie and Christopher D. Manning},
  year={2024},
  journal={arXiv preprint arXiv:2410.03051},
}