VideoComp provides a benchmark for training and evaluating fine-grained video-text compositionality. It tests a model's ability to capture compositional and temporal coherence in multi-event videos by introducing subtle disruptions into standard video-text pairs, such as temporal reordering, action word replacement, and segment-level mismatch, enabling evaluation of models' alignment capabilities.
This release includes two benchmark datasets:
- ActivityNet-Comp
- YouCook2-Comp
These datasets extend the dense video captioning datasets ActivityNet-Captions and YouCook2 by generating LLM-rewritten multi-event paragraphs for both positive (coherent) and negative (disrupted) video-text pairs.
For more details, see our CVPR 2025 paper.
We release .json annotation files for both training and validation splits.
Each file contains a list of entries with the following fields:
NOTE: For the "segment-level mismatch" task, the input video should be trimmed to the interval `query_video/{start,end}_time` within the original video (`original_video/{start,end}_time`).
- `key`: Unique identifier
- `video_id`: YouTube video ID
- `type`: Disruption type
- `original_video/start_time`, `original_video/end_time`: Start/end time of the original video
- `query_video/start_time`, `query_video/end_time`: Start/end time of the query video (for actual train/eval)
- `positive_text`: LLM-rewritten paragraph with chronologically ordered events
- `negative_text`: LLM-rewritten paragraph with a targeted disruption
- `positive_text/start_time`, `positive_text/end_time`: Start/end time for positive text
- `negative_text/start_time`, `negative_text/end_time`: Start/end time for negative text
- `question`: Question (for LLM evaluation)
- `answer`: Answer (for LLM evaluation)
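As a minimal illustration of how the annotations might be consumed, the sketch below loads one file and trims the query interval for an entry with ffmpeg. The file name, video directory, output naming, and the assumption that the slash-separated field names above are literal keys in each entry are all hypothetical choices for this example, not part of the release.

```python
import json
import subprocess

# Hypothetical paths; adjust to wherever the annotations and source videos live.
ANNOTATION_FILE = "activitynet_comp_val.json"
VIDEO_DIR = "videos"

with open(ANNOTATION_FILE) as f:
    entries = json.load(f)  # a list of dicts with the fields listed above

for entry in entries:
    # Assumes slash-separated names ("query_video/start_time", ...) are flat keys.
    video_path = f"{VIDEO_DIR}/{entry['video_id']}.mp4"
    start = entry["query_video/start_time"]
    end = entry["query_video/end_time"]

    # For the "segment-level mismatch" task, trim the original video to the
    # query interval before feeding it to the model. Stream copy is fast but
    # keyframe-aligned; re-encode instead if frame accuracy matters.
    trimmed_path = f"{VIDEO_DIR}/{entry['key']}_query.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ss", str(start), "-to", str(end),
         "-c", "copy", trimmed_path],
        check=True,
    )
```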
We report binary classification accuracy. For CLIP-style models, the task is to compare the similarity between the video and each of the positive and negative texts and predict which one is the better match. For generative models, we format the task as a binary-choice question using the `question` field and check whether the model's output matches the correct answer, i.e. `f"{answer}" == result` or `f"({answer})" == result`. The "all" metric is computed as the product of the individual binary accuracies for "temporal reordering", "action word replacement", and "segment-level mismatch".
If you use this dataset or benchmark in your work, please cite:
@inproceedings{videocomp25,
  title={VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models},
  author={Dahun Kim and AJ Piergiovanni and Ganesh Mallya and Anelia Angelova},
  booktitle={CVPR},
  year={2025}
}
Copyright 2025 Google LLC
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.