We release the UTD dataset. Check the project webpage to download it. Our UTD dataset includes:
- UTD-descriptions: Frame-level annotations for ~2 million videos, covering four conceptual categories visible in video frames: objects, activities, verbs, and objects+composition+activities (namely, detailed free-form frame descriptions). Descriptions are provided for 8 uniformly sampled frames from the training and test/val splits of 12 widely used activity recognition and text-to-video retrieval datasets.
- UTD-splits: Object-debiased test/val splits for all 12 datasets, i.e., subsets in which object-biased samples have been removed. For the 6 activity recognition datasets, we also provide debiased-balanced splits that preserve the class distribution while removing the most object-biased samples.
Included datasets:
- Activity recognition: UCF101, SSv2, Kinetics-400, Kinetics-600, Kinetics-700, Moments in Time
- Text-to-video retrieval: MSR-VTT, YouCook2, DiDeMo, LSMDC, ActivityNet Captions, Spoken Moments in Time
👉 Download and read more on our webpage.
We benchmarked 30 video models for action classification and text-to-video retrieval on 12 datasets, using both original and UTD-debiased splits. We include models from VideoMAE, VideoMAE2, All-in-one, Unmasked Teacher (UMT), VideoMamba, and InternVid.
For action classification, we follow the VideoGLUE setup: using a frozen backbone and training a lightweight pooling head.
For text-to-video retrieval, we report zero-shot performance and follow model-specific reranking procedures.
We provide all model predictions so that the metrics (e.g., Top-1 accuracy, Recall@1) can be computed on any split.
📥 Download predictions into ./videoglue_predictions/: GDrive link.
📥 Download UTD splits into ./UTD_splits/ following the instructions on our project webpage.
Accuracy of videomae-B-UH on the full test set of UCF101:

python utd/videoglue/get_split_results.py \
--pred_csv videoglue_predictions/ucf_videomae-B-UH/valid_49ep.csv \
--splits_path UTD_splits/splits_ucf_testlist01.json \
--split_name full

and on our UTD debiased-balanced split:

python utd/videoglue/get_split_results.py \
--pred_csv videoglue_predictions/ucf_videomae-B-UH/valid_49ep.csv \
--splits_path UTD_splits/splits_ucf_testlist01.json \
--split_name debiased-balanced

💡 You can also evaluate on your own custom splits by providing a JSON file with the split IDs.
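If you want to build such a custom split, the snippet below shows one way to derive it from a released split file. The exact JSON schema is an assumption here (we mirror the released UTD split files, which appear to map split names to collections of sample IDs), and the split name my_custom_split is hypothetical; check a downloaded splits_*.json for the authoritative format.

```python
import json

# Inspect a released UTD split file (assumed structure: split name -> sample IDs).
with open("UTD_splits/splits_ucf_testlist01.json") as f:
    splits = json.load(f)
for name, ids in splits.items():
    print(f"{name}: {len(ids)} samples")

# Write a custom split in the same (assumed) format, e.g. a small subset of "full".
custom = {"my_custom_split": list(splits["full"])[:500]}  # hypothetical split name
with open("UTD_splits/splits_ucf_custom.json", "w") as f:
    json.dump(custom, f)
```

It can then be evaluated with the same command as above by passing --splits_path UTD_splits/splits_ucf_custom.json and --split_name my_custom_split.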
We tested the environment with CUDA 11.6 only. Set up the CUDA path depending on your system (needed for VideoMamba):

module load cuda/11.6

or

cuda_version=11.6
export PATH=/usr/lib/cuda-${cuda_version}/bin/:${PATH}
export LD_LIBRARY_PATH=/usr/lib/cuda-${cuda_version}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_PATH=/usr/lib/cuda-${cuda_version}/
export CUDA_HOME=/usr/lib/cuda-${cuda_version}/

Create the Python environment:
conda create -n videoglue python=3.9 -y
conda activate videoglue
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge -y
pip install -r requirements_videoglue.txt
cd third_party/VideoMamba
pip install -e causal-conv1d/ # only for videomamba
pip install -e mamba/ # only for videomamba
cd -
pip install neptune==1.10.1 neptune-tensorboard==1.0.3
pip install -e .

Follow DATASETS.md for instructions on preparing all 12 datasets.
Download model weights from official repositories:
VideoMAE, VideoMAE2, All-in-one, UMT, VideoMamba, InternVid
and store them in pretrained/{videomae, videomaev2, umt, videomamba, internvid, allinone}.
The official checkpoints of All-in-one and InternVid require some internal dependencies to load.
We removed those dependencies and provide the cleaned models here:
👉 Cleaned Checkpoints
We follow VideoGLUE and train a lightweight classifier head on frozen video features.
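For intuition, such a head is just a small pooling-plus-classification module trained on top of frozen backbone features. Below is a minimal PyTorch sketch of the idea; it is illustrative only (the actual head, pooling strategy, and training recipe are those of train_classifier.py and the VideoGLUE setup, and the class and variable names here are made up).

```python
import torch
import torch.nn as nn

class PooledClassifierHead(nn.Module):
    """Illustrative lightweight head: pool frozen backbone tokens, then classify."""
    def __init__(self, feat_dim: int, nb_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.fc = nn.Linear(feat_dim, nb_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, feat_dim) produced by a frozen video backbone
        pooled = self.norm(feats.mean(dim=1))  # simple mean pooling as a stand-in
        return self.fc(pooled)

# Only the head's parameters are trained; the backbone stays frozen.
head = PooledClassifierHead(feat_dim=768, nb_classes=101)  # e.g. ViT-B features, UCF101
dummy_feats = torch.randn(2, 1568, 768)
logits = head(dummy_feats)  # (2, 101)
```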
⚡ Quick Start: Evaluation example (UCF101, VideoMAE-B-K400):
We release a set of trained lightweight classifiers:
📥 Download models into ./videoglue_models/: GDrive link.
Note: --debiased_splits_path is optional.
dataset_name='ucf'
prefix='data/UCF101/original/videos/'
data_path='metadata/videoglue/ucf/'
dataset_type='Kinetics_sparse'
debiased_splits_path='UTD_splits/splits_ucf_testlist01.json'
nb_classes=101
backbone_name='videomae-B-K400'
backbone_parameters='--backbone videomae --model vit_base_patch16_224 --num_frames 16 --backbone_checkpoint pretrained/videomae/videomae_checkpoint_pretrain_kin400.pth'
export MASTER_PORT=$((12000 + $RANDOM % 20000))
export OMP_NUM_THREADS=1
export PYTHONPATH="./third_party/:$PYTHONPATH"
output_dir='videoglue_models/'  # important: the trained classifier checkpoint is loaded from this directory
srun python utd/videoglue/train_classifier.py \
${backbone_parameters} \
--exp_name ${dataset_name}_${backbone_name} \
\
--data_path ${data_path} \
--prefix ${prefix} \
--data_set ${dataset_type} \
--debiased_splits_path ${debiased_splits_path} \
--split ',' \
--nb_classes ${nb_classes} \
--log_dir 'output/videoglue/logs' \
--output_dir ${output_dir} \
\
--batch_size 64 \
--val_batch_size_mul 0.5 \
--num_sample 1 \
--input_size 224 \
--short_side_size 224 \
--num_workers 16 \
\
--no_test \
--dist_eval \
--enable_deepspeed \
--layer_decay 1.0 \
\
--eval

🧪 See scripts/eval_classifier.sh for more examples.
Training example (UCF101, VideoMAE-B-K400):
Note: --debiased_splits_path is optional.
# Set neptune.ai keys (optional)
neptune_api_token=''
neptune_project=''
dataset_name='ucf'
prefix='data/UCF101/original/videos/'
data_path='metadata/videoglue/ucf/'
dataset_type='Kinetics_sparse'
nb_classes=101
backbone_name='videomae-B-K400'
backbone_parameters='--backbone videomae --model vit_base_patch16_224 --num_frames 16 --backbone_checkpoint pretrained/videomae/videomae_checkpoint_pretrain_kin400.pth'
export MASTER_PORT=12345
export OMP_NUM_THREADS=1
export PYTHONPATH="./third_party/:$PYTHONPATH"
srun python utd/videoglue/train_classifier.py \
${backbone_parameters} \
--exp_name ${dataset_name}_${backbone_name} \
\
--data_path ${data_path} \
--prefix ${prefix} \
--data_set ${dataset_type} \
--debiased_splits_path UTD_splits/splits_ucf_testlist01.json \
--split ',' \
--nb_classes ${nb_classes} \
--log_dir 'output/videoglue/logs' \
--output_dir 'output/videoglue/models' \
\
--ep 50 \
--lr 0.001 \
--weight_decay 0.05 \
--batch_size 64 \
--val_batch_size_mul 0.5 \
--num_sample 1 \
--input_size 224 \
--short_side_size 224 \
--num_workers 16 \
\
--aspect_ratio 0.5 2.0 \
--area_ratio 0.3 1.0 \
--aa rand-m9-n2-mstd0.5 \
--reprob 0 \
--mixup 0.8 \
--cutmix 1.0 \
\
--warmup_epochs 5 \
--opt adamw \
--opt_betas 0.9 0.999 \
\
--no_test \
--dist_eval \
--enable_deepspeed \
--layer_decay 1.0 \
\
--enable_neptune \
--neptune_api_token ${neptune_api_token} \
--neptune_project ${neptune_project}
# disable the last block if you don't want to turn on neptune logging

🧪 We use the same hyperparameters across all models/datasets. See scripts/train_classifier.sh for more examples.
For text-to-video retrieval, we report zero-shot performance and follow model-specific reranking procedures.
Example (msrvtt, umt-b-5M):
Note: --debiased_splits_path is optional.
export PYTHONPATH="./third_party/:$PYTHONPATH"
python utd/videoglue/eval_retrieval.py \
--backbone_name umt-b-5M \
--backbone umt \
--model vit_b16 \
--backbone_checkpoint pretrained/umt/b16_5m.pth \
--dataset msrvtt \
--split test \
--debiased_splits_path UTD_splits/splits_msrvtt_test.json \
--dataset_root data/msrvtt/ \
--dataset_metadata_root metadata/msrvtt/ \
--encoder_batch_size 32 --fusion_batch_size 32 \
--output_root 'output/retrieval'

🧪 See scripts/evaluate_retrieval.sh for commands for all models and datasets.
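For reference, the reported retrieval metrics can be computed from a text-video similarity matrix as sketched below. This is a generic illustration, not the repository's evaluation code, which additionally applies the model-specific reranking/fusion steps mentioned above.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sim[i, j] = similarity of text query i to video j; ground truth is the diagonal."""
    order = np.argsort(-sim, axis=1)  # videos ranked per query, best first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) for i in range(sim.shape[0])])
    return {f"R@{k}": float((ranks < k).mean() * 100) for k in ks}

# toy example: 4 text queries vs. 4 videos
sim = np.random.default_rng(0).normal(size=(4, 4))
print(recall_at_k(sim))
```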
conda create -n utd python=3.10 -y
conda activate utd
pip install -r requirements_utd.txt
cd third_party/LLaVA
pip install -e .
cd -
pip install protobuf==4.24.4 ffmpeg-python==0.2.0 scikit-image opencv_python==4.6.0.66 decord==0.6.0 av==10.0.0 sentence-transformers==2.7.0
pip install -e .

In our UTD dataset, we provide textual descriptions for all 12 datasets. For a quick start, you can use them to compute common sense representation bias.
📥 Download UTD descriptions into ./UTD_descriptions/ following the instructions on our project webpage.
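To sanity-check the download, the description files can be loaded directly with pickle. The snippet assumes they map video IDs to the corresponding textual descriptions; inspect the loaded object, since the exact per-frame structure may vary by concept.

```python
import pickle

# e.g. the 'objects' descriptions for the MSRVTT test split
with open("UTD_descriptions/msrvtt_test_objects.pickle", "rb") as f:
    descriptions = pickle.load(f)

print(type(descriptions), len(descriptions))
first_key = next(iter(descriptions))       # a video ID (assumed)
print(first_key, descriptions[first_key])  # its description(s)
```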
Example: compute common sense bias for MSRVTT (objects, seq_of_f aggregation):
Note: use the --retrieval flag only for text-to-video retrieval datasets.
dataset="msrvtt"
split="test"
concept="objects"
temporal_aggregation_type="seq_of_f"
python utd/utd/compute_common_sense_bias.py \
--retrieval \
--dataset ${dataset} --split ${split} \
--temporal_aggregation_type ${temporal_aggregation_type} \
--concept ${concept} \
--text_descriptions UTD_descriptions/${dataset}_${split}_${concept}.pickle \
--output_path output/bias/${dataset}_${split}_${concept}_${temporal_aggregation_type}.csv

Follow DATASETS.md for instructions on preparing all 12 datasets.
We release the full code to generate textual descriptions (namely our UTD-descriptions) and to estimate different types of representation bias.
We use LLaVA-1.6-Mistral-7B to generate detailed frame-level descriptions (objects+composition+activities), which include objects, composition, relationships, and activities.
For example, to process the MSRVTT test split:
dataset='msrvtt'
split='test'
dataset_root='data/msrvtt/'
python utd/utd/vlm_descriptions_llava.py \
--dataset ${dataset} --split ${split} --dataset_root ${dataset_root} --metadata_root metadata/${dataset} \
--output_path output/UTD_descriptions/${dataset}_${split}_objects+composition+activities.pickle

👉 See commands for all datasets in scripts/1_run_vlm.sh.
💡 Tip: To efficiently process large splits (e.g., the kinetics_700 train split), you can use the --num_data_chunks and --chunk_id arguments to process only a specific chunk of the input data in each process and run multiple jobs in parallel, each handling a different portion of the data.
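As a rough illustration of what the captioning step does for a single frame, here is a minimal sketch using the Hugging Face transformers port of LLaVA-1.6-Mistral-7B. The repository's pipeline (utd/utd/vlm_descriptions_llava.py, built on third_party/LLaVA) uses its own frame sampling and prompts, so the model ID and prompt below are assumptions, not the exact ones used to produce UTD-descriptions.

```python
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # HF port; the repo uses third_party/LLaVA instead
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

frame = Image.open("frame.jpg")  # one of the 8 uniformly sampled video frames
prompt = "[INST] <image>\nDescribe the objects, their composition, and the activities in this image. [/INST]"
inputs = processor(images=frame, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```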
Use the LLM to extract individual concepts such as objects, activities, or verbs.
Step 1: Extract objects or activities:
dataset='msrvtt'
split='test'
concept="objects"
# or concept="activities"
python utd/utd/llm_extract_concepts.py \
--concept ${concept} \
--input_description_path output/UTD_descriptions/${dataset}_${split}_objects+composition+activities.pickle \
--output_path output/UTD_descriptions/${dataset}_${split}_${concept}.pickle \
--save_raw_llm_output_path output/UTD_descriptions/raw_${dataset}_${split}_${concept}.pickle

Step 2: Extract verbs from the raw (unparsed) activities output:
dataset='msrvtt'
split='test'
concept="verbs"
python utd/utd/llm_extract_concepts.py \
--concept ${concept} \
--input_description_path output/UTD_descriptions/raw_${dataset}_${split}_activities.pickle \
--output_path output/UTD_descriptions/${dataset}_${split}_${concept}.pickle

Step 3: Generate concise 15-word summaries of the objects+composition+activities descriptions, used for sequence-level reasoning in the objects+composition+activities setup:
dataset='msrvtt'
split='test'
concept='objects+composition+activities_15_words'
python utd/utd/llm_extract_concepts.py \
--concept ${concept} \
--input_description_path output/UTD_descriptions/${dataset}_${split}_objects+composition+activities.pickle \
--output_path output/UTD_descriptions/${dataset}_${split}_${concept}.pickle

👉 See commands for all datasets in scripts/2_run_llm.sh.
💡 Tip: To efficiently process large splits (e.g., the kinetics_700 train split), you can use the --num_data_chunks and --chunk_id arguments to process only a specific chunk of the input data in each process and run multiple jobs in parallel, each handling a different portion of the data.
We use SFR-Embedding-Mistral for reasoning over textual descriptions.
Available concepts: objects, activities, verbs, objects+composition+activities, objects+composition+activities_15_words
Aggregation types: middle_f, max_score_f, avg_over_f, seq_of_f
Example (MSRVTT, objects, seq_of_f):
Note: use the --retrieval flag only for text-to-video retrieval datasets.
dataset="msrvtt"
split="test"
concept="objects"
temporal_aggregation_type="seq_of_f"
python utd/utd/compute_common_sense_bias.py \
--retrieval \
--dataset ${dataset} --split ${split} \
--temporal_aggregation_type ${temporal_aggregation_type} \
--concept ${concept} \
--text_descriptions output/UTD_descriptions/${dataset}_${split}_${concept}.pickle \
--output_path output/bias/${dataset}_${split}_${concept}_${temporal_aggregation_type}.csv

👉 See commands for all datasets in scripts/3_common_sense_bias.sh.
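To make the idea concrete: common sense bias measures how well the task can be solved from the textual descriptions alone with an off-the-shelf text embedding model. Below is a simplified zero-shot sketch for the classification setting (embed descriptions and class names, pick the nearest class). It is not the exact scoring of compute_common_sense_bias.py, which also implements the temporal aggregation types above and the retrieval setting; the class names and example description here are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# SFR-Embedding-Mistral is a 7B model; any sentence-transformers model can be swapped in for a quick test.
model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")

class_names = ["archery", "baking cookies", "walking the dog"]       # placeholder label set
frame_descriptions = ["a bow, an arrow, a target on a grass field"]  # e.g. 'objects' for one video

class_emb = model.encode(class_names, normalize_embeddings=True)
desc_emb = model.encode(frame_descriptions, normalize_embeddings=True)

scores = util.cos_sim(desc_emb, class_emb)   # (num_videos, num_classes)
pred = scores.argmax(dim=1)
print(class_names[int(pred[0])])             # nearest class, predicted from text alone
```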
We compute dataset bias as the performance of a linear classifier trained on textual embeddings extracted from the training set.
Steps:
- Extract and save text embeddings
- Train a linear classifier on those embeddings
👉 See the full pipeline in scripts/4_dataset_bias.sh.
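Conceptually, this is a standard linear probe on text embeddings. A minimal sketch is below, assuming scikit-learn is installed; the actual pipeline in scripts/4_dataset_bias.sh has its own embedding extraction, splits, and training code, and the arrays here are random placeholders standing in for the embeddings saved in step 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_*: text embeddings of the video descriptions, y_*: action labels (placeholders).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 4096)), rng.integers(0, 101, size=1000)
X_test, y_test = rng.normal(size=(200, 4096)), rng.integers(0, 101, size=200)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Dataset (representation) bias: how much of the task a text-only linear probe can solve.
print("text-only linear-probe accuracy:", clf.score(X_test, y_test))
```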
The code for LLaVA inference is partly derived from LLaVA which is licensed under the Apache License 2.0.
The code in utd/videoglue is partly based on Unmasked Teacher which is licensed under the MIT License.
Licenses for third-party code in ./third_party/ are provided in each corresponding subfolder.
All other code is licensed under the MIT License. Full license terms are available in the LICENSE file.
If you use this code in your research, please cite:
@inproceedings{shvetsova2025utd,
title = {Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks},
author = {Shvetsova, Nina and Nagrani, Arsha and Schiele, Bernt and Kuehne, Hilde and Rupprecht, Christian},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2025}
}
