ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models (ICLR 2024)
Can video-language models process spatio-temporal events sufficiently?
ViLMA (Video Language Model Assessment) is a comprehensive benchmark for Video-Language Models (VidLMs) that evaluates their linguistic and temporal grounding capabilities across five dimensions: action counting, situation awareness, change of state, rare actions, and spatial relations.
ViLMA also uses a two-stage evaluation procedure: (i) a proficiency test (P) that assesses fundamental capabilities deemed essential before attempting the five main tests, and (ii) a main test (T) that evaluates the model on the five dimensions above, together with a combined score (P+T) over both stages.
Paper
For more details about the benchmark and experiments, please read our paper. If you find ViLMA beneficial for your research, please cite it:

@inproceedings{kesen2023vilma,
  title={ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models},
  author={Ilker Kesen and Andrea Pedrotti and Mustafa Dogan and Michele Cafagna and Emre Can Acikgoz and Letitia Parcalabescu and Iacer Calixto and Anette Frank and Albert Gatt and Aykut Erdem and Erkut Erdem},
  year={2024},
  booktitle={International Conference on Learning Representations (ICLR)},
}

ViLMA Leaderboard
We aim to maintain an up-to-date leaderboard for the ViLMA benchmark. To make a submission, please either send an email to Ilker Kesen or open an issue. For simplicity, the leaderboard reports only the combined setting (P+T) results using the pairwise accuracy metric (accr). Icons mark text-only models, image-language models, and video-language models, respectively.

Authors
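The scoring above can be sketched in a few lines. This is a minimal illustration, not the official evaluation code: it assumes the model assigns each video a matching score for the true caption and for its foil, that pairwise accuracy counts an example as correct when the caption outscores the foil, and that the combined P+T setting additionally requires the proficiency test to be passed. All names are illustrative.

```python
def pairwise_accuracy(caption_scores, foil_scores):
    """Fraction of examples where the true caption outscores its foil."""
    assert len(caption_scores) == len(foil_scores)
    correct = sum(c > f for c, f in zip(caption_scores, foil_scores))
    return correct / len(caption_scores)

def combined_accuracy(proficiency_passed, caption_scores, foil_scores):
    """P+T: an example counts only if the proficiency test is also passed."""
    correct = sum(p and c > f
                  for p, c, f in zip(proficiency_passed, caption_scores, foil_scores))
    return correct / len(caption_scores)

caps = [0.9, 0.4, 0.7]   # toy model scores for the true captions
foils = [0.5, 0.6, 0.2]  # toy model scores for the foils
print(pairwise_accuracy(caps, foils))                        # 2 of 3 pairs correct
print(combined_accuracy([True, True, False], caps, foils))   # only 1 also passes P
```

Under this assumed protocol, the combined score can only be lower than or equal to the main-test score, which is why P+T is the stricter setting reported on the leaderboard.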
Ilker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
Letiția Pârcălăbescu
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem