Mementos
A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
2 UNC-Chapel Hill, Chapel Hill
*Indicates Equal Advising
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs’ sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs’ sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations.
Leaderboard
Recall, Precision, and F1 scores of Object and Behavior on the validation set of Mementos.
| # | Model | Input type | Source | Date | Avg | Object-Recall | Object-Precision | Object-F1 | Behavior-Recall | Behavior-Precision | Behavior-F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4V 🥇 | Sequential | Link | 2024-01-20 | 45.68 | 60.24 | 54.13 | 55.36 | 42.36 | 29.40 | 32.58 |
| 2 | Gemini 🥈 | Sequential | Link | 2024-01-20 | 33.98 | 38.36 | 43.12 | 38.91 | 26.28 | 31.01 | 26.18 |
| 3 | LLaVA-1.5 🥉 | Combined | Link | 2024-01-20 | 32.78 | 36.90 | 46.14 | 39.29 | 22.09 | 29.22 | 23.01 |
| 4 | Chat-UniVi | Sequential | Link | 2024-01-20 | 31.69 | 39.09 | 38.26 | 37.06 | 25.36 | 26.67 | 23.74 |
| 5 | Gemini | Combined | Link | 2024-01-20 | 30.44 | 33.28 | 39.47 | 34.42 | 26.76 | 25.38 | 23.33 |
| 6 | GPT-4V | Combined | Link | 2024-01-20 | 30.13 | 35.41 | 36.34 | 34.46 | 30.70 | 20.82 | 23.07 |
| 7 | mPLUG_Owl-v2 | Combined | Link | 2024-01-20 | 28.26 | 28.51 | 40.65 | 32.20 | 19.74 | 27.81 | 20.64 |
| 8 | InstructBLIP | Combined | Link | 2024-01-20 | 27.10 | 27.37 | 33.86 | 28.77 | 23.98 | 25.69 | 22.92 |
| 9 | Chat-UniVi | Combined | Link | 2024-01-20 | 25.67 | 30.14 | 32.24 | 29.86 | 20.32 | 21.97 | 19.52 |
| 10 | Video-LLaMA-2 | Sequential | Link | 2024-01-20 | 21.13 | 25.59 | 23.50 | 23.35 | 16.21 | 21.47 | 16.62 |
| 11 | MiniGPT4 | Combined | Link | 2024-01-20 | 18.73 | 25.33 | 17.95 | 20.01 | 16.02 | 17.82 | 15.26 |
| 12 | MiniGPT5 | Combined | Link | 2024-01-20 | 18.28 | 24.58 | 17.69 | 19.44 | 15.04 | 17.93 | 15.02 |
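The Avg column appears to be the unweighted mean of the six object/behavior scores; the minimal sketch below (field names are our own, not from the benchmark code) reproduces the GPT-4V (Sequential) row.

```python
# Minimal sketch: Avg is consistent with the unweighted mean of the six
# object/behavior metrics. Dictionary keys here are assumptions.
def leaderboard_avg(scores: dict) -> float:
    keys = [
        "object_recall", "object_precision", "object_f1",
        "behavior_recall", "behavior_precision", "behavior_f1",
    ]
    return sum(scores[k] for k in keys) / len(keys)

gpt4v_sequential = {
    "object_recall": 60.24, "object_precision": 54.13, "object_f1": 55.36,
    "behavior_recall": 42.36, "behavior_precision": 29.40, "behavior_f1": 32.58,
}
print(round(leaderboard_avg(gpt4v_sequential), 2))  # 45.68, matching the table
```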
💡 Sequential means the frames of an image sequence are fed to the model one by one, in order, for reasoning.
💡 Combined means all frames of an image sequence are merged into one composite image that serves as the MLLM input (see the sketch below).
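For illustration, here is a minimal sketch of the Combined setting, tiling all frames of a sequence into one composite image with Pillow. The grid layout, resizing, and file paths are assumptions for the example, not the exact preprocessing used by the benchmark.

```python
import math
from PIL import Image

def combine_frames(frame_paths, cols=None):
    """Tile frames left-to-right, top-to-bottom into one composite image.

    Sketch only: the actual grid layout used for the 'Combined' input
    type may differ.
    """
    frames = [Image.open(p).convert("RGB") for p in frame_paths]
    w, h = frames[0].size
    cols = cols or math.ceil(math.sqrt(len(frames)))
    rows = math.ceil(len(frames) / cols)
    canvas = Image.new("RGB", (cols * w, rows * h), "white")
    for i, frame in enumerate(frames):
        canvas.paste(frame.resize((w, h)), ((i % cols) * w, (i // cols) * h))
    return canvas

# Example with hypothetical frame paths:
# combine_frames(["seq/frame_0.png", "seq/frame_1.png", "seq/frame_2.png"]).save("combined.png")
```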
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
🚨 For more submission details, please refer to Evaluation.
Mementos Dataset
Overview
Mementos is a comprehensive benchmark designed to evaluate the reasoning capability of Multimodal Large Language Models (MLLMs) over image sequences.
It includes 4,761 image sequences of varying lengths. The image sequences in Mementos are categorized into three domains: Daily-life, Robotics, and Comics.
This diverse collection is crucial for evaluating the comprehensive time-varying reasoning abilities of MLLMs.
In particular, the robotics data, which is closely tied to embodied AI and real-world settings, and the comic-style storyboard data, which is rich in stylistic and episodic diversity, significantly enhance the benchmark's relevance and robustness.
Examples of hallucinations by GPT-4V in the three domains of Mementos: Daily-life, Robotics, and Comics.
Detailed evaluation results of different MLLMs on the different domains of Mementos.
Dataset Statistics
All the data are divided into training and validation sets.
- training: 4,062 image sequences used for MLLM training or finetuning.
- validation: 699 image sequences for evaluation.
Number of image sequences in each domain of the dataset.
Distribution of image sequence length in the Mementos validation set.
Distribution of episode length in the Mementos validation set.
GPT-4-assisted Evaluation
We employ a GPT-4-assisted evaluation procedure: after an MLLM produces a description for an image sequence, we extract behavior and object keywords from both AI-generated and human-annotated descriptions using GPT-4, then use keyword matching to assess the degree of behavioral and object hallucinations.
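A minimal sketch of the keyword-matching step is shown below, assuming the GPT-4 extraction stage has already produced keyword sets for the model description and the human annotation. Exact matching rules and any synonym normalization in the real pipeline may differ.

```python
def keyword_prf(predicted: set, reference: set) -> dict:
    """Recall/precision/F1 over extracted object or behavior keywords.

    predicted: keywords extracted from the MLLM's description.
    reference: keywords extracted from the human-annotated description.
    Sketch only; the actual pipeline may normalize or match synonyms.
    """
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Example with hypothetical object keywords:
# keyword_prf({"dog", "ball", "park"}, {"dog", "ball", "leash"})
# -> recall 0.67, precision 0.67, f1 0.67
```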
Evaluation Results on Existing MLLMs
GPT-4V with sequential input demonstrates the best image-sequence reasoning capability among all evaluated MLLMs. Among open-source models, LLaVA-1.5 performs best, nearly matching or even surpassing the black-box model Gemini in object comprehension, but its ability to infer behaviors from image sequences is weaker than that of Gemini and GPT-4V.
Citation
If you find our work useful, please consider citing the paper as follows:
@misc{wang2024mementos,
  title={Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences},
  author={Xiyao Wang and Yuhang Zhou and Xiaoyu Liu and Hongjin Lu and Yuancheng Xu and Feihong He and Jaehong Yoon and Taixi Lu and Gedas Bertasius and Mohit Bansal and Huaxiu Yao and Furong Huang},
  year={2024},
  eprint={2401.10529},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}