Seeking and Updating with Live Visual Knowledge
University of Washington · Zhejiang University · University of Illinois Chicago
- A Novel Dataset. We introduce LIVEVQA, the first dataset of its kind, featuring 107,143 samples across 12 categories, specifically designed to test how MLLMs handle visual information beyond their training data cutoff and how they can be updated with new knowledge.
- Comprehensive Benchmarking and Analysis. We conducted extensive benchmarking of 17 state-of-the-art MLLMs, revealing significant performance gaps on content beyond their knowledge cutoff. Our findings show that tool-use or agentic visual seeking frameworks can drastically improve performance by an average of 327%.
- Efficient Knowledge Updating Insights. We explored parameter-efficient fine-tuning (PEFT) methods, demonstrating that MLLMs can be efficiently updated with new visual knowledge within a single epoch. While such updating can degrade general visual perception, it can also enhance knowledge-intensive capabilities, and we provide insights into balancing adapter capacity and model capability.
LIVEVQA Dataset Construction
- Raw Data Collection from Diverse Sources: This stage collects recent visual and textual data from three sources (a minimal sketch of the keyframe-deduplication step follows this list):
  - News articles: URL and headline filtering, image selection based on size and relevance (with GPT-4.1 used to ensure a strong correlation with the reported event), and semantic deduplication.
  - YouTube videos: preprocessing (English only, at most 10 minutes, with subtitles), subtitle-based segmentation using an LLM, initial keyframe identification (using UVD plus perceptual hashing for deduplication), and LLM-driven selection of the top-K relevant keyframes.
  - Academic papers (arXiv): extraction of titles, abstracts, authors, images, and captions, followed by key-image selection that prioritizes architectural diagrams and key findings while avoiding common visualizations.
- Visual Question Answering (VQA) Generation and Filtering: This stage constructs two levels of questions:
  - Level 1 questions target basic visual entity recognition (e.g., location, person, time) based on the filtered images and metadata; GPT-4.1 filters out unqualified QAs (e.g., those with overly brief answers or simple labels).
  - Level 2 questions are more complex, requiring multi-hop cross-modal reasoning over the full image context and related textual information, and cover seven types (location, person, organization, time, event, count, reason); these are also generated and filtered by GPT-4.1 to ensure answer verifiability.
  - All LLM/MLLM-assisted processes undergo human validation with a high pass rate.
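The keyframe deduplication mentioned above combines UVD with perceptual hashing. As a rough illustration of the hashing half, the sketch below drops near-duplicate frames whose perceptual hashes are within a small Hamming distance of an already-kept frame; the libraries and threshold are our own choices, not the exact LIVEVQA implementation.

```python
# Minimal sketch of perceptual-hash deduplication for video keyframes.
# Assumes the Pillow and imagehash packages; the distance threshold is
# illustrative and not the value used to build LIVEVQA.
from pathlib import Path

import imagehash
from PIL import Image


def deduplicate_keyframes(frame_paths, max_distance=5):
    """Keep a frame only if its pHash is far from every frame kept so far."""
    kept, kept_hashes = [], []
    for path in frame_paths:
        h = imagehash.phash(Image.open(path))
        if all(h - other > max_distance for other in kept_hashes):
            kept.append(path)
            kept_hashes.append(h)
    return kept


if __name__ == "__main__":
    frames = sorted(Path("keyframes").glob("*.jpg"))  # hypothetical directory
    unique = deduplicate_keyframes(frames)
    print(f"kept {len(unique)} of {len(frames)} keyframes")
```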
Data Filtering
The diagram above illustrates the filtering process used to construct LIVEVQA. Raw images and synthesized question-answer pairs are refined across three source pipelines: YouTube videos, arXiv academic papers, and news articles (from outlets such as Forbes, Variety, CNN, BBC, and the Associated Press). Each pipeline begins with a large corpus of "Raw Images" (roughly 829K from YouTube, 180K from arXiv, and 19K from news) and applies a series of filtering stages: "Key Frame Filters" for video content, "Irrelevant Image Filters" to remove non-pertinent visuals, and "Choose the Most Representative" to keep the most informative images. Further refinement comes from the "Level-1 QAs Filters" and "Level-2 QAs Filters", followed by an "AI Judge & Filter QAs" step. This multi-stage filtering sharply reduces the data volume, so that only high-quality, relevant "Meta Images" and their associated reasoning questions reach the final dataset (roughly 12K images from YouTube, 9K from arXiv, and 8K from news).
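The "AI Judge & Filter QAs" step can be pictured as an LLM call that accepts or rejects each candidate QA pair based on answer verifiability. The sketch below is our own illustration using the OpenAI Python client; the actual judging prompt and acceptance criteria used for LIVEVQA may differ.

```python
# Illustrative LLM-judge filter for candidate QA pairs (not the exact LIVEVQA prompt).
# Assumes the openai v1 Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are reviewing a visual question-answer pair for a benchmark.\n"
    "Reject it if the answer is overly brief, a trivial label, or not verifiable "
    "from the accompanying article text. Reply with exactly ACCEPT or REJECT.\n\n"
    "Question: {q}\nAnswer: {a}\nArticle context: {ctx}"
)


def judge_qa(question: str, answer: str, context: str) -> bool:
    """Return True if the judge model accepts the QA pair."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=question, a=answer, ctx=context)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("ACCEPT")
```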
Benchmark Display
Example 1: News
Source: CNN Sport
Based on the provided image, what is the specific location where this celebration is taking place?
- A. Augusta National Golf Club
- B. Pebble Beach Golf Links
- C. St Andrews Links
- D. Torrey Pines Golf Course
- E. Pinehurst No. 2
What is the reason the camera operator visible in the green tower above the crowd was able to capture the critical moment shown in the image?
- A. replay producer signaled to hold the shot
- B. technical director delayed switching as ordered
- C. director anticipated the shot outcome
- D. head coach instructed to stay on main feed
Example 2: News
Source: BBC
Based on the provided image, what is the specific location shown?
- A. Trafalgar Square
- B. Hyde Park Corner
- C. Parliament Square
- D. Leicester Square
- E. Piccadilly Circus
Why did the Home Secretary announce the extension of criminal protection to the monument prominently shown in the image?
- A. public demand for more statues
- B. banner slogan matching protest motto
- C. country celebrating VE Day
- D. Parliament Square redevelopment approval
Example 3: Video
Source: YouTube
Based on the provided image, what event is taking place?
- A. Mumbai Tech Startup Expo 2024
- B. MasterSoft Hosts Higher Education Leaders Conclave
- C. Maharashtra Digital Learning Conference
- D. National Policy on Education Review Summit
- E. Nagpur Academic Technology Symposium
How many years was the association mentioned by the principal who took over in 2009 before adopting the solution described by the man at the blue podium?
- A. 2 years
- B. 3 years
- C. 14 years
- D. 4 years
Example 4: Video
Source: YouTube
Based on the provided image, what event is taking place?
- A. Venice Film Festival
- B. Cannes Film Festival
- C. Sundance Film Festival
- D. Berlin International Film Festival
- E. Toronto International Film Festival
What is the name of the event at which the audience shown in the image is present?
- A. Academy Awards Ceremony
- B. Venice Film Festival
- C. Cannes Film Festival
- D. Berlin International Film Festival
Example 5: Academic Paper
Source: arXiv
Who is the primary author of the paper shown here?
- A. R. J. Smethurst
- B. Hugh Dickinson
- C. L. F. Fortson
- D. Tobias Géron
- E. Izzy L. Garland
In this paper, for the sample of 6,640 galaxies that remained after the deduplication process, and based on the described classification scheme (where a galaxy is unbarred if p_strong_bar + p_weak_bar < 0.5), how many galaxies were ultimately classified as unbarred?
- A. 311
- B. 161
- C. 6640
- D. 6479
- E. 398
Example 6: Academic Paper
Source: arXiv
Who conducted the research presented in this image?
- A. Yupeng Zhang
- B. Mridul Sharma
- C. Prajwal Thapa
- D. Jinu Nyachhyon
- E. Yagya Raj Pandeya
In this paper, what is the precise count of distinct, pre-trained architectural frameworks that the researchers explicitly selected, then uniformly adapted at their terminal processing stage for the 60-class herb identification problem, and subsequently benchmarked against one another?
- A. 1
- B. 5
- C. 6
- D. 60
- E. 121
LIVEVQA Dataset Statistics
| Category | Images | #Questions | Level 1 | Level 2 | Avg. Ctx. Len. | Purpose |
|---|---|---|---|---|---|---|
| News Article | 7,579 | 38,809 | 7,579 | 31,230 | 749 | - |
| YouTube Videos | 11,948 | 43,168 | 11,948 | 31,220 | 311 | - |
| Academic Paper | 8,961 | 25,166 | 9,456 | 16,205 | 597 | - |
| Avg. per Sample | 1 | 3.86 | 1 | 2.86 | 517 | - |
| Test Split | 1,500 | 3,000 | 1,500 | 1,500 | 544 | Exp. 1 |
| Training Split | 26,988 | 104,143 | 26,988 | 77,150 | 496 | Exp. 2 |
Figure 4: (Left) Image size distribution in YouTube image filtering pipeline. (Right) Textual context length distribution for each question.
Benchmark Results for LIVEVQA
| Model | Cutoff | Level 1: News | Level 1: Video | Level 1: arXiv | Level 1: Avg. | Level 2: News | Level 2: Video | Level 2: arXiv | Level 2: Avg. |
|---|---|---|---|---|---|---|---|---|---|
| w.o. Search | |||||||||
| GPT-4.1 | Jun. 2024 | 27.0 | 22.0 | 0.4 | 16.5 | 5.2 | 7.2 | 0.2 | 3.0 |
| GPT-4.1-mini | Jun. 2024 | 24.6 | 19.6 | 0.2 | 14.8 | 4.0 | 7.8 | 0.4 | 4.0 |
| GPT-4.1-nano | Jun. 2024 | 13.0 | 13.0 | 0.0 | 8.6 | 2.2 | 6.0 | 0.4 | 2.9 |
| Gemini-2.5-Flash | Jan. 2025 | 25.8 | 18.4 | 0.8 | 15.0 | 4.6 | 4.4 | 4.0 | 4.3 |
| Gemini-2.5-Pro | Jan. 2025 | 28.0 | 17.4 | 0.6 | 15.3 | 4.4 | 2.4 | 1.2 | 2.7 |
| Gemma-3-27B-It | Aug. 2024 | 21.0 | 16.4 | 1.0 | 12.8 | 3.8 | 4.6 | 6.2 | 4.9 |
| Claude-3.7-Sonnet | Oct. 2024 | 26.2 | 16.4 | 0.6 | 14.3 | 2.2 | 4.4 | 4.4 | 3.7 |
| Qwen-2.5-VL-7B-Instruct | Unknown | 20.2 | 13.4 | 0.2 | 11.3 | 3.8 | 5.4 | 2.0 | 3.7 |
| Qwen-2.5-VL-32B-Instruct | Unknown | 25.2 | 16.4 | 0.4 | 14.0 | 4.2 | 5.6 | 1.2 | 3.7 |
| Qwen-2.5-VL-72B-Instruct | Unknown | 12.4 | 9.4 | 0.0 | 7.3 | 1.4 | 3.6 | 3.6 | 2.9 |
| Llama-4-Scout | Aug. 2024 | 20.6 | 16.4 | 0.0 | 12.1 | 4.0 | 5.0 | 2.8 | 3.9 |
| Llama-4-Maverick | Aug. 2024 | 20.2 | 19.0 | 0.6 | 13.3 | 5.8 | 6.0 | 5.2 | 5.7 |
| w. Text Search | |||||||||
| GPT-4.1 | Jun. 2024 | 25.0 | 21.4 | 0.6 | 15.6 | 3.6 | 5.6 | 3.8 | 4.3 |
| Gemini-2.5-Pro | Jan. 2025 | 17.6 | 9.2 | 0.2 | 9.0 | 2.0 | 1.6 | 1.0 | 1.5 |
| Claude-3.7-Sonnet | Oct. 2024 | 24.6 | 16.6 | 0.0 | 13.7 | 2.0 | 3.6 | 4.8 | 3.5 |
| w. Native Image Search | |||||||||
| GPT-o3 | Jun. 2024 | 33.6 | 33.6 | 2.6 | 23.3 | 14.6 | 14.9 | 17.8 | 15.8 |
| w. MM-Search [Jiang et al., 2024] | |||||||||
| GPT-4.1 | Jun. 2024 | 42.0 | 36.1 | 22.0 | 33.4 | 27.2 | 15.2 | 48.8 | 30.4 |
Per-question-type accuracy on the News subset (L1 = Level 1, L2 = Level 2):

| Model | L1 Loc. | L1 Per. | L1 Org. | L1 Eve. | L1 Obj. | L1 Avg. | L2 Loc. | L2 Per. | L2 Org. | L2 Time | L2 Cou. | L2 Rea. | L2 Eve. | L2 Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w.o. Search | ||||||||||||||
| GPT-4.1 | 50.72 | 15.19 | 35.89 | 27.03 | 6.28 | 28.81 | 0.00 | 1.75 | 11.68 | 3.82 | 7.84 | 1.63 | 0.00 | 5.05 |
| GPT-4.1-mini | 33.33 | 10.91 | 45.59 | 11.86 | 19.23 | 24.60 | 0.00 | 3.57 | 8.82 | 0.00 | 10.24 | 0.00 | 0.00 | 4.00 |
| GPT-4.1-Nano | 16.16 | 3.64 | 30.88 | 3.39 | 13.00 | 13.00 | 0.00 | 0.00 | 4.41 | 1.54 | 3.94 | 0.83 | 0.00 | 2.20 |
| Gemini-2.5-Flash | 26.26 | 37.27 | 35.29 | 7.63 | 25.80 | 25.80 | 0.00 | 3.57 | 1.47 | 3.85 | 8.66 | 4.17 | 0.00 | 4.60 |
| Gemini-2.5-Pro | 23.23 | 46.36 | 35.29 | 10.17 | 28.00 | 28.00 | 3.57 | 0.00 | 5.88 | 3.08 | 3.94 | 6.67 | 0.00 | 4.40 |
| Gemma-3-27B-IT | 24.24 | 15.45 | 38.24 | 8.47 | 21.00 | 21.00 | 3.57 | 0.00 | 8.82 | 1.54 | 7.87 | 0.00 | 0.00 | 3.80 |
| Claude-3.7-Sonnet | 26.20 | 38.38 | 10.00 | 14.41 | 26.20 | 26.20 | 0.00 | 0.00 | 4.41 | 2.31 | 1.57 | 2.50 | 0.00 | 2.20 |
| Qwen-2.5-VL-7B | 23.23 | 21.15 | 30.88 | 12.71 | 20.20 | 20.20 | 0.00 | 0.00 | 4.41 | 1.54 | 7.09 | 4.17 | 0.00 | 3.80 |
| Qwen-2.5-VL-32B | 33.33 | 18.18 | 30.88 | 18.64 | 25.20 | 25.20 | 0.00 | 0.00 | 7.35 | 2.31 | 6.30 | 4.17 | 0.00 | 4.20 |
| Qwen-2.5-VL-72B | 12.50 | 6.36 | 15.15 | 8.47 | 12.40 | 12.40 | 0.00 | 0.00 | 4.41 | 0.77 | 1.57 | 0.83 | 0.00 | 1.40 |
| Llama-4-Scout | 26.26 | 13.64 | 35.29 | 8.47 | 20.60 | 20.60 | 3.57 | 0.00 | 4.41 | 3.08 | 9.45 | 0.00 | 0.00 | 4.00 |
| Llama-4-Maverick | 20.20 | 19.09 | 36.76 | 5.93 | 20.20 | 20.20 | 0.00 | 0.00 | 10.29 | 2.31 | 13.39 | 1.67 | 0.00 | 5.80 |
| w. Text Search | ||||||||||||||
| GPT-4.1 | 34.62 | 13.56 | 48.53 | 2.73 | 25.00 | 25.00 | 5.88 | 3.57 | 5.88 | 3.85 | 4.72 | 0.83 | 0.00 | 3.60 |
| Gemini-2.5-Pro | 18.18 | 10.17 | 29.41 | 12.73 | 17.60 | 17.60 | 0.00 | 3.57 | 4.41 | 1.54 | 2.36 | 1.67 | 0.00 | 2.00 |
| Claude-3.7-Sonnet | 23.08 | 18.64 | 40.38 | 6.36 | 24.60 | 24.60 | 0.00 | 5.88 | 1.47 | 1.54 | 3.15 | 0.83 | 0.00 | 2.00 |
| w. Native Image Search | ||||||||||||||
| GPT-o3 | 47.47 | 23.73 | 57.35 | 47.12 | 33.60 | 33.60 | 0.00 | 17.86 | 20.59 | 7.69 | 17.32 | 17.50 | 10.00 | 14.60 |
| w. MM-Search [Jiang et al., 2024] | ||||||||||||||
| GPT-4.1 | 50.00 | 35.78 | 55.88 | 42.86 | 42.00 | 42.00 | 15.50 | 23.53 | 30.88 | 42.52 | 20.00 | 46.43 | 0.00 | 27.20 |
Per-question-type accuracy on the Video subset (L1 = Level 1, L2 = Level 2):

| Model | L1 Loc. | L1 Per. | L1 Org. | L1 Eve. | L1 Obj. | L1 Avg. | L2 Loc. | L2 Per. | L2 Org. | L2 Time | L2 Cou. | L2 Rea. | L2 Eve. | L2 Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w.o. Search | ||||||||||||||
| GPT-4.1 | 26.58 | 8.33 | 40.85 | 7.77 | 32.23 | 22.00 | 8.51 | 3.45 | 5.56 | 6.32 | 11.20 | 5.65 | 4.55 | 7.20 |
| GPT-4.1-mini | 21.52 | 13.54 | 30.99 | 4.85 | 30.58 | 19.60 | 2.13 | 3.45 | 12.96 | 6.32 | 15.20 | 3.23 | 4.55 | 7.80 |
| GPT-4.1-nano | 15.19 | 1.04 | 28.17 | 4.85 | 19.01 | 13.00 | 0.00 | 0.00 | 5.56 | 6.32 | 14.40 | 2.42 | 0.00 | 6.00 |
| Gemini-2.5-Flash | 18.99 | 27.08 | 29.58 | 4.85 | 18.18 | 18.40 | 0.00 | 3.45 | 1.85 | 4.21 | 11.20 | 0.81 | 4.55 | 4.40 |
| Gemini-2.5-Pro | 8.86 | 25.00 | 32.39 | 6.80 | 19.01 | 17.40 | 0.00 | 0.00 | 1.85 | 2.11 | 5.60 | 1.61 | 0.00 | 2.40 |
| Gemma-3-27B-IT | 13.92 | 14.58 | 33.80 | 3.88 | 21.49 | 16.40 | 0.00 | 0.00 | 5.56 | 4.21 | 10.40 | 1.61 | 4.55 | 4.60 |
| Claude-3.7-Sonnet | 18.99 | 7.29 | 29.58 | 6.80 | 23.97 | 16.40 | 2.13 | 0.00 | 1.85 | 4.21 | 7.20 | 4.84 | 4.55 | 4.40 |
| Qwen-2.5-VL-7B | 12.66 | 10.42 | 25.35 | 4.85 | 16.53 | 13.40 | 2.13 | 0.00 | 5.56 | 3.16 | 14.40 | 1.61 | 0.00 | 5.40 |
| Qwen-2.5-VL-32B | 16.46 | 10.42 | 32.39 | 4.85 | 22.31 | 16.40 | 0.00 | 0.00 | 5.56 | 6.32 | 9.60 | 4.84 | 4.55 | 5.60 |
| Qwen-2.5-VL-72B | 10.13 | 3.12 | 18.31 | 1.94 | 14.88 | 9.40 | 0.00 | 0.00 | 7.41 | 3.16 | 5.60 | 2.42 | 4.55 | 3.60 |
| Llama-4-Scout | 16.46 | 13.54 | 26.76 | 7.77 | 20.66 | 16.40 | 2.13 | 0.00 | 7.41 | 4.21 | 10.40 | 1.61 | 4.55 | 5.00 |
| Llama-4-Maverick | 18.99 | 14.58 | 38.03 | 8.74 | 20.66 | 19.00 | 2.13 | 3.45 | 3.70 | 4.21 | 15.20 | 2.42 | 0.00 | 6.00 |
| w. Text Search | ||||||||||||||
| GPT-4.1 | 13.92 | 6.25 | 30.05 | 3.56 | 22.59 | 14.60 | 2.84 | 0.00 | 3.09 | 3.86 | 6.67 | 2.42 | 3.03 | 3.73 |
| Gemini-2.5-Pro | 1.69 | 1.39 | 19.72 | 2.91 | 8.54 | 6.53 | 0.00 | 0.00 | 0.62 | 1.40 | 3.20 | 0.00 | 1.52 | 1.20 |
| Claude-3.7-Sonnet | 8.02 | 4.17 | 14.55 | 2.59 | 12.95 | 8.33 | 1.42 | 0.00 | 1.23 | 1.40 | 3.73 | 0.54 | 0.00 | 1.60 |
| w. Native Image Search | ||||||||||||||
| GPT-o3 | 37.97 | 19.79 | 43.66 | 22.33 | 46.28 | 33.60 | 8.51 | 10.34 | 12.96 | 11.58 | 29.60 | 25.00 | 18.18 | 19.40 |
| w. MM-Search [Jiang et al., 2024] | ||||||||||||||
| GPT-4.1 | 29.11 | 31.58 | 49.30 | 21.36 | 38.84 | 33.00 | 13.68 | 17.02 | 10.34 | 11.11 | 26.40 | 9.68 | 4.55 | 15.20 |
Empirical Results from LIVEVQA
MLLMs Face Challenges with "Live" Visual Knowledge; Multimodal Search is Key
Our comprehensive benchmarking of 17 state-of-the-art Multimodal Large Language Models (MLLMs) on the LIVEVQA dataset revealed significant difficulties in handling visual information beyond their knowledge cutoff dates. For instance, even top-performing models showed low accuracy on recent visual content when operating without external tools.
However, the integration of multimodal search capabilities leads to dramatic improvements.
- Models augmented with multimodal search tools (e.g., GPT-4.1 with MM-Search) demonstrated an average accuracy increase of 327% in seeking live visual knowledge. Specifically, GPT-4.1's average accuracy more than doubled from 16.5% to 33.4% when using MM-Search, with particularly striking gains on challenging Level 2 questions (e.g., accuracy on News subset Level 2 rose from 5.2% to 27.2%).
- Native image search capabilities, as seen in models like GPT-o3, also provided substantial gains (e.g., average Level 2 accuracy of 15.8%, versus 3.0% for GPT-4.1 without search). In contrast, simple text-based online search did not yield significant improvements, underscoring the necessity of multimodal retrieval for dynamic visual information; a schematic of such a retrieve-then-answer loop is sketched below.
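The contrast between text-only and multimodal search suggests a simple retrieve-then-answer loop: look up pages that match the image first, then let the MLLM answer with those snippets in context. The sketch below is schematic only; `reverse_image_search` and `answer_with_context` are hypothetical placeholders, not APIs from MM-Search or any specific framework.

```python
# Schematic tool-use loop for answering "live" visual questions.
# Both helper functions are hypothetical placeholders: one stands in for an
# image-search backend, the other for a retrieval-augmented MLLM call.
from dataclasses import dataclass


@dataclass
class Snippet:
    url: str
    text: str


def reverse_image_search(image_path: str, top_k: int = 5) -> list[Snippet]:
    raise NotImplementedError("plug in an image-search backend here")


def answer_with_context(image_path: str, question: str, snippets: list[Snippet]) -> str:
    raise NotImplementedError("plug in an MLLM call with the snippets in its prompt")


def seek_and_answer(image_path: str, question: str) -> str:
    """Retrieve pages about the image first, then answer grounded in them."""
    snippets = reverse_image_search(image_path)
    return answer_with_context(image_path, question, snippets)
```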
Efficiently Updating MLLMs with New Visual Knowledge via PEFT
We explored updating MLLMs with new visual knowledge using parameter-efficient fine-tuning (PEFT) methods such as LoRA and DoRA; a minimal adapter-configuration sketch follows the list below.
- Rapid Adaptation: Visual knowledge can be updated efficiently with fine-tuning for only a single epoch. Training on direct multiple-choice questions with concise answers (the MCQA format) yielded faster and more effective learning during the visual knowledge acquisition phase than formats such as QA (question + ground truth) or QAR (question + ground truth + reasoning).
- LoRA Rank Impact: Higher rank LoRA configurations consistently enhanced visual knowledge capabilities, particularly in assimilating recent visual entities. Models with higher ranks outperformed lower-rank counterparts by an average of 5.4% on the validation subset.
- Benefit to General Reasoning: Training on the visually knowledge-intensive LIVEVQA dataset—particularly with straightforward answers and multiple-choice questions—led to a notable 4.2% improvement on the general multimodal reasoning benchmark MMMU.
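For concreteness, attaching a LoRA (or DoRA) adapter with the Hugging Face peft library might look like the sketch below; the base model, rank, alpha, and target modules are illustrative choices on our part, not the exact settings behind the numbers above.

```python
# Minimal PEFT sketch: wrap a vision-language model with a LoRA/DoRA adapter.
# Assumes recent transformers and peft versions; hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=64,                       # higher ranks aided knowledge uptake in our experiments
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=False,             # flip to True to use DoRA instead of LoRA
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only adapter weights are trainable
```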
Knowledge Updating Presents Trade-offs: Enhanced Reasoning vs. Degraded Perception
While PEFT methods allow for efficient incorporation of new visual facts, this process is not without its challenges and trade-offs.
A consistent observation was the degradation in the model's foundational visual perception capabilities (as measured by the MMStar benchmark) after undergoing intensive visual knowledge updates, regardless of rank, training steps, or data formats. For example, models trained using the simple QA format exhibited a performance drop on MMStar from 65.80% to 58.16%. This suggests an inherent conflict between enhancing specific visual knowledge through intensive updates and preserving the model's broader visual perception abilities.
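One way to make this trade-off concrete is to score the same checkpoint on a held-out LIVEVQA split (new knowledge) and on a perception benchmark such as MMStar before and after the update. The sketch below assumes a hypothetical `evaluate_accuracy` helper standing in for whatever evaluation harness is used.

```python
# Sketch for tracking the knowledge-vs-perception trade-off of an update.
# `evaluate_accuracy` is a hypothetical placeholder for an evaluation harness.
def evaluate_accuracy(model, benchmark: str) -> float:
    raise NotImplementedError("plug in your evaluation harness here")


def report_tradeoff(base_model, updated_model, benchmarks=("livevqa_val", "mmstar")):
    """Print before/after accuracy so gains and regressions are visible side by side."""
    for benchmark in benchmarks:
        before = evaluate_accuracy(base_model, benchmark)
        after = evaluate_accuracy(updated_model, benchmark)
        print(f"{benchmark}: {before:.2%} -> {after:.2%} ({after - before:+.2%})")
```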
Model Scale Correlates with Performance, but Calibration Remains a Challenge
The benchmark results highlighted several aspects regarding model characteristics:
- Larger Models Tend to Perform Better: For models sharing the same knowledge cutoff (e.g., the GPT-4.1 family), increased model size generally correlated with improved accuracy on LIVEVQA tasks across all difficulty levels. Proprietary models also typically maintained an advantage over open-source counterparts.
- Overconfidence and Calibration Issues: Stated confidence correlated positively with accuracy across models, but calibration was poor. All evaluated MLLMs were consistently overconfident in their visual factuality assessments, with performance falling well below the ideal calibration line. While larger models such as GPT-4.1 were comparatively better calibrated than their smaller variants, substantial room remains for improving MLLM calibration on unknown visual knowledge (a minimal reliability-diagram sketch follows this list).
- Level 2 Questions Prove More Difficult: As anticipated, Level 2 questions, which require deeper cross-modal reasoning, generally resulted in significantly lower performance for models compared to Level 1 (visual entity recognition) questions across most data subsets (News, Video).
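The "ideal calibration line" mentioned above can be made concrete with a reliability diagram: bin questions by the model's stated confidence and compare each bin's mean confidence with its empirical accuracy; a perfectly calibrated model sits on the diagonal. A minimal NumPy sketch (our own, not the paper's evaluation code) is below.

```python
# Minimal reliability-diagram / ECE sketch for stated-confidence calibration.
# Inputs: per-question confidences in [0, 1] and 0/1 correctness indicators.
import numpy as np


def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index per sample; a confidence of exactly 1.0 falls into the top bin.
    idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return rows


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy; 0 means perfect calibration."""
    n = len(confidences)
    return sum(count / n * abs(acc - conf)
               for conf, acc, count in reliability_bins(confidences, correct, n_bins))
```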
BibTeX
@article{fu2025livevqa,
title={LiveVQA: Live Visual Knowledge Seeking},
author={Fu, Mingyang and Peng, Yuyang and Liu, Benlin and Wan, Yao and Chen, Dongping},
journal={arXiv preprint arXiv:2504.05288},
year={2025}
}
LIVEVQA Team