GUI-World: A Dataset for GUI-Oriented Multimodal Large Language Models
2University of Notre Dame, 3Microsoft Research, 4Lehigh University
- A Dataset. We propose GUI-WORLD, a comprehensive GUI dataset comprising over 12,000 videos specifically designed to assess and improve the GUI understanding capabilities of MLLMs, spanning a range of categories and scenarios, including desktop, mobile, and extended reality (XR), and representing the first GUI-oriented instruction-tuning dataset in the video domain.
- A Novel Model. Based on GUI-WORLD, we propose GUI-Vid, a GUI-oriented VideoLLM with enhanced capabilities to handle diverse and complex GUI tasks. GUI-Vid shows a significant improvement on the benchmark and achieves results comparable to the top-performing models.
- Comprehensive Experiments and Valuable Insights. Our experiments indicate that most existing MLLMs continue to face challenges with GUI-oriented tasks, particularly in sequential and dynamic GUI content. Empirical findings suggest that improvements in vision perception, along with an increase in the number of keyframes and higher resolution, can boost performance in GUI-oriented tasks, thereby paving the way for the future of GUI agents.
GUI-World Dataset Construction
- GUI Video Collection and Image Sequence Processing: In this phase, a group of 24 undergraduate and graduate students manually collects GUI-related videos from YouTube or records screens by hand. The students then use video editing software to cut the videos into short clips, each containing various human operations on GUI content, and annotate them with detailed operational descriptions.
- Diversifying QA Types through MLLM-Human Collaboration: Since human annotations may contain grammatical errors or unclear statements, we use an MLLM, specifically GPT-4V, first to refine the descriptions of the image sequences and then to generate various types of QA focusing on static and dynamic GUI content, comprehensively testing the GUI-oriented abilities of MLLMs. Finally, all MLLM-generated content is carefully reviewed by human annotators to ensure alignment with the original human intent.
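The two phases above can be sketched as a simple refine-then-generate loop. This is an illustrative sketch, not the paper's released code: `call_mllm` is a hypothetical stand-in for a GPT-4V request, stubbed here so the control flow runs offline.

```python
# Sketch of the MLLM-human collaboration pipeline described above.
# call_mllm is a hypothetical stand-in for a GPT-4V API call; it is
# stubbed so the control flow can run without network access.

def call_mllm(prompt: str, frames: list) -> str:
    return f"[MLLM output for: {prompt[:40]}]"  # stub response

def build_annotation(frames: list, human_caption: str) -> dict:
    # Phase 1 output is a possibly noisy human description; refine it first.
    refined = call_mllm(f"Refine this GUI description: {human_caption}", frames)
    # Phase 2: generate diverse QA types over static and dynamic GUI content.
    qa = {
        qa_type: call_mllm(
            f"Write one {qa_type} QA pair grounded in: {refined}", frames)
        for qa_type in ("static", "dynamic", "sequential")
    }
    # All generated content is human-verified downstream, so flag it.
    return {"caption": refined, "qa": qa, "needs_review": True}
```

The returned record mirrors the dataset's flow: a refined caption, several QA types, and a flag routing it to human verification.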
Data Statistics and Comparison
| | AgentStudio | OSWorld | UGIF | AitW | Mind2Web | Rico | FerretUI | WebArena | MetaGUI | MiniWoB++ | OmniAct | MMINA | GUI-WORLD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Instances | 304 | 369 | 523 | 715,142 | 2,350 | 72,219 | 123,702 | 812 | 1,125 | 100 | 9,802 | 1,050 | 12,379 |
| Sem. | High | High | High | High | Both | Low | Low | Low | Low | Low | Low | Low | Both |
| VL | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ |
| Video | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ |
| Web | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Mob. | ❌ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
| Desk. | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| XR | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| Sequential | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| CrossApp | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ✔️ |
| Dynamic | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| Detailed Tasks | General Control | General Control | UI Grounded Instruction Following | GUI Understanding | Web Navigation | UI Code/Layout Generation | UI Grounding & Understanding | Web Navigation | Mobile Navigation | Web Navigation | Code Generation | Web Navigation | GUI Understanding & Instruction Following |
Benchmark
| Category | Model | Setting | Software MC | Software Free | Website MC | Website Free | XR MC | XR Free | Multi MC | Multi Free | iOS MC | iOS Free | Android MC | Android Free | Avg. MC | Avg. Free |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageLLMs | Gemini-Pro-1.5 | R. | 81.7% | 3.339 | 82.6% | 3.452 | 81.2% | 3.154 | 81.2% | 2.959 | 82.0% | 3.213 | 81.6% | 3.220 | 81.7% | 3.223 |
| | | E. | 78.5% | 3.152 | 77.8% | 3.215 | 80.8% | 3.006 | 71.8% | 2.777 | 79.3% | 3.007 | 78.5% | 3.168 | 77.8% | 3.054 |
| | Qwen-VL-Max | R. | 74.9% | 2.676 | 76.9% | 2.656 | 74.2% | 2.469 | 68.8% | 2.432 | 75.4% | 2.779 | 73.7% | 2.309 | 74.0% | 2.553 |
| | | E. | 74.3% | 2.624 | 75.8% | 2.627 | 69.0% | 2.499 | 64.8% | 2.362 | 77.4% | 2.659 | 65.8% | 2.277 | 71.2% | 2.508 |
| | | H. | 75.8% | 2.651 | 75.5% | 2.698 | 77.6% | 2.373 | 66.9% | 2.490 | 74.3% | 2.633 | - | - | 74.0% | 2.569 |
| | GPT-4V | R. | 81.5% | 3.589 | 80.9% | 3.648 | 82.4% | 3.200 | 75.0% | 3.452 | 82.5% | 3.614 | 78.3% | 3.515 | 79.8% | 3.503 |
| | | E. | 85.1% | 3.407 | 80.1% | 3.433 | 81.8% | 2.892 | 81.9% | 3.219 | 86.4% | 3.427 | 79.9% | 3.176 | 82.6% | 3.259 |
| | | H. | 86.0% | 3.520 | 79.8% | 3.655 | 83.4% | 3.200 | 76.9% | 3.449 | 79.9% | 3.453 | - | - | 81.2% | 3.469 |
| | | D.C. | 85.0% | 3.350 | 83.1% | 3.658 | 82.3% | 3.065 | 84.2% | 3.358 | 81.6% | 3.358 | 81.7% | 3.427 | 83.0% | 3.316 |
| | | C.C. | 80.7% | 3.028 | 72.2% | 3.160 | 76.5% | 2.868 | 76.4% | 2.939 | 78.3% | 2.751 | 81.7% | 3.160 | 78.3% | 2.971 |
| | | H.+D.C. | 82.5% | 3.494 | 83.2% | 3.682 | 85.9% | 3.191 | 83.9% | 3.617 | 80.9% | 3.516 | 84.9% | 3.758 | 83.5% | 3.543 |
| | GPT-4o | H. | 86.5% | 3.644 | 83.3% | 3.740 | 84.3% | 3.285 | 81.1% | 3.654 | 83.3% | 3.558 | 90.0% | 3.561 | 84.8% | 3.573 |
| VideoLLMs | ChatUnivi | - | 28.4% | 2.389 | 22.2% | 2.349 | 20.6% | 2.161 | 17.5% | 2.275 | 22.6% | 2.337 | 23.0% | 2.390 | 22.4% | 2.317 |
| | Minigpt4Video | - | 18.9% | 1.475 | 15.3% | 1.520 | 16.3% | 1.362 | 15.4% | 1.457 | 20.1% | 1.501 | 14.6% | 1.342 | 16.8% | 1.443 |
| | VideoChat2 | - | 45.5% | 2.144 | 42.6% | 2.221 | 44.0% | 2.005 | 40.4% | 2.222 | 40.2% | 2.169 | 44.7% | 2.119 | 42.9% | 2.147 |
| | GUI-Vid | - | 59.9% | 2.847 | 54.1% | 2.957 | 55.6% | 2.764 | 52.9% | 2.861 | 51.8% | 2.773 | 53.4% | 2.572 | 54.6% | 2.796 |

*Setting: R. = randomly selected keyframes, E. = programmatically extracted keyframes, H. = human-selected keyframes; C.C./D.C. = concise/detailed textual captions as additional input.*
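The Avg. column appears to be the unweighted mean of the six per-scenario scores; as a sanity check, this can be reproduced from the Gemini-Pro-1.5 (R.) row:

```python
# Verify the Avg. column for the Gemini-Pro-1.5 (R.) row: it matches
# the unweighted mean of the six per-scenario scores.
mc = [81.7, 82.6, 81.2, 81.2, 82.0, 81.6]          # MC accuracy per scenario
free = [3.339, 3.452, 3.154, 2.959, 3.213, 3.220]  # free-form score per scenario

avg_mc = round(sum(mc) / len(mc), 1)
avg_free = round(sum(free) / len(free), 3)
print(avg_mc, avg_free)  # → 81.7 3.223, as reported in the table
```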
| Category | Model | Setting | Caption (Concise) | Caption (Desc.) | Static | Dyn. | Pred. | Conv. (Round 1) | Conv. (Round 2) | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| ImageLLMs | Gemini-Pro-1.5 | R. | 3.659 | 2.837 | 2.969 | 2.822 | 3.450 | 3.608 | 3.845 | 3.339 |
| | | E. | 3.350 | 2.468 | 2.741 | 2.431 | 3.292 | 3.458 | 3.837 | 3.152 |
| | Qwen-VL-Max | R. | 2.381 | 1.758 | 2.277 | 2.144 | 2.724 | 3.125 | 3.317 | 2.676 |
| | | E. | 2.459 | 1.693 | 2.143 | 1.954 | 2.742 | 3.174 | 3.298 | 2.624 |
| | | H. | 2.474 | 1.711 | 2.137 | 2.032 | 2.834 | 3.223 | 3.257 | 2.651 |
| | GPT-4V | R. | 3.579 | 2.676 | 3.243 | 3.011 | 3.630 | 3.925 | 4.131 | 3.589 |
| | | E. | 3.141 | 2.301 | 2.927 | 2.627 | 3.541 | 3.844 | 4.103 | 3.407 |
| | | H. | 3.352 | 2.509 | 3.053 | 2.849 | 3.609 | 3.928 | 4.163 | 3.520 |
| | | C.C. | 3.454 | 2.547 | 1.818 | 2.335 | 3.577 | 3.521 | 3.884 | 3.028 |
| | | D.C. | 3.412 | 2.627 | 2.603 | 2.591 | 3.723 | 3.759 | 4.072 | 3.350 |
| | | H.+D.C. | 3.436 | 2.677 | 2.927 | 2.750 | 3.791 | 3.857 | 4.148 | 3.494 |
| | GPT-4o | H. | 4.048 | 3.028 | 3.125 | 3.117 | 3.562 | 4.129 | 4.318 | 3.644 |
| VideoLLMs | ChatUnivi | - | 1.587 | 1.240 | 1.705 | 1.656 | 2.524 | 2.698 | 3.366 | 2.389 |
| | Minigpt4Video | - | 1.246 | 1.073 | 1.249 | 1.235 | 1.675 | 1.494 | 1.719 | 1.475 |
| | VideoChat2 | - | 1.992 | 1.312 | 1.812 | 1.682 | 2.158 | 2.342 | 2.720 | 2.144 |
| | GUI-Vid | - | 3.562 | 2.058 | 2.376 | 2.090 | 3.435 | 3.080 | 3.260 | 2.847 |
| Setting | F.K. | E.K. | Data (I.) | Data (V.) | Software MC | Software Free | Website MC | Website Free | XR MC | XR Free | Multi MC | Multi Free | iOS MC | iOS Free | Android MC | Android Free | Avg. MC | Avg. Free |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | 8 | - | - | 45.5% | 2.144 | 42.6% | 2.221 | 44.0% | 2.005 | 40.4% | 2.222 | 40.2% | 2.169 | 44.7% | 2.119 | 42.9% | 2.147 |
| | - | 16 | - | - | 45.1% | 2.144 | 41.8% | 2.240 | 41.0% | 2.007 | 40.7% | 2.238 | 39.9% | 2.138 | 44.7% | 2.147 | 42.2% | 2.154 |
| GUI-Vid | 8 | 8 | ✖ | ✔ | 58.3% | 2.709 | 53.6% | 2.817 | 62.2% | 2.626 | 54.2% | 2.627 | 53.1% | 2.708 | 54.9% | 2.501 | 56.0% | 2.665 |
| | 8 | 8 | ✔ | ✔ | 59.9% | 2.856 | 54.1% | 2.925 | 59.0% | 2.751 | 52.1% | 2.837 | 50.0% | 2.756 | 54.0% | 2.571 | 54.8% | 2.782 |
| | 8 | 16 | ✖ | ✔ | 59.0% | 2.709 | 55.1% | 2.821 | 62.8% | 2.645 | 53.3% | 2.624 | 55.5% | 2.727 | 55.7% | 2.501 | 56.9% | 2.671 |
| | 8 | 16 | ✔ | ✔ | 59.9% | 2.847 | 54.1% | 2.957 | 55.6% | 2.764 | 52.9% | 2.861 | 51.8% | 2.772 | 53.4% | 2.572 | 54.6% | 2.796 |
| Res. | Desc. | Conv. | Dyn. | Static | Caption | Average |
|---|---|---|---|---|---|---|
| Low | 2.794 | 3.912 | 3.150 | 2.869 | 3.672 | 3.394 |
| High | 3.031 | 4.056 | 3.318 | 3.131 | 3.911 | 3.573 |
Empirical Results
Commercial ImageLLMs outperform Open-source VideoLLMs in Zero-shot Settings
- Commercial ImageLLMs, notably GPT-4V and GPT-4o, consistently outperform open-source VideoLLMs in zero-shot settings. GPT-4o exhibits superior performance across all GUI scenarios in complex tasks, averaging 84.8% on multiple-choice and 3.573 on free-form queries. Similarly, Gemini demonstrates strong capabilities in captioning and descriptive tasks within software and iOS environments, scoring 2.836 and 2.936, respectively. Further analysis reveals that GPT-4V excels in applications with minimal textual content and simple layouts, such as TikTok, health apps, and GitHub, whereas its performance drops in more intricate applications like Microsoft To Do and XR software. The significantly poorer performance of VideoLLMs is attributed to two main factors: an inability to accurately interpret GUI content from user inputs, and a lack of GUI-oriented pretraining, which is evident from their inadequate performance even on basic captioning and description tasks.
Performance Varies across Different GUI Scenarios
GPT-4V and Gemini excel in common scenarios such as mobile and website interfaces but show marked deficiencies in more complex GUI environments like XR and multi-window interactions, across both captioning and intricate tasks. This performance gap highlights a significant shortfall in understanding environments where GUI elements are scattered and demand sophisticated interpretation. It emphasizes the critical need for specialized benchmarks and datasets tailored to these complex GUI scenarios, which is essential for enhancing the GUI-oriented capabilities of MLLMs, paving the way for them to become truly reliable and high-performing general control agents.
Keyframe Selection is Important for GUI-oriented Tasks
Across both basic tasks such as captioning and more complex tasks like prediction and reasoning, performance varies significantly with the keyframe selection method. GPT-4V and Gemini benefit markedly from randomly selected and human-selected keyframes, scoring approximately 0.2-0.3 points higher in both captioning and free-form tasks than with programmatic extraction. This suggests that traditional keyframe-extraction techniques, designed for natural videos, are less effective at capturing essential GUI operations, particularly subtle events such as mouse clicks and dynamic changes. Conversely, the performance gap is smaller for Qwen-VL-Max, indicating that while the keyframe selection method is crucial for models proficient in GUI content, it exerts less influence on less capable models.
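For illustration, the selection strategies compared here can be sketched as index samplers over a clip's frames. The function names are ours, not from the paper; the point is that evenly spaced sampling can skip the exact frame where a brief GUI event occurs.

```python
import random

def uniform_keyframes(num_frames: int, k: int) -> list:
    """Evenly spaced frame indices -- a simple programmatic baseline
    that can miss brief GUI events (a click, a tooltip) falling
    between sample points."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    return [int(i * step) for i in range(k)]

def random_keyframes(num_frames: int, k: int, seed: int = 0) -> list:
    """Randomly chosen frame indices, kept in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), k))

# Human selection would replace these indices with frames judged to
# contain the essential GUI operations.
```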
Dynamic GUI Tasks Continue to Challenge MLLMs
In the fine-grained tasks, GPT-4V and GPT-4o excel with static GUI content and prediction tasks over image sequences but struggle to provide detailed descriptions of entire videos and dynamic GUI content. This discrepancy arises because minor variations in a GUI can significantly alter the correct description. Increasing the number of keyframes and the granularity of perception might mitigate these issues. Among VideoLLMs, ChatUnivi excels in conversational tasks by effectively leveraging contextual cues, particularly in later rounds, yet it underperforms in GUI-oriented captioning tasks. In contrast, GUI-Vid demonstrates proficiency in sequential tasks but falls short in both captioning and static content handling. This gap stems from deficiencies in GUI-Vid's pretraining, which lacked the comprehensive GUI content needed for effective vision-text alignment; the instruction-tuning process also failed to fully address these shortcomings.
Vision Perception is Important for Sequential GUI Tasks
Integrating detailed textual information slightly outperforms purely vision-based inputs or detailed captions, akin to a Chain of Thought (CoT) setting. Surprisingly, GPT-4V excels in caption and prediction tasks with just detailed captions, providing insights on enhancing specific GUI-oriented tasks through additional textual information. However, it still falls short in more challenging tasks, such as retrieving static or dynamic content. This underscores the critical role of visual perception in GUI environments, where even minor changes can significantly impact outcomes.
Substantial Enhancement of GUI-Vid on Graphic-based Interfaces after Fine-tuning on GUI-World
As a pioneering study in training VideoLLMs as screen agents, GUI-Vid significantly outperforms its baseline model, showing an average improvement of 30% across various tasks and GUI scenarios and even surpassing the commercial ImageLLM Qwen-VL-Max. This enhancement is particularly notable in captioning and prediction over image sequences, where GUI-Vid matches the performance of GPT-4V and Gemini-Pro. Our two-stage progressive fine-tuning significantly enhances performance in all GUI scenarios. Remarkably, GUI-Vid scored 3.747 on caption tasks in the XR scenario, highlighting both its potential in XR applications and the high quality of our dataset's annotations. However, in multiple-choice QA and chatbot tasks, GUI-Vid still lags behind industry leaders like GPT-4V and Gemini-Pro, a gap likely due to the weaker baseline LLM and the challenges of instruction-based fine-tuning.
Upper Bound of GUI-oriented Capability with More Keyframes and Higher Resolution
Our two ablation studies during the fine-tuning phase demonstrate that utilizing GUI image-text captioning data significantly enhances the model's preliminary understanding of GUI elements, outperforming training that relies solely on videos. Additionally, an increased number of keyframes correlates with improved performance across various scenarios, notably in environments featuring multiple windows and software applications. Further evidence reveals that higher image resolutions substantially boost task performance, both basic and complex, for GPT-4o. These findings underscore the potential for further developing a more robust GUI Agent.
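The resolution effect has a simple intuition: GUI details are often only a pixel or two wide (a cursor, a checkbox border), so aggressive downscaling can erase them entirely. A toy illustration of ours, not from the paper:

```python
def downsample(img, factor):
    """Nearest-neighbor downsampling of a 2D pixel grid by striding."""
    return [row[::factor] for row in img[::factor]]

# An 8x8 "screenshot" containing a single 1-pixel GUI detail.
img = [[0] * 8 for _ in range(8)]
img[3][5] = 1  # thin cursor / checkbox border

low = downsample(img, 2)  # halve the resolution
detail_survives = any(1 in row for row in low)
print(detail_survives)  # → False: row 3 is dropped by the stride
```

Higher-resolution inputs keep such details in view of the model, which is consistent with the GPT-4o ablation above.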
Acknowledgement
BibTeX
@misc{chen2024guiworld,
  title={GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents},
  author={Dongping Chen and Yue Huang and Siyuan Wu and Jingyu Tang and Liuyi Chen and Yilin Bai and Zhigang He and Chenlong Wang and Huichi Zhou and Yiqiang Li and Tianshuo Zhou and Yue Yu and Chujie Gao and Qihui Zhang and Yi Gui and Zhen Li and Yao Wan and Pan Zhou and Jianfeng Gao and Lichao Sun},
  year={2024},
  eprint={2406.10819},
  archivePrefix={arXiv},
}
GUI-World Team