GUI-World: A Dataset for GUI-Oriented Multimodal Large Language Models
2University of Notre Dame, 3Microsoft Research, 4Lehigh University
- A Dataset. We propose GUI-WORLD, a comprehensive GUI dataset comprising over 12,000 videos specifically designed to assess and improve the GUI understanding capabilities of MLLMs, spanning a range of categories and scenarios, including desktop, mobile, and extended reality (XR), and representing the first GUI-oriented instruction-tuning dataset in the video domain.
- A Novel Model. Based on GUI-WORLD, we propose GUI-Vid, a GUI-oriented VideoLLM with enhanced capabilities to handle diverse and complex GUI tasks. GUI-Vid shows a significant improvement on the benchmark and achieves results comparable to the top-performing models.
- Comprehensive Experiments and Valuable Insights. Our experiments indicate that most existing MLLMs continue to face challenges with GUI-oriented tasks, particularly in sequential and dynamic GUI content. Empirical findings suggest that improvements in vision perception, along with an increase in the number of keyframes and higher resolution, can boost performance in GUI-oriented tasks, thereby paving the way for the future of GUI agents.
GUI-World Dataset Construction
- GUI Video Collection and Image Sequence Processing: In this phase, a group of 24 undergraduate and graduate students manually collects GUI-related videos from YouTube or records screens by hand. The students then use video editing software to cut the videos into short clips, each containing various human operations on GUI content, and annotate them with detailed operational descriptions.
- Diversifying QA Types through MLLM-Human Collaboration: Since human annotations may contain grammatical errors or unclear statements, we use an MLLM, specifically GPT-4V, first to refine the descriptions of the image sequences and then to generate various types of QA focusing on static and dynamic GUI content, comprehensively testing the GUI-oriented abilities of MLLMs. Finally, all MLLM-generated content is carefully reviewed by human annotators to ensure alignment with the original human intent.
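The two phases above can be sketched as a simple refine-then-generate loop. This is an illustrative sketch, not the paper's released code: `call_mllm` is a hypothetical stand-in for a GPT-4V request, stubbed here so the control flow runs offline.

```python
# Sketch of the MLLM-human collaboration pipeline described above.
# call_mllm is a hypothetical stand-in for a GPT-4V API call; it is
# stubbed so the control flow can run without network access.

def call_mllm(prompt: str, frames: list) -> str:
    return f"[MLLM output for: {prompt[:40]}]"  # stub response

def build_annotation(frames: list, human_caption: str) -> dict:
    # Phase 1 output is a possibly noisy human description; refine it first.
    refined = call_mllm(f"Refine this GUI description: {human_caption}", frames)
    # Phase 2: generate diverse QA types over static and dynamic GUI content.
    qa = {
        qa_type: call_mllm(
            f"Write one {qa_type} QA pair grounded in: {refined}", frames)
        for qa_type in ("static", "dynamic", "sequential")
    }
    # All generated content is human-verified downstream, so flag it.
    return {"caption": refined, "qa": qa, "needs_review": True}
```

The returned record mirrors the dataset's flow: a refined caption, several QA types, and a flag routing it to human verification.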
Data Statistics and Comparison
| | AgentStudio | OSWorld | UGIF | AitW | Mind2Web | Rico | FerretUI | WebArena | MetaGUI | MiniWoB++ | OmniAct | MMINA | GUI-WORLD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Instances | 304 | 369 | 523 | 715,142 | 2,350 | 72,219 | 123,702 | 812 | 1,125 | 100 | 9,802 | 1,050 | 12,379 |
| Sem. | High | High | High | High | Both | Low | Low | Low | Low | Low | Low | Low | Both |
| VL | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ |
| Video | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ |
| Web | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Mob. | ❌ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
| Desk. | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| XR | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| Sequential | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| CrossApp | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ✔️ |
| Dynamic | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| Detailed Tasks | General Control | General Control | UI Grounded Instruction Following | GUI Understanding | Web Navigation | UI Code/Layout Generation | UI Grounding & Understanding | Web Navigation | Mobile Navigation | Web Navigation | Code Generation | Web Navigation | GUI Understanding & Instruction Following |
Benchmark
| Category | Model | Setting | Software MC | Software Free | Website MC | Website Free | XR MC | XR Free | Multi MC | Multi Free | iOS MC | iOS Free | Android MC | Android Free | Avg. MC | Avg. Free |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageLLMs | Gemini-Pro-1.5 | R. | 81.7% | 3.339 | 82.6% | 3.452 | 81.2% | 3.154 | 81.2% | 2.959 | 82.0% | 3.213 | 81.6% | 3.220 | 81.7% | 3.223 |
| | | E. | 78.5% | 3.152 | 77.8% | 3.215 | 80.8% | 3.006 | 71.8% | 2.777 | 79.3% | 3.007 | 78.5% | 3.168 | 77.8% | 3.054 |
| | Qwen-VL-Max | R. | 74.9% | 2.676 | 76.9% | 2.656 | 74.2% | 2.469 | 68.8% | 2.432 | 75.4% | 2.779 | 73.7% | 2.309 | 74.0% | 2.553 |
| | | E. | 74.3% | 2.624 | 75.8% | 2.627 | 69.0% | 2.499 | 64.8% | 2.362 | 77.4% | 2.659 | 65.8% | 2.277 | 71.2% | 2.508 |
| | | H. | 75.8% | 2.651 | 75.5% | 2.698 | 77.6% | 2.373 | 66.9% | 2.490 | 74.3% | 2.633 | - | - | 74.0% | 2.569 |
| | GPT-4V | R. | 81.5% | 3.589 | 80.9% | 3.648 | 82.4% | 3.200 | 75.0% | 3.452 | 82.5% | 3.614 | 78.3% | 3.515 | 79.8% | 3.503 |
| | | E. | 85.1% | 3.407 | 80.1% | 3.433 | 81.8% | 2.892 | 81.9% | 3.219 | 86.4% | 3.427 | 79.9% | 3.176 | 82.6% | 3.259 |
| | | H. | 86.0% | 3.520 | 79.8% | 3.655 | 83.4% | 3.200 | 76.9% | 3.449 | 79.9% | 3.453 | - | - | 81.2% | 3.469 |
| | | D.C. | 85.0% | 3.350 | 83.1% | 3.658 | 82.3% | 3.065 | 84.2% | 3.358 | 81.6% | 3.358 | 81.7% | 3.427 | 83.0% | 3.316 |
| | | C.C. | 80.7% | 3.028 | 72.2% | 3.160 | 76.5% | 2.868 | 76.4% | 2.939 | 78.3% | 2.751 | 81.7% | 3.160 | 78.3% | 2.971 |
| | | H.+D.C. | 82.5% | 3.494 | 83.2% | 3.682 | 85.9% | 3.191 | 83.9% | 3.617 | 80.9% | 3.516 | 84.9% | 3.758 | 83.5% | 3.543 |
| | GPT-4o | H. | 86.5% | 3.644 | 83.3% | 3.740 | 84.3% | 3.285 | 81.1% | 3.654 | 83.3% | 3.558 | 90.0% | 3.561 | 84.8% | 3.573 |
| VideoLLMs | ChatUnivi | - | 28.4% | 2.389 | 22.2% | 2.349 | 20.6% | 2.161 | 17.5% | 2.275 | 22.6% | 2.337 | 23.0% | 2.390 | 22.4% | 2.317 |
| | Minigpt4Video | - | 18.9% | 1.475 | 15.3% | 1.520 | 16.3% | 1.362 | 15.4% | 1.457 | 20.1% | 1.501 | 14.6% | 1.342 | 16.8% | 1.443 |
| | VideoChat2 | - | 45.5% | 2.144 | 42.6% | 2.221 | 44.0% | 2.005 | 40.4% | 2.222 | 40.2% | 2.169 | 44.7% | 2.119 | 42.9% | 2.147 |
| | GUI-Vid | - | 59.9% | 2.847 | 54.1% | 2.957 | 55.6% | 2.764 | 52.9% | 2.861 | 51.8% | 2.773 | 53.4% | 2.572 | 54.6% | 2.796 |

*Setting: R. = randomly selected keyframes, E. = programmatically extracted keyframes, H. = human-selected keyframes; C.C./D.C. = concise/detailed textual captions as additional input.*
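The Avg. column appears to be the unweighted mean of the six per-scenario scores; as a sanity check, this can be reproduced from the Gemini-Pro-1.5 (R.) row:

```python
# Verify the Avg. column for the Gemini-Pro-1.5 (R.) row: it matches
# the unweighted mean of the six per-scenario scores.
mc = [81.7, 82.6, 81.2, 81.2, 82.0, 81.6]          # MC accuracy per scenario
free = [3.339, 3.452, 3.154, 2.959, 3.213, 3.220]  # free-form score per scenario

avg_mc = round(sum(mc) / len(mc), 1)
avg_free = round(sum(free) / len(free), 3)
print(avg_mc, avg_free)  # → 81.7 3.223, as reported in the table
```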
| Category | Model | Setting | Caption (Concise) | Caption (Desc.) | Static | Dyn. | Pred. | Conv. (Round 1) | Conv. (Round 2) | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| ImageLLMs | Gemini-Pro-1.5 | R. | 3.659 | 2.837 | 2.969 | 2.822 | 3.450 | 3.608 | 3.845 | 3.339 |
| | | E. | 3.350 | 2.468 | 2.741 | 2.431 | 3.292 | 3.458 | 3.837 | 3.152 |
| | Qwen-VL-Max | R. | 2.381 | 1.758 | 2.277 | 2.144 | 2.724 | 3.125 | 3.317 | 2.676 |
| | | E. | 2.459 | 1.693 | 2.143 | 1.954 | 2.742 | 3.174 | 3.298 | 2.624 |
| | | H. | 2.474 | 1.711 | 2.137 | 2.032 | 2.834 | 3.223 | 3.257 | 2.651 |
| | GPT-4V | R. | 3.579 | 2.676 | 3.243 | 3.011 | 3.630 | 3.925 | 4.131 | 3.589 |
| | | E. | 3.141 | 2.301 | 2.927 | 2.627 | 3.541 | 3.844 | 4.103 | 3.407 |
| | | H. | 3.352 | 2.509 | 3.053 | 2.849 | 3.609 | 3.928 | 4.163 | 3.520 |
| | | C.C. | 3.454 | 2.547 | 1.818 | 2.335 | 3.577 | 3.521 | 3.884 | 3.028 |
| | | D.C. | 3.412 | 2.627 | 2.603 | 2.591 | 3.723 | 3.759 | 4.072 | 3.350 |
| | | H.+D.C. | 3.436 | 2.677 | 2.927 | 2.750 | 3.791 | 3.857 | 4.148 | 3.494 |
| | GPT-4o | H. | 4.048 | 3.028 | 3.125 | 3.117 | 3.562 | 4.129 | 4.318 | 3.644 |
| VideoLLMs | ChatUnivi | - | 1.587 | 1.240 | 1.705 | 1.656 | 2.524 | 2.698 | 3.366 | 2.389 |
| | Minigpt4Video | - | 1.246 | 1.073 | 1.249 | 1.235 | 1.675 | 1.494 | 1.719 | 1.475 |
| | VideoChat2 | - | 1.992 | 1.312 | 1.812 | 1.682 | 2.158 | 2.342 | 2.720 | 2.144 |
| | GUI-Vid | - | 3.562 | 2.058 | 2.376 | 2.090 | 3.435 | 3.080 | 3.260 | 2.847 |
| Setting | F.K. | E.K. | Data (I.) | Data (V.) | Software MC | Software Free | Website MC | Website Free | XR MC | XR Free | Multi MC | Multi Free | iOS MC | iOS Free | Android MC | Android Free | Avg. MC | Avg. Free |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | 8 | - | - | 45.5% | 2.144 | 42.6% | 2.221 | 44.0% | 2.005 | 40.4% | 2.222 | 40.2% | 2.169 | 44.7% | 2.119 | 42.9% | 2.147 |
| | - | 16 | - | - | 45.1% | 2.144 | 41.8% | 2.240 | 41.0% | 2.007 | 40.7% | 2.238 | 39.9% | 2.138 | 44.7% | 2.147 | 42.2% | 2.154 |
| GUI-Vid | 8 | 8 | ✖ | ✔ | 58.3% | 2.709 | 53.6% | 2.817 | 62.2% | 2.626 | 54.2% | 2.627 | 53.1% | 2.708 | 54.9% | 2.501 | 56.0% | 2.665 |
| | 8 | 8 | ✔ | ✔ | 59.9% | 2.856 | 54.1% | 2.925 | 59.0% | 2.751 | 52.1% | 2.837 | 50.0% | 2.756 | 54.0% | 2.571 | 54.8% | 2.782 |
| | 8 | 16 | ✖ | ✔ | 59.0% | 2.709 | 55.1% | 2.821 | 62.8% | 2.645 | 53.3% | 2.624 | 55.5% | 2.727 | 55.7% | 2.501 | 56.9% | 2.671 |
| | 8 | 16 | ✔ | ✔ | 59.9% | 2.847 | 54.1% | 2.957 | 55.6% | 2.764 | 52.9% | 2.861 | 51.8% | 2.772 | 53.4% | 2.572 | 54.6% | 2.796 |
| Res. | Desc. | Conv. | Dyn. | Static | Caption | Average |
|---|---|---|---|---|---|---|
| Low | 2.794 | 3.912 | 3.150 | 2.869 | 3.672 | 3.394 |
| High | 3.031 | 4.056 | 3.318 | 3.131 | 3.911 | 3.573 |
Empirical Results
Commercial ImageLLMs outperform Open-source VideoLLMs in Zero-shot Settings
- Commercial ImageLLMs, notably GPT-4V and GPT-4o, consistently outperform open-source VideoLLMs in zero-shot settings. GPT-4o exhibits superior performance across all GUI scenarios in complex tasks, averaging 84.8% on multiple-choice and 3.573 on free-form queries. Similarly, Gemini demonstrates strong capabilities in captioning and descriptive tasks within software and iOS environments, scoring 2.836 and 2.936, respectively. Further analysis reveals that GPT-4V excels in applications with minimal textual content and simple layouts, such as TikTok, health apps, and GitHub, whereas its performance drops in more intricate applications like Microsoft To Do and XR software. The significantly poorer performance of VideoLLMs is attributed to two main factors: an inability to accurately interpret GUI content from user inputs, and a lack of GUI-oriented pretraining, which is evident from their inadequate performance even on basic captioning and description tasks.
Performance Varies across Different GUI Scenarios
GPT-4V and Gemini excel in common scenarios such as mobile and website interfaces but show marked deficiencies in more complex GUI environments like XR and multi-window interactions, across both captioning and intricate tasks. This performance gap highlights a significant shortfall in understanding environments where GUI elements are scattered and demand sophisticated interpretation. It emphasizes the critical need for specialized benchmarks and datasets tailored to these complex GUI scenarios, which is essential for enhancing the GUI-oriented capabilities of MLLMs, paving the way for them to become truly reliable and high-performing general control agents.
Keyframe Selection is Important for GUI-oriented Tasks
Across both basic tasks such as captioning and more complex tasks like prediction and reasoning, performance varies significantly with the keyframe selection method. GPT-4V and Gemini benefit markedly from randomly selected and human-selected keyframes, scoring approximately 0.2-0.3 points higher in both captioning and free-form tasks than with programmatic extraction. This suggests that traditional keyframe-extraction techniques, designed for natural videos, are less effective at capturing essential GUI operations, particularly subtle events such as mouse clicks and dynamic changes. Conversely, the performance gap is smaller for Qwen-VL-Max, indicating that while the keyframe selection method is crucial for models proficient in GUI content, it exerts less influence on less capable models.
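For illustration, the selection strategies compared here can be sketched as index samplers over a clip's frames. The function names are ours, not from the paper; the point is that evenly spaced sampling can skip the exact frame where a brief GUI event occurs.

```python
import random

def uniform_keyframes(num_frames: int, k: int) -> list:
    """Evenly spaced frame indices -- a simple programmatic baseline
    that can miss brief GUI events (a click, a tooltip) falling
    between sample points."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    return [int(i * step) for i in range(k)]

def random_keyframes(num_frames: int, k: int, seed: int = 0) -> list:
    """Randomly chosen frame indices, kept in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), k))

# Human selection would replace these indices with frames judged to
# contain the essential GUI operations.
```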
Dynamic GUI Tasks Continue to Challenge MLLMs
In the fine-grained tasks, GPT-4V and GPT-4o excel with static GUI content and prediction tasks over image sequences but struggle to provide detailed descriptions of entire videos and dynamic GUI content. This discrepancy arises because minor variations in a GUI can significantly alter the correct description. Increasing the number of keyframes and the granularity of perception might mitigate these issues. Among VideoLLMs, ChatUnivi excels in conversational tasks by effectively leveraging contextual cues, particularly in later rounds, yet it underperforms in GUI-oriented captioning tasks. In contrast, GUI-Vid demonstrates proficiency in sequential tasks but falls short in both captioning and static content handling. This gap stems from deficiencies in GUI-Vid's pretraining, which lacked the comprehensive GUI content needed for effective vision-text alignment; the instruction-tuning process also failed to fully address these shortcomings.
Vision Perception is Important for Sequential GUI Tasks
Integrating detailed textual information slightly outperforms purely vision-based inputs or detailed captions, akin to a Chain of Thought (CoT) setting. Surprisingly, GPT-4V excels in caption and prediction tasks with just detailed captions, providing insights on enhancing specific GUI-oriented tasks through additional textual information. However, it still falls short in more challenging tasks, such as retrieving static or dynamic content. This underscores the critical role of visual perception in GUI environments, where even minor changes can significantly impact outcomes.
Substantial Enhancement of GUI-Vid on Graphic-based Interfaces after Fine-tuning on GUI-World
As a pioneering study in training VideoLLMs as screen agents, GUI-Vid significantly outperforms its baseline model, showing an average improvement of 30% across various tasks and GUI scenarios and even surpassing the commercial ImageLLM Qwen-VL-Max. This enhancement is particularly notable in captioning and prediction over image sequences, where GUI-Vid matches the performance of GPT-4V and Gemini-Pro. Our two-stage progressive fine-tuning significantly enhances performance in all GUI scenarios. Remarkably, GUI-Vid scored 3.747 on caption tasks in the XR scenario, highlighting both its potential in XR applications and the high quality of our dataset's annotations. However, in multiple-choice QA and chatbot tasks, GUI-Vid still lags behind industry leaders like GPT-4V and Gemini-Pro, a gap likely due to the weaker baseline LLM and the challenges of instruction-based fine-tuning.
Upper Bound of GUI-oriented Capability with More Keyframes and Higher Resolution
Our two ablation studies during the fine-tuning phase demonstrate that utilizing GUI image-text captioning data significantly enhances the model's preliminary understanding of GUI elements, outperforming training that relies solely on videos. Additionally, an increased number of keyframes correlates with improved performance across various scenarios, notably in environments featuring multiple windows and software applications. Further evidence reveals that higher image resolutions substantially boost task performance, both basic and complex, for GPT-4o. These findings underscore the potential for further developing a more robust GUI Agent.
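The resolution effect has a simple intuition: GUI details are often only a pixel or two wide (a cursor, a checkbox border), so aggressive downscaling can erase them entirely. A toy illustration of ours, not from the paper:

```python
def downsample(img, factor):
    """Nearest-neighbor downsampling of a 2D pixel grid by striding."""
    return [row[::factor] for row in img[::factor]]

# An 8x8 "screenshot" containing a single 1-pixel GUI detail.
img = [[0] * 8 for _ in range(8)]
img[3][5] = 1  # thin cursor / checkbox border

low = downsample(img, 2)  # halve the resolution
detail_survives = any(1 in row for row in low)
print(detail_survives)  # → False: row 3 is dropped by the stride
```

Higher-resolution inputs keep such details in view of the model, which is consistent with the GPT-4o ablation above.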
Acknowledgement
BibTeX
@misc{chen2024guiworld,
  title={GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents},
  author={Dongping Chen and Yue Huang and Siyuan Wu and Jingyu Tang and Liuyi Chen and Yilin Bai and Zhigang He and Chenlong Wang and Huichi Zhou and Yiqiang Li and Tianshuo Zhou and Yue Yu and Chujie Gao and Qihui Zhang and Yi Gui and Zhen Li and Yao Wan and Pan Zhou and Jianfeng Gao and Lichao Sun},
  year={2024},
  eprint={2406.10819},
  archivePrefix={arXiv},
}
GUI-World Team