UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
UI-Vision
Core Capabilities
Task Examples
Action Space
| Action | Description |
|---|---|
| Move(x, y) | Move the mouse to the specified coordinates. |
| Click(x, y, button) | Click the specified button at the given coordinates. |
| Typing('Hello') | Types a specified string. |
| Hotkey('ctrl', 'c') | Performs individual or combination hotkeys. |
| Drag([x1,y1], [x2,y2]) | Drags the mouse from start (x1,y1) to end (x2,y2). |
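As a minimal sketch, the action space above could be modeled with one small data class per action. The class and field names mirror the table, but the `render` helper, the integer coordinate type, and the `"left"` default button are assumptions made for illustration and are not part of the benchmark's published interface.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Move:
    x: int
    y: int


@dataclass(frozen=True)
class Click:
    x: int
    y: int
    button: str = "left"  # assumed default; the table does not specify one


@dataclass(frozen=True)
class Typing:
    text: str


@dataclass(frozen=True)
class Hotkey:
    keys: Tuple[str, ...]  # one key, or several keys pressed in combination


@dataclass(frozen=True)
class Drag:
    start: Tuple[int, int]
    end: Tuple[int, int]


Action = Union[Move, Click, Typing, Hotkey, Drag]


def render(action: Action) -> str:
    """Format an action in the notation used by the table (hypothetical helper)."""
    if isinstance(action, Move):
        return f"Move({action.x}, {action.y})"
    if isinstance(action, Click):
        return f"Click({action.x}, {action.y}, {action.button})"
    if isinstance(action, Typing):
        return f"Typing({action.text!r})"
    if isinstance(action, Hotkey):
        return "Hotkey(" + ", ".join(repr(k) for k in action.keys) + ")"
    if isinstance(action, Drag):
        (x1, y1), (x2, y2) = action.start, action.end
        return f"Drag([{x1},{y1}], [{x2},{y2}])"
    raise TypeError(f"unsupported action: {action!r}")
```

Frozen data classes keep each action immutable and comparable, which makes predicted action sequences easy to check against reference sequences.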
Platforms
Applications and tools analyzed in our research
Development (24.7%)
Browsers (4.7%)
Productivity (32.9%)
Creativity (18.8%)
Entertainment (9.4%)
Dataset & Benchmark
Dataset Composition
Each application is represented by multiple screenshots capturing different states and functionalities, ensuring comprehensive coverage of UI components and interaction patterns.
Domain distribution across the UI-Vision dataset
Benchmark Tasks
Element Grounding
Layout Grounding
Action Prediction
Task Progression
Tasks increase in complexity from basic element identification to complex action sequences, enabling comprehensive evaluation of AI capabilities in GUI environments.
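To make the simplest of these tasks concrete, the sketch below scores element grounding by checking whether a predicted point lands inside the target element's bounding box. The point-in-box criterion, the `(x_min, y_min, x_max, y_max)` box layout, and both function names are assumptions chosen for illustration; this page does not specify the exact scoring rule.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def success_rate(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Percentage of predictions that fall inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return 100.0 * hits / len(targets)
```

For example, one hit out of two samples gives `success_rate([(5, 5), (50, 50)], [(0, 0, 10, 10), (0, 0, 10, 10)])`, which evaluates to `50.0`.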
Results & Insights
Model Performance Overview
Open VLMs
Closed VLMs
Small GUI Agents
Large GUI Agents
Model Size Matters
Larger models (50B+ parameters) consistently outperform their smaller counterparts across all tasks.
Specialized GUI Agents
GUI-specialized models show significant advantages over general-purpose vision-language models.
Task Complexity
Performance drops as task complexity increases, and spatial understanding is the most challenging setting.
Performance by Domain
Detailed Performance Results
| Model | Basic: Ed | Basic: Br | Basic: De | Basic: Pr | Basic: Cr | Basic: En | Basic: Overall | Functional: Ed | Functional: Br | Functional: De | Functional: Pr | Functional: Cr | Functional: En | Functional: Overall | Spatial: Ed | Spatial: Br | Spatial: De | Spatial: Pr | Spatial: Cr | Spatial: En | Spatial: Overall | Final Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Samples | (215) | (56) | (376) | (605) | (438) | (82) | (1772) | (215) | (56) | (376) | (605) | (438) | (82) | (1772) | (212) | (31) | (338) | (740) | (586) | (28) | (1935) | |
| Closed-Source VLMs | ||||||||||||||||||||||
| GPT-4o | 2.23 | 0.00 | 1.86 | 1.16 | 1.14 | 4.88 | 1.58 | 1.40 | 0.00 | 3.19 | 0.83 | 0.91 | 3.66 | 1.52 | 0.94 | 0.00 | 1.48 | 1.22 | 0.51 | 3.57 | 1.03 | 1.38 |
| Claude-3.7-Sonnet | 6.51 | 12.5 | 7.98 | 11.24 | 9.13 | 11.0 | 9.48 | 5.12 | 7.14 | 8.24 | 9.92 | 6.16 | 4.88 | 7.73 | 6.60 | 9.68 | 7.69 | 7.43 | 7.85 | 10.7 | 7.60 | 8.27 |
| Open-Source VLMs | ||||||||||||||||||||||
| MiniCPM-V-8B | 4.19 | 21.4 | 7.71 | 7.44 | 3.65 | 18.3 | 7.11 | 4.19 | 19.6 | 6.38 | 4.63 | 2.97 | 11.0 | 5.30 | 0.47 | 3.23 | 1.78 | 0.27 | 0.17 | 3.57 | 1.45 | 4.34 |
| Open-Source GUI Agents (<8B) | ||||||||||||||||||||||
| UI-TARS-7B | 15.4 | 41.1 | 21.8 | 21.2 | 13.2 | 39.0 | 20.1 | 20.5 | 41.1 | 25.5 | 26.5 | 16.0 | 45.1 | 24.3 | 6.60 | 12.9 | 11.0 | 9.2 | 5.8 | 17.9 | 8.37 | 17.6 |
| Open-Source GUI Agents (>8B) | ||||||||||||||||||||||
| UI-TARS-72B | 30.7 | 48.2 | 32.7 | 33.6 | 21.9 | 51.2 | 31.4 | 29.8 | 46.4 | 30.9 | 34.1 | 22.6 | 36.6 | 30.5 | 13.7 | 16.1 | 19.2 | 15.4 | 11.1 | 25.0 | 14.7 | 25.5 |
Table 1: Performance results across different settings and domains. Values shown are success rates (%). Domains: Ed (Education), Br (Browsers), De (Development), Pr (Productivity), Cr (Creativity), En (Entertainment). Numbers in parentheses indicate sample sizes.
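As a reading aid for the last column, the Final Avg values in the GPT-4o and UI-TARS-72B rows match the unweighted mean of the three Overall scores. The sketch below reproduces that arithmetic; the function name is hypothetical, and whether every row was computed this way is an assumption, since the page does not state the formula.

```python
from statistics import mean


def final_average(basic: float, functional: float, spatial: float) -> float:
    """Unweighted mean of the three Overall success rates (assumed formula)."""
    return mean([basic, functional, spatial])


# Overall values copied from Table 1.
gpt4o = final_average(1.58, 1.52, 1.03)
ui_tars_72b = final_average(31.4, 30.5, 14.7)

print(f"GPT-4o: {gpt4o:.2f}")
print(f"UI-TARS-72B: {ui_tars_72b:.1f}")
```

An unweighted mean gives each setting equal weight even though the Spatial setting has more samples (1935) than the Basic and Functional settings (1772 each).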
Citation
If you find UI-Vision useful in your research, please consider citing our paper:
@misc{nayak2025uivisiondesktopcentricguibenchmark,
title={UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction},
author={Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Juan A. Rodriguez and
Montek Kalsi and Rabiul Awal and Nicolas Chapados and M. Tamer Özsu and
Aishwarya Agrawal and David Vazquez and Christopher Pal and Perouz Taslakian and
Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2503.15661},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.15661},
}