Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu*1, Zekun Wang*1, Junli Wang*1, Dunjie Lu1, Tianbao Xie1, Amrita Saha2, Doyen Sahoo2, Tao Yu^1, Caiming Xiong^2
1University of Hong Kong
2Salesforce Research
*Equal contribution
^Corresponding authors
Abstract
AGUVIS is a unified, pure vision-based framework for autonomous GUI agents that operates across platforms (web, desktop, and mobile). Unlike previous approaches that rely on textual representations such as HTML or accessibility trees, AGUVIS pairs image-based observations with a consistent, cross-platform action space, yielding better generalization across different platforms.
Key Features & Contributions
- 🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
- 🔄 Cross-Platform Unification: Unified action space and plugin system that work consistently across different GUI environments (see the sketch after this list)
- 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
- 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
- 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training
Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.
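To make the cross-platform unification concrete, the sketch below shows one way a coordinate-based action space with a plugin mechanism could be organized. It is a minimal illustration under our own naming (`ActionSpace`, `register_plugin`, `plugin`, and the action primitives are assumptions), not the actual AGUVIS interface or released code.

```python
# Minimal sketch of a unified, screenshot-coordinate action space with
# environment-specific plugins. All names here are illustrative assumptions,
# not the actual AGUVIS implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """A single GUI action expressed purely via coordinates or keystrokes."""
    name: str
    kwargs: dict


@dataclass
class ActionSpace:
    """Core primitives shared by web, desktop, and mobile, plus optional plugins."""
    plugins: Dict[str, Callable[..., Action]] = field(default_factory=dict)

    # Core, platform-agnostic primitives (normalized [0, 1] screen coordinates).
    def click(self, x: float, y: float) -> Action:
        return Action("click", {"x": x, "y": y})

    def write(self, text: str) -> Action:
        return Action("write", {"text": text})

    def hotkey(self, *keys: str) -> Action:
        return Action("hotkey", {"keys": list(keys)})

    def scroll(self, dx: float, dy: float) -> Action:
        return Action("scroll", {"dx": dx, "dy": dy})

    # Plugin mechanism: environment-specific actions without new core primitives.
    def register_plugin(self, name: str, fn: Callable[..., Action]) -> None:
        self.plugins[name] = fn

    def plugin(self, name: str, **kwargs) -> Action:
        return self.plugins[name](**kwargs)


if __name__ == "__main__":
    space = ActionSpace()
    # A mobile-only gesture exposed as a plugin rather than a new primitive.
    space.register_plugin("long_press", lambda x, y: Action("long_press", {"x": x, "y": y}))

    trajectory: List[Action] = [
        space.click(0.62, 0.18),                     # identical on web, desktop, mobile
        space.write("Men's Blazers"),
        space.hotkey("enter"),
        space.plugin("long_press", x=0.40, y=0.55),  # platform-specific plugin action
    ]
    for step in trajectory:
        print(step)
```

Because every action is expressed in screen coordinates or keystrokes rather than DOM selectors or accessibility-tree IDs, the same trajectory format can be reused across platforms, while platform-specific gestures live behind the plugin registry instead of fragmenting the core action set.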
🔥 AGUVIS on OSWorld
| Planner | Grounder | OS | Calc | Impress | Writer | VLC | TB | Chrome | VSC | GIMP | WF | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | — | 8.33 | 0.00 | 6.77 | 4.35 | 16.10 | 0.00 | 4.35 | 4.35 | 3.85 | 5.58 | 5.03 |
| GPT-4o | SoM | 20.83 | 0.00 | 6.77 | 4.35 | 6.53 | 0.00 | 4.35 | 4.35 | 0.00 | 3.60 | 4.59 |
| GPT-4o | SeeClick | 16.67 | 0.00 | 12.76 | 4.35 | 23.52 | 6.67 | 10.86 | 8.70 | 11.54 | 7.92 | 9.21 |
| GPT-4o | OS-Atlas-Base-4B | 20.83 | 2.23 | 14.89 | 8.70 | 23.52 | 13.33 | 15.22 | 13.04 | 15.38 | 7.92 | 11.65 |
| GPT-4o | OS-Atlas-Base-7B | 25.00 | 4.26 | 17.02 | 8.70 | 29.41 | 26.67 | 19.57 | 17.39 | 19.23 | 8.91 | 14.63 |
| GPT-4o | AGUVIS-7B | 41.67 | 4.26 | 8.51 | 17.38 | 17.65 | 26.67 | 17.23 | 17.39 | 34.62 | 5.58 | 14.79 |
| AGUVIS-72B | AGUVIS-72B | 20.83 | 4.26 | 11.03 | 13.04 | 12.41 | 20.00 | 15.06 | 17.39 | 11.54 | 3.60 | 10.26 |
| Human | — | 75.00 | 61.70 | 80.85 | 73.91 | 70.59 | 46.67 | 78.26 | 73.91 | 73.08 | 73.27 | 72.36 |
Training Pipeline
Explore AGUVIS Examples
- Task: Show me the list of Men's Blazers, Black, Size M on Uniqlo. (video recording)
- Task: Delete all but one of any expenses in Arduia Pro Expense that are exact duplicates, ensuring at least one instance of each unique expense remains. (video recording)
- Task: Export the current document into PDF, keeping the file name. (video recording)
Offline Experiments
ScreenSpot
Comparison of various planners and grounding methods on ScreenSpot across devices and input modalities. The top part of the table shows results under the original-instruction evaluation setting, while the bottom part shows results under the self-plan evaluation setting. Best results are in bold.
Multimodal-Mind2Web
Performance comparison on Multimodal-Mind2Web across different settings. We report element accuracy (Ele.Acc), operation F1 (Op.F1), and step success rate (Step SR). Best results are in bold. "T" denotes textual HTML inputs; "I" denotes GUI image (screenshot) inputs.
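As a rough illustration of how these step-level metrics fit together, here is a sketch under the common Mind2Web convention that a step counts as successful only when both the chosen element and the operation are correct; the function names and record layout below are ours, not the benchmark's evaluation code.

```python
# Sketch of the step-level metrics named above. Record layout and names are
# illustrative assumptions, not the official Multimodal-Mind2Web evaluator.
from collections import Counter
from typing import Dict, List


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and reference operation string."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if not overlap:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def step_metrics(steps: List[Dict]) -> Dict[str, float]:
    """Each step record: {'pred_elem', 'gold_elem', 'pred_op', 'gold_op'}."""
    ele_acc = sum(s["pred_elem"] == s["gold_elem"] for s in steps) / len(steps)
    op_f1 = sum(token_f1(s["pred_op"], s["gold_op"]) for s in steps) / len(steps)
    step_sr = sum(
        s["pred_elem"] == s["gold_elem"] and s["pred_op"] == s["gold_op"]
        for s in steps
    ) / len(steps)
    return {"Ele.Acc": ele_acc, "Op.F1": op_f1, "Step SR": step_sr}
```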
AndroidControl
Step accuracy on out-of-domain (OOD) AndroidControl data under high-level and low-level tasks. Best results are in bold. "Acc.Tree" denotes the textual accessibility tree.
Online Experiments
Mind2Web-Live
Task Success Rate (SR) and efficiency costs on Mind2Web-Live. "USD Efficiency" is calculated by dividing the model's total inference cost in USD by the number of successful steps.
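Written out, the efficiency metric described above is simply:

$$\text{USD Efficiency} = \frac{\text{total inference cost (USD)}}{\text{number of successful steps}}$$

so a lower value means the agent completes successful steps more cheaply.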
AndroidWorld
Task Success Rates (SR) on AndroidWorld and MobileMiniWob++. Best results are in bold.
BibTeX
@misc{xu2024aguvis,
title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
year={2024},
eprint={2412.04454},
archivePrefix={arXiv},
primaryClass={cs.CL}
}