OS-Atlas Homepage

OS-ATLAS: A Foundation Action Model For Generalist GUI Agents

Zhiyong Wu¹*, Zhenyu Wu¹,²*, Fangzhi Xu¹*, Yian Wang²*, Qiushi Sun³, Chengyou Jia¹, Kanzhi Cheng¹, Zichen Ding¹, Liheng Chen³, Yu Qiao¹

¹Shanghai AI Lab, ²Shanghai Jiao Tong University, ³The University of Hong Kong
*Equal contribution

Paper | Code | Twitter | 🤗 Models | 🤗 13M Data

🏆 ScreenSpot-V2 | arXiv

💻 Demo1: Hide `__pycache__` in VSCode.

🌐 Demo2: Enlarge Font Size in Chrome.

📱 Demo3: Send SMS via Simple Messenger.

Overview

An overview of the functions of OS-Atlas and its superior performance across various dimensions.

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas—a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested substantial engineering effort into developing a toolkit for synthesizing multi-platform GUI grounding data. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs. All our data, code, and models will be made publicly available.

Grounding Data Collection

Training Pipeline

Overall training pipeline of OS-Atlas. We first perform large-scale pre-training on the 13 million collected GUI grounding elements to build OS-Atlas-Base. We then conduct multitask fine-tuning on agent data, which yields OS-Atlas.

Experiments: Grounding Tasks

ScreenSpot

Grounding accuracy on ScreenSpot. The best results are in bold.
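
As an illustration of how a grounding accuracy figure like this can be computed, here is a minimal sketch. It assumes the common convention that a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box of the target element. The function names and the `(left, top, right, bottom)` box layout are hypothetical choices made for this example, not taken from the OS-Atlas code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)
```

For example, with two predictions of which only one falls inside its target box, `grounding_accuracy([(5, 5), (50, 50)], [(0, 0, 10, 10), (0, 0, 10, 10)])` evaluates to `0.5`.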

OS-World

Success rate on the OS-World benchmark, broken down by app (domain).
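
A per-domain success rate of this kind can be sketched as a simple aggregation over task outcomes. The snippet below is illustrative only: the function name, the `(domain, succeeded)` record format, and the domain labels in the usage note are assumptions made for this example and do not reflect the benchmark's actual evaluation harness.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_domain(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each domain to the percentage of its tasks that succeeded."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for domain, succeeded in results:
        totals[domain] += 1
        successes[domain] += int(succeeded)
    # Every domain in `totals` has at least one task, so the division is safe.
    return {domain: 100.0 * successes[domain] / totals[domain] for domain in totals}
```

Given the placeholder records `[("chrome", True), ("chrome", False), ("vscode", True)]`, the function returns `{"chrome": 50.0, "vscode": 100.0}`.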

Experiments: Agent Tasks

Web & Desktop Platform

Results on web and desktop tasks. The InternVL-2/Qwen2-VL and OS-Atlas-4B/7B entries differ in their starting point: the former use the original checkpoints, while the latter are fine-tuned from OS-Atlas-Base.

Mobile Platform

Results on mobile tasks.

BibTeX

@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}
