OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie1, Danyang Zhang1, Jixuan Chen1, Xiaochuan Li1, Siheng Zhao1, Ruisheng Cao1, Toh Jing Hua1, Zhoujun Cheng1, Dongchan Shin1, Fangyu Lei1, Yitao Liu1, Yiheng Xu1, Shuyan Zhou3, Silvio Savarese2, Caiming Xiong2, Victor Zhong4, Tao Yu1
2025-07-28: Major upgrade! OSWorld has been enhanced and is now OSWorld-Verified, with comprehensive improvements: fixes for community-reported examples, AWS support that brings evaluation time to within 1 hour, and updated benchmark results. See the verified results in the Benchmark section below, and compare your OSWorld results against them when running the latest version.
Abstract
OSWorld Environment Infrastructure
Data Statistics and Comparison
Key statistics of OSWorld. "Supp. tasks" refers to the Windows-based tasks, which can only be used after activation due to copyright restrictions.
Distribution of task instructions in OSWorld across app domains and operation types.
**The columns indicate:** whether they provide a controllable executable environment *(Control. Exec. Env.)*, the ease of adding new tasks involving arbitrary applications in open domains *(Environment Scalability)*, support for multimodal agent evaluation *(Multimodal Support)*, support for and inclusion of cross-app tasks *(Cross-App)*, capability to start tasks from an intermediate initial state *(Intermediate Init. State)*, and the number of execution-based evaluation functions *(# Exec.-based Eval. Func.)*.
| Benchmark | # Instances (# Templates) | Control. Exec. Env. | Environment Scalability? | Multimodal Support? | Cross-App? | Intermediate Init. State? | # Exec.-based Eval. Func. |
|---|---|---|---|---|---|---|---|
| **OSWorld** | 369 | Computer | ✔️ | ✔️ | ✔️ | ✔️ | 134 |
| GAIA | 466 | ❌ | - | ❌ | ❌ | ❌ | 0 |
| Mind2Web | 2350 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
| WebLINX | 2337 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
| PixelHelp | 187 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
| MetaGUI | 1125 | ❌ | - | ✔️ | ❌ | ❌ | 0 |
| AitW | 30k | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
| OmniAct | 9802 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
| ScreenAgent | 70 | ❌ | - | ✔️ | ❌ | ✔️ | 0 |
| AgentBench | 1091 | Multi-isolated | ❌ | ❌ | ❌ | ❌ | 7 |
| InterCode | 1350(3) | Code | ❌ | ❌ | ❌ | ❌ | 3 |
| MiniWoB++ | 125 | Web | ❌ | ✔️ | ❌ | ❌ | 125 |
| WebShop | 12k(1) | Web | ❌ | ✔️ | ❌ | ❌ | 1 |
| WebArena | 812(241) | Web | ❌ | ✔️ | ❌ | ❌ | 5 |
| VisualWebArena | 910(314) | Web | ❌ | ✔️ | ❌ | ❌ | 6 |
| WorkArena | 23k(29) | Web | ❌ | ✔️ | ❌ | ✔️ | 7 |
| WikiHow | 150(16) | Mobile | ❌ | ✔️ | ❌ | ❌ | 16 |
| AssistGUI | 100 | ❌ | ❌ | ✔️ | ❌ | ✔️ | 2 |
Benchmark
Important Notice: Google Drive Tasks (2025-07-28)
OSWorld contains 8 Google Drive-related tasks that may encounter setup issues during task initialization due to IP changes or other network-related factors, even when following our configuration guidelines correctly.
Two acceptable approaches for evaluation:
- Manual Adjustment: You can manually configure these 8 tasks to complete the full 369-task evaluation
- Exclude Tasks: You can exclude these 8 tasks and run the remaining 361 tasks instead; this is officially permitted and acceptable (see the filtering sketch below)
Both approaches are valid for benchmark comparison and leaderboard submission.
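If you take the exclusion route, filtering the task list is mechanical. Below is a minimal sketch, assuming the task list is a JSON file mapping app domains to example IDs; the file paths and the ID set are placeholders, so substitute the actual Google Drive example IDs from your OSWorld checkout:

```python
import json

# Placeholder: fill in the 8 actual Google Drive example IDs from your checkout.
GDRIVE_TASK_IDS: set = set()

# Assumed layout: a JSON file mapping {app_domain: [example_id, ...]}.
with open("evaluation_examples/test_all.json") as f:
    tasks = json.load(f)

# Drop the Google Drive-related examples from every domain.
filtered = {
    domain: [tid for tid in ids if tid not in GDRIVE_TASK_IDS]
    for domain, ids in tasks.items()
}

with open("evaluation_examples/test_all_no_gdrive.json", "w") as f:
    json.dump(filtered, f, indent=2)

print(sum(len(ids) for ids in filtered.values()), "tasks remain")  # expect 361
```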
Results
These are official results evaluated by our team under unified settings and environment. All models are tested with consistent evaluation protocols to ensure fair comparison.
For self-reported results and progress trends across different modalities, click here.
All verified trajectories are hosted on Hugging Face for community analysis.
- A **General model** has broad, general-purpose capabilities; "computer use" is one capability that can be elicited via prompting, and the model can still perform other tasks such as dialogue and code generation.
- A **Specialized model** is trained specifically to serve as a computer-use agent; other capabilities are out of scope and are not emphasized in the corresponding reports.
- An **Agentic framework** organizes one or more General and Specialized models into a structured workflow; commonly, a GPT-family model acts as the planner while a proprietary or task-specific model serves as the grounder.
We will add new paradigms as they emerge.
Analysis
Videos
- @Yannic Kilcher
- @Wes Roth
- @hu-po
- @Dylan Curious
- @WorldofAI
- @Gourcer
- @AI Explained
- @Fireship
- @1littlecoder
Acknowledgement
We thank Sida Wang, Peter Shaw, Alane Suhr, Luke Zettlemoyer, Haoyuan Wu, Junli Wang, Chengyou Jia, Junlin Yang, Junlei Zhang, Chen Henry Wu, Pengcheng Yin, Shunyu Yao, Xing Han Lu, Siva Reddy, Ruoxi Sun, Zhiyuan Zeng, and Lei Li for their helpful feedback on this work.
Acknowledgement for OSWorld-Verified
Evaluation
Local Evaluation
Public Evaluation
FAQ
What is the username and password for the virtual machines?
For the vmware, virtualbox, and docker providers, the Ubuntu account credentials are user / password.
For cloud providers such as aws, to prevent attacks that exploit weak passwords, the password defaults to osworld-public-evaluation.
If you change it, remember to set the client_password variable and pass it to DesktopEnv and to the Agent (if supported) when running experiments.
Some features, such as proxy setup, need the client VM password to obtain sudo privileges, and some OSWorld tasks require the agent to use the password for sudo as well.
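For illustration, a minimal sketch of wiring the password through; the keyword names below follow the FAQ's wording and are assumptions, so verify them against the DesktopEnv constructor in your checkout:

```python
from desktop_env.desktop_env import DesktopEnv  # OSWorld environment class

# Default for cloud providers such as AWS; vmware/virtualbox/docker images
# use "password" unless you changed it.
client_password = "osworld-public-evaluation"

# Assumed keyword names; check the actual DesktopEnv signature before use.
env = DesktopEnv(
    provider_name="aws",
    client_password=client_password,  # needed for sudo-dependent setup and tasks
)
```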
How to setup the account and credentials for Google and Google Drive?
See Account Guideline.
What should I do if Google Drive tasks fail to initialize properly?
OSWorld contains 8 Google Drive-related tasks that may encounter setup issues during initialization due to various factors:
Common Issues:
- IP address changes causing authentication problems
- Network restrictions or firewalls
- Google API rate limiting or access restrictions
- Regional availability limitations
Option 1 - Manual Configuration: Manually troubleshoot and configure these 8 tasks to complete the full 369-task evaluation.
Option 2 - Task Exclusion: Exclude these 8 tasks and run the remaining 361 tasks - this is officially permitted and acceptable for benchmark evaluation.
Both approaches are valid for research comparison and leaderboard submission. Please specify which approach you used when reporting your results.
How can I configure a proxy for the VM (e.g., if I'm behind the GFW, or I don't want some of my tasks to be identified as a bot and receive lower scores)?
If you want to set it up yourself, please refer to the Proxy Guideline.
We also provide a pre-configured solution based on dataimpulse; please refer to the proxy-setup section in PUBLIC_EVALUATION_GUIDELINE.
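If you go the self-setup route, one common pattern is to export the standard proxy environment variables before launching experiments, so that agent-side HTTP(S) traffic is routed through the proxy. A minimal sketch, with a placeholder endpoint; the authoritative steps remain in the guidelines above:

```python
import os

# Placeholder endpoint; replace with your own proxy (e.g. one from dataimpulse).
PROXY_URL = "http://127.0.0.1:7890"

# Most Python HTTP clients honor these standard environment variables.
os.environ["http_proxy"] = PROXY_URL
os.environ["https_proxy"] = PROXY_URL
```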
BibTeX
@misc{OSWorld,
title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
year={2024},
eprint={2404.07972},
archivePrefix={arXiv},
primaryClass={cs.AI}
}