
Each task is validated and categorized by interface modality (GUI, CLI, or hybrid) and difficulty level, ensuring both diversity and challenge. Notably, ScienceBoard supports cross-application workflows, such as generating reports in TeXstudio based on prior analysis in ChimeraX—enabling end-to-end scientific automation. The benchmark pushes beyond QA or code generation, setting a new bar for evaluating agents on tool use, coding, visual/textual reasoning, and domain-specific knowledge in real research contexts.
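As a rough illustration of how such tasks can be organized, the sketch below models a single benchmark entry with its modality, difficulty, and the applications it spans. The class and field names (ScienceTask, evaluator, etc.) are assumptions for illustration, not ScienceBoard's actual schema.

from dataclasses import dataclass
from typing import Callable, List

# Hypothetical task record; field names are illustrative only.
@dataclass
class ScienceTask:
    task_id: str
    domain: str                          # e.g. "biochemistry", "astronomy", "GIS"
    modality: str                        # "GUI", "CLI", or "hybrid"
    difficulty: str                      # e.g. "easy", "medium", "hard"
    apps: List[str]                      # applications touched by the workflow
    instruction: str                     # natural-language goal given to the agent
    evaluator: Callable[[dict], bool]    # checks the final environment state

# A cross-application workflow: analyze a structure in ChimeraX,
# then write up the result in TeXstudio.
report_task = ScienceTask(
    task_id="chimerax_to_texstudio_report",
    domain="biochemistry",
    modality="hybrid",
    difficulty="hard",
    apps=["ChimeraX", "TeXstudio"],
    instruction="Measure the ligand-binding distances in ChimeraX and "
                "summarize them in a LaTeX report opened in TeXstudio.",
    evaluator=lambda state: "binding distances" in state.get("report_text", ""),
)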


Evaluations

Main Settings

Even state-of-the-art models like GPT-4o and Claude achieve only about 15% average success on ScienceBoard. Open-source agents perform slightly worse, often dropping below 12% and sometimes approaching 0% in specific task categories, highlighting a significant gap relative to human performance. The results are shown in Table 1.

Domain-Specific Challenges: Agents perform relatively well on algebra and biochemistry tasks, but struggle significantly in geospatial and astronomy domains. This gap stems from two factors: (1) GUI-heavy interactions in GIS/astronomy are harder to ground visually than CLI-based tasks, and (2) these domains feature dense, complex visuals (e.g., maps, star charts) that strain current models' spatial reasoning abilities.

Impact of Observations: Multimodal input improves performance. The best results come from combining screenshots with accessibility (a11y) tree representations, which together provide visual grounding and structured element data. In contrast, Set-of-Mark (SoM) annotations sometimes introduce noise, especially in visually crowded interfaces like Celestia.
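A minimal sketch of how these two observation channels might be packaged for a VLM agent is given below; the function name, a11y-node fields, and output format are assumptions rather than ScienceBoard's actual interface.

import base64

def build_observation(screenshot_png: bytes, a11y_tree: dict, max_nodes: int = 200) -> dict:
    """Combine a screenshot with a flattened a11y tree into one observation."""

    def flatten(node: dict, depth: int = 0) -> list:
        # Keep only interactable nodes so the structured channel stays compact.
        entries = []
        if node.get("interactable", False):
            entries.append(f'{"  " * depth}[{node.get("role", "node")}] {node.get("name", "")}')
        for child in node.get("children", []):
            entries.extend(flatten(child, depth + 1))
        return entries

    return {
        "image": base64.b64encode(screenshot_png).decode("ascii"),  # visual grounding
        "a11y_text": "\n".join(flatten(a11y_tree)[:max_nodes]),     # structured elements
    }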

Disentangled Planning and Action

Modular Design Boosts Performance: ScienceBoard experiments reveal that separating planning from execution significantly improves performance. When GPT-4o is used solely as a planner and paired with a specialized VLM or GUI action model like OS-ATLAS or UI-TARS as the executor, success rates increase notably across domains.
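The planner/executor split can be pictured as a simple loop like the one below, where a general model proposes the next step and a GUI grounding model turns it into a concrete action. The environment and model interfaces (plan_next_step, ground_to_action, env.step) are hypothetical placeholders, not real APIs.

def run_episode(env, planner, executor, max_steps: int = 30) -> bool:
    """Disentangled control loop: the planner reasons, the executor grounds actions."""
    obs = env.reset()
    history = []
    for _ in range(max_steps):
        # 1) Planner (e.g. GPT-4o) proposes the next high-level step
        #    from the instruction, current observation, and history.
        step = planner.plan_next_step(task=env.instruction, obs=obs, history=history)
        if step == "DONE":
            break
        # 2) Executor (e.g. a GUI action model such as OS-ATLAS or UI-TARS)
        #    grounds the step to a concrete click/type/keypress.
        action = executor.ground_to_action(step, obs)
        obs, done, info = env.step(action)
        history.append((step, action))
        if done:
            break
    return env.evaluate()  # task-specific checker decides success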

Implication: These findings highlight the potential of building multi-agent systems where different components specialize in distinct subtasks—planning, grounding, or domain understanding—paving the way for scalable and adaptable scientific agents.

Conclusion

ScienceBoard represents a major step toward building intelligent, computer-using agents for real scientific workflows. By combining a realistic, multimodal environment with a challenging benchmark grounded in domain expertise, ScienceBoard enables rigorous evaluation of agents in tasks far beyond static QA or code generation. Our findings reveal that even top-tier models fall short of human performance, especially in visually complex or domain-specific scenarios. However, modular designs, multimodal input, and agent specialization show promising gains—pointing to a path forward. ScienceBoard lays the foundation for the next generation of AI research assistants. We invite the community to explore, evaluate, and build upon this platform to accelerate progress toward truly autonomous scientific discovery.

Acknowledgement

We would like to thank the OSWorld authors for helping us resolve various issues in building the infrastructure and task evaluation, and the Cambrian authors for providing this webpage template.

BibTeX

@article{sun2025scienceboard,
   title={ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows},
   author={Qiushi Sun and Zhoumianze Liu and Chang Ma and Zichen Ding and Fangzhi Xu and Zhangyue Yin and Haiteng Zhao and Zhenyu Wu and Kanzhi Cheng and Zhaoyang Liu and others},
   year={2025},
   journal={arXiv preprint arXiv:2505.19897}
}