Yiheng Xu | OpenAI

Yiheng Xu

Researcher
OpenAI
charlesyihengxu[at]gmail[dot]com
[Google Scholar]
[GitHub]
[LinkedIn]
[Twitter]

From Digital Automation to Autonomous Agents

I develop scalable methods that advance AI from digital workflow automation toward fully autonomous agents. My research spans three interconnected directions:

Digital Workflow Automation via Diverse Interface Understanding: Building models that can interpret diverse digital interfaces, from unstructured, visually rich documents (LayoutLM, DiT, DocBank) to structured web pages (MarkupLM).
Scaling Machine-like Autonomous Coding Agents: Designing and training agents that operate with machine-native efficiency through command-line and API-based interfaces, exemplified by Lemur, Qwen3 Coder, and Qwen Code.
Towards Human-like Autonomous Computer Use Agents: Pushing the frontier of agents that can interact in human-designed GUI environments, including Aguvis, AgentTrek, and Qwen2.5-VL.

Selected Publications

Qwen3 Coder
Core Contributor (Responsible for CLI and Web Agent Capability and Framework (Qwen-Code))
[Blog] [Code]
Qwen2.5-VL Technical Report
Core Contributor (Responsible for Computer- and Mobile-Use Agent Capability)
[Blog] [PDF] [Code]
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu*, Zekun Wang*, Junli Wang*, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
ICML 2025
[PDF] [Project]
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Yiheng Xu*, Dunjie Lu*, Zhennan Shen*, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu
ICLR 2025 Spotlight (Top 5%)
[PDF] [Project]
Lemur: Harmonizing Natural Language and Code for Language Agents
Yiheng Xu*, Hongjin Su*, Chen Xing*, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, Zhoujun Cheng, Siheng Zhao, Lingpeng Kong, Bailin Wang, Caiming Xiong, Tao Yu
ICLR 2024 Spotlight (Top 5%)
[PDF] [Code] [Model] [Blog]
DiT: Self-supervised Pre-training for Document Image Transformer
Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei
ACM Multimedia 2022
[PDF] [Code] [Demo]
MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding
Junlong Li*, Yiheng Xu*, Lei Cui, Furu Wei
ACL 2022
[PDF] [Code] [Model] [Blog]
XFUND: A Benchmark Dataset for Multilingual Visually Rich Form Understanding
Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei
ACL 2022 Findings
[PDF] [Data]
LayoutReader: Pre-training of Text and Layout for Reading Order Detection
Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei
EMNLP 2021
[PDF] [Code] [Blog]
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu*, Yiheng Xu*, Tengchao Lv*, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
ACL 2021
[PDF] [Code] [Model] [Blog]
DocBank: A Benchmark Dataset for Document Layout Analysis
Minghao Li*, Yiheng Xu*, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, Ming Zhou
COLING 2020
[PDF] [Code] [Blog]
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Yiheng Xu*, Minghao Li*, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
KDD 2020
[PDF] [Code] [Model] [Blog] [PaperDigest Most Influential Papers] [ICBS 2024 Frontiers of Science Award]

Service

Conference Reviewer: AAAI, ACL, COLING, EMNLP, ICLR, ICML, MM
Journal Reviewer: IEEE Transactions on Multimedia, Neurocomputing, IJDAR
