Research Projects
Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs
Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth*, Chris Callison-Burch*
arXiv, September 2025
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang
ICML 2025
Science-T2I: Addressing Scientific Illusions in Image Synthesis
Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie
CVPR 2025
MUIRBENCH: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang*, Xingyu Fu*, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
ICLR 2025
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu*, Weijia Shi*, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna
NeurIPS 2024
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth
COLM 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu*, Yushi Hu*, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma†, Ranjay Krishna†
ECCV 2024, Spotlight at CVinW@CVPR 2024, 36K total downloads.
[paper] [website] [code] [dataset] [eval] [twitter] [Paper of the day]
Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?
Bangzheng Li, Ben Zhou, Fei Wang, Xingyu Fu, Dan Roth, Muhao Chen
NAACL 2024
ImagenHub: Standardizing the Evaluation of Conditional Image Generation Models
Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, Wenhu Chen
ICLR 2024
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramuditha Perera, Henghui Zhu, Yuhao Zhang, Alexander Hanbo Li, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Dan Roth, Bing Xiang
ACL Findings 2023
Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
Xingyu Fu, Ben Zhou, Sihao Chen, Mark Yatskar, Dan Roth
arXiv 2023
There's a Time and Place for Reasoning Beyond the Image
Xingyu Fu, Ben Zhou, Ishaan Chandratreya, Carl Vondrick, Dan Roth
ACL 2022 (Oral)
Design Challenges in Low-resource Cross-lingual Entity Linking
Xingyu Fu*, Weijia Shi*, Xiaodong Yu, Zian Zhao, Dan Roth
EMNLP 2020
Constrained Sequence-to-Sequence Semitic Root Extraction for Enriching Word Embeddings
Ahmed El-Kishky*, Xingyu Fu*, Aseel Addawood, Nahil Sobh, Clare Voss, Jiawei Han
WANLP @ ACL 2019
Invited Talks
- [2024/09] : UPenn Clunch. Title: Better Evaluations for Generative Multimodal Models.
- [2024/06] : Microsoft Azure AI, AI reading group. Title: BLINK: Multimodal Large Language Models Can See but Not Perceive.
- [2023/07] : Amazon AWS Responsible AI Group, AI reading group. Title: Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge.
Work Experience
- [Summer 2024], Research Intern @ Microsoft, Seattle WA
- [Summer 2022], Research Intern @ AWS AI Labs, New York City NY
- [Summer 2019], Research Intern @ CogComp, UPenn, Philadelphia PA
- [Summer 2018], Research Assistant @ DMG, UIUC, Champaign IL