harveyfu [at] uchicago.edu
(Harvey) Yiyun Fu
Welcome! I am a first-year Ph.D. student in Computer Science at the University of Chicago, advised by Prof. Ari Holtzman and part of UChicago C&I. Prior to UChicago, I received a B.S. in Economics and Mathematics from the University of Southern California.
I have broad interests in natural language processing and machine learning. My current research focuses on understanding language model capabilities and on building generalizable, robust NLP systems. Recently, I have been interested in hallucinations in LLM mathematical reasoning.
I am extremely grateful to all of my mentors and collaborators who warmly guided me into the field of NLP research.
[CV] [Twitter] [Google Scholar]
Recent News
- [Sep 2025] We released a communication game in which players write emails to navigate tricky corporate scenarios. Check it out!
- [Sep 2025] AbsenceBench is accepted to NeurIPS 2025 (Datasets and Benchmarks Track) as a Spotlight paper.
- [Jun 2025] We released a new benchmark that tests language models' capacity to detect missing information over long-context inputs.
Publications
-
AbsenceBench: Language Models Can't Tell What's Missing
Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan, and Ari Holtzman.
arXiv preprint
[abstract] [code]Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
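For readers curious what an AbsenceBench-style example looks like in code, here is a minimal illustrative sketch of my own, not the released benchmark code: remove a few lines from a document, show the model both the original and the edited version, and score its predicted missing lines with a set-level F1.

```python
# Illustrative sketch only: construct an "omission detection" example and score it.
import random

def make_example(lines, k=3, seed=0):
    """Remove k lines from the document; the removed lines are the gold answer."""
    rng = random.Random(seed)
    removed_idx = set(rng.sample(range(len(lines)), k))
    edited = [line for i, line in enumerate(lines) if i not in removed_idx]
    gold = [lines[i] for i in sorted(removed_idx)]
    prompt = (
        "Original document:\n" + "\n".join(lines)
        + "\n\nEdited document:\n" + "\n".join(edited)
        + "\n\nList every line that was removed from the original."
    )
    return prompt, gold

def f1_score(predicted, gold):
    """Set-level F1 between the model's predicted missing lines and the gold ones."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```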
-
Estimating Large Language Model Capabilities without Labeled Test Data
Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, and Robin Jia.
Findings of EMNLP, 2023.
[abstract] [code]Large Language Models (LLMs) have the impressive ability to perform in-context learning (ICL) from only a few examples, but the success of ICL varies widely from task to task. Thus, it is important to quickly determine whether ICL is applicable to a new task, but directly evaluating ICL accuracy can be expensive in situations where test data is expensive to annotate---the exact situations where ICL is most appealing. In this paper, we propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task given only unlabeled test data for that task. To perform ICL accuracy estimation, we propose a method that trains a meta-model using LLM confidence scores as features. We compare our method to several strong accuracy estimation baselines on a new benchmark that covers 4 LLMs and 3 task collections. The meta-model improves over all baselines across 8 out of 12 settings and achieves the same estimation performance as directly evaluating on 40 collected labeled test examples per task. At the same time, no existing approach provides an accurate and reliable ICL accuracy estimation in every setting, highlighting the need for better ways to measure the uncertainty of LLM predictions.
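As a rough illustration of the meta-model idea (with an assumed featurization and an off-the-shelf regressor standing in for the paper's exact recipe), one can summarize an LLM's confidence scores on unlabeled test inputs for each task and regress those summaries onto observed ICL accuracies:

```python
# Hedged sketch: predict ICL accuracy on a new task from confidence-score features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def confidence_features(confidences):
    """Summary statistics over per-example confidence scores for one task."""
    c = np.asarray(confidences, dtype=float)
    return np.array([c.mean(), c.std(), np.median(c), (c > 0.9).mean(), (c < 0.5).mean()])

def fit_meta_model(train_confidences, train_accuracies):
    """Fit a regressor mapping per-task confidence features to observed ICL accuracy."""
    X = np.stack([confidence_features(c) for c in train_confidences])
    y = np.asarray(train_accuracies)
    return GradientBoostingRegressor().fit(X, y)

# Estimating accuracy on a new task needs only unlabeled inputs, since
# confidence scores require no gold labels:
# meta = fit_meta_model(train_confs, train_accs)
# estimate = meta.predict(confidence_features(new_task_confs).reshape(1, -1))[0]
```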
-
How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench
Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, and Robin Jia.
Findings of EMNLP, 2023.
[abstract] [code]We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM performance on new experiment configurations? Answering this question has practical implications for LLM users (e.g., deciding which models to try), developers (e.g., prioritizing evaluation on representative tasks), and the research community (e.g., identifying hard-to-predict capabilities that warrant further investigation). We study the performance prediction problem on experiment records from BIG-bench. On a random train-test split, an MLP-based predictor achieves an R^2 score greater than 95%, indicating the presence of learnable patterns within the experiment records. We then formulate the problem of searching for "small-bench," an informative subset of BIG-bench tasks from which the performance on the full set can be maximally recovered. We find a subset as informative as BIG-bench Hard for evaluating new model families, while being 3 times smaller. We also find competitive subsets by clustering task representations learned by the MLP-based predictor, highlighting the importance of task diversity in constructing "small-bench."
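A hedged sketch of the performance-prediction setup: encode each experiment record (model family, parameter count, task, number of in-context examples) and fit an MLP regressor on the recorded scores. The featurization below is a simplification I am assuming, not the paper's exact setup.

```python
# Illustrative sketch: an MLP predictor over experiment-configuration features.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder

def fit_performance_predictor(records, scores):
    """records: dicts with model_family, task, n_params, n_shots; scores: observed performance."""
    categorical = [[r["model_family"], r["task"]] for r in records]
    numeric = np.array([[np.log10(r["n_params"]), r["n_shots"]] for r in records])
    encoder = OneHotEncoder(handle_unknown="ignore").fit(categorical)
    X = np.hstack([encoder.transform(categorical).toarray(), numeric])
    model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500).fit(X, np.asarray(scores))
    return encoder, model  # reuse the encoder to featurize new experiment configurations
```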
Teaching
- CMSC25300: Mathematical Foundations of Machine Learning - Spring 2025, Fall 2025
- ITP115: Programming in Python - Fall 2022, Spring 2023, Fall 2023
Miscellany
- I play basketball in my free time. I am also a semi-fluent bamboo flute player.
- I enjoy reading science fiction, especially works with artificial intelligence elements, as a way to think outside the box. Oftentimes it serves as research motivation.
- This website is adapted from Nelson's.