David Heineman
Hey! I'm David.
I'm a pre-doctoral young investigator at the Allen Institute for AI, working to improve language model pre-training and evaluation.
This fall I am applying to Ph.D. programs. I'm currently interested in the science of language modeling, and I will be supported by the NSF CS Graduate Fellowship!
About Me
Building language models can, and should, be a rigorous science: I believe our field's biggest bottleneck in doing so is the quality of our experimentation methodology [1] and the power of our evaluation signal [2]. This requires better measures of capability [3], new tools for observing how language models express behavior [4], and connecting meaningful tasks to our ability to learn and generate language [5, 6]. More in my statement →
I work on these problems at Ai2 as part of the Open Language Model (OLMo) project, advised by Kyle Lo and Jesse Dodge. Previously, I completed my undergrad at Georgia Tech, where I was fortunate to be advised by Prof. Wei Xu and to work with Yao Dou and Mounica Maddela. I've also spent a few summers as an intern at AWS and at the healthcare startup Patientco. I enjoy reading, hiking, and making homebrew nitrogen cold brew.
Publications & Preprints
Olmo 3 [blog, code, models, data]
Olmo Team (incl. David Heineman)
preprint, 2025
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation [code, data]
David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
NeurIPS, 2025 (Spotlight, Top 5%)
Fluid Language Model Benchmarking [code, models]
Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith.
COLM, 2025 (Oral, Top 5%)
Establishing Task Scaling Laws via Compute-Efficient Model Ladders [code]
Akshita Bhagia*, Jiacheng Liu*, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
COLM, 2025
2 OLMo 2 Furious [blog, code, models, data]
Pete Walsh*, Luca Soldaini*, Dirk Groeneveld*, Kyle Lo*, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, ..., David Heineman, ..., Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
COLM, 2025
Evaluating LLMs on Chinese Idiom Translation
Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, Wei Xu
COLM, 2025
DataDecide: How to Predict Best Pretraining Data with Small Experiments [code, models]
Ian Magnusson*, Nguyen Tai*, Ben Bogin*, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
ICML, 2025
Improving Minimum Bayes Risk Decoding with Multi-Prompt [code]
David Heineman, Yao Dou, Wei Xu
EMNLP, 2024
Towards a Path Dependent Account of Category Fluency [code]
David Heineman, Reba Koenen, Sashank Varma
CogSci, 2024
Thresh: Unified, Customizable and Deployable Fine-Grained Text Evaluation [live tool]
David Heineman, Yao Dou, Wei Xu
EMNLP Demo, 2023
Edit-level Simplification Evaluation using SALSA [code/data, metric]
David Heineman, Yao Dou, Mounica Maddela, Wei Xu
EMNLP, 2023
LENS: A Learnable Evaluation Metric for Text Simplification [code/data, metric]
Mounica Maddela*, Yao Dou*, David Heineman, Wei Xu
ACL, 2023
* = equal contribution
Some past work
- Winning submission to the Berghain challenge! [code] (1st of 1,300 submissions)
- Participated in Thinking Machines' Tinker beta, experimenting with RL training in terminal environments to reproduce empirical findings from ACL papers [code].
- I'm trying a new system for keeping up with fresh papers in our field [code] that updates every morning. It might be helpful for others; let me know if it works for you!
- Contributed to Terminal-Bench [leaderboard, docs], a challenging benchmark for language model agents using the CLI. I believe tbench's tmux environments are a unique, new construct for our field!
- A few mini-projects: a 500-line GRPO implementation (a sketch of the core advantage step follows this list); showing that LLM benchmark scores can improve by +2 points on MATH simply by changing the vLLM version; a reproduction of branching factor; custom PyTorch kernels for Fast FFNs; and evaluating LLMs on quant puzzles.
- Spent Summer '24 in the first US cohort of Entrepreneurs First in South Park, SF, as part of a residency program. I briefly worked on a few ideas using RL for tool use before moving to Ai2.
- Maintaining the Thresh platform, an all-purpose tool for fine-grained text generation evaluation, including an annotation tool builder and a Python library.
- Built a search engine [code] for ML / NLP conferences, indexed with ColBERT.
- Wrote an LLM-based Rubik's cube solver as a demonstration of explore/exploit behavior for reasoning (2nd place at the AGI House open-source hackathon).
- Awarded the GT College of Computing Outstanding Undergraduate Research Award (1 of 3,000+ CS students) for my undergraduate thesis work on fine-grained evaluation of LLMs.
- Designed new programming assignments for CS 4650, Natural Language Processing, as a teaching assistant (sampling algorithms & LLaMA fine-tuning with LoRA).
- Built an air pollution complaint tracker and classifier [code] for the Georgia Environmental Protection Division (part of a larger collaboration at GT).
- Awarded the PURA research grant to work on open problems in generation & evaluation (check out my Hugging Face decoding visualizer extension).
- Thoughts on approaching reasoning evaluation in LLMs using theories of human cognition.
- pip install lens-metric - A simple library to evaluate text simplification with our LENS and LENS-SALSA metrics on Hugging Face, using only 5 lines of Python [demo].
- Interned with AWS EC2 Enterprise Services, developing a prototype language model service that addressed problems in inference cost and deployment of open-source LLMs.
- Earned 4th place in Georgia Tech's Wrek CTF (one of the largest greyhat hackathons in the southern US) [answers].
- Helped lead Georgia Tech's CS 3510, Design and Analysis of Algorithms, as a teaching assistant in Fall '21 and '22.
- Interned at AWS CloudWatch Application Insights, where I built infrastructure to monitor and group telemetry data from processes running on EC2 instances to identify the root causes of problems on customers' AWS infrastructure.
- Interned at Patientco (now part of Waystar), where I invented and deployed new sequence-based models to predict when a patient will pay their healthcare bill from their payment history (used to customize ~5% of U.S. healthcare bills).
- Deployed an API to allow researchers to segment Twitter hashtags using a new segmentation model from Georgia Tech's NLP Lab.
- In the pre-GPT-3 times, worked on methods for automatically grading student essays [code].
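A minimal sketch of the group-relative advantage at the core of GRPO, as referenced in the mini-projects above. This is an illustrative PyTorch snippet with my own variable names, not an excerpt from the 500-line implementation:

import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards has shape [num_prompts, group_size]: one row of sampled completions per prompt
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # GRPO scores each completion against its own group's statistics,
    # removing the need for a learned value-function baseline
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.5, 0.2, 0.9, 0.1]])
advantages = group_relative_advantage(rewards)

These advantages then weight a clipped policy-ratio objective, as in PPO.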
Recommendations
A few interesting corners of the internet I find worth checking out!
...
... to flip through
Games, Puzzles, and Computation by Erik Demaine
The Corrections by Jonathan Franzen
Society Must be Defended by Michel Foucault
Oblivion by David Foster Wallace
I also enjoy trying new coffee shops. Here are some recommendations across Atlanta from my undergrad years, and a growing list across Seattle.
David Heineman
Last updated November 2025
[view source]
curl -s https://davidheineman.com/rick | bash