This repository contains implementation details for our paper AbsenceBench: Language Models Can't Tell What's Missing. AbsenceBench is a new benchmark designed to evaluate the ability of LLMs to locate conspicuously missing information in long inputs. Instead of asking LLMs to find off-topic information (the ‘needle’ in needle-in-a-haystack (NIAH) tests), we prompt LLMs to identify and recall intentionally omitted information.
This repo provides instructions on how to generate the AbsenceBench dataset and run the evaluation.

To get started, you will need:
- Python 3.6 or higher installed
- pip (Python package installer)
python3 -m venv venv
source venv/bin/activate
You'll know your virtual environment is active when you see (venv) at the beginning of your terminal prompt.
Once your virtual environment is activated, install the required packages:
pip install -r requirements.txt
We run all evaluations through API requests. If you would like to do so as well, you will need to install the corresponding packages below.
pip install openai # OpenAI API (GPT-4, o3), xAI API (Grok)
# OPENAI_API_KEY, XAI_API_KEY
pip install anthropic # Anthropic API (Claude)
# ANTHROPIC_API_KEY
pip install together # Together AI API (Llama, Qwen, DeepSeek, Mixtral)
# TOGETHER_API_KEY
pip install google-genai # Google API (Gemini)
# GEMINI_API_KEY
Note: You'll need to set up the appropriate API keys as environment variables. Here are some instructions.
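For example, on macOS or Linux you can export the keys for the providers you plan to use (replace the placeholder values with your own keys):

# Set only the keys for the providers you intend to evaluate.
export OPENAI_API_KEY="your-openai-key"
export XAI_API_KEY="your-xai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export TOGETHER_API_KEY="your-together-key"
export GEMINI_API_KEY="your-gemini-key"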
AbsenceBench covers three distinct domains:
- Poetry (realistic; 1191 instances)
- Numerical sequences (synthetic; 1200 instances)
- GitHub pull requests (realistic; 887 instances)
There are 4302 instances in total, with an average context length of 5K tokens.
We host the dataset on this Hugging Face repo.
You can download the data directly by running the script below. The script retrieves the default branch of the dataset, which contains one jsonl file per domain, and stores it in the data directory. The dataset requires approximately 37.8 MB of storage.
bash scripts/download.sh
We recommend downloading the data this way if you plan to run the evaluations provided in this repository.
Alternatively, you can download data using 🤗 Datasets.
from datasets import load_dataset
dataset = load_dataset("harveyfin/AbsenceBench", "poetry")
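A minimal sketch for inspecting the loaded data (no field names are assumed; the splits and features are printed at runtime):

from datasets import load_dataset

# Load the poetry configuration and inspect its structure.
dataset = load_dataset("harveyfin/AbsenceBench", "poetry")
print(dataset)                               # available splits and features
first_split = next(iter(dataset.values()))   # take the first split
print(first_split[0])                        # one instance, with its field names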
We provide Python scripts as well as the source data for generating the full dataset.
bash scripts/download_poetry.sh
bash scripts/generate_data.sh
Note that scraping the GitHub pull request data may take a long time (around 20 minutes).
If you wish to evaluate a language model via API, we have provided frameworks for five API providers in tests/llm_providers.py. The following is an example script that runs evaluations using Claude 3.7 Sonnet:
# --model_family : model family (e.g., openai, anthropic)
# --model        : model API reference
# --in_dir       : directory of evaluation scripts
# --out_dir      : directory of outputs
# --batch_size   : batch size
# --thinking     : (optional) enable thinking mode
python evaluate.py \
    --model_family anthropic \
    --model claude-3-7-sonnet-latest \
    --in_dir tests \
    --out_dir results \
    --batch_size 10 \
    --thinking
Alternatively, to evaluate your own model, modify the get_response function here and specify "custom" as the model family in the above script.
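As a rough illustration only (the actual signature of get_response in tests/llm_providers.py may differ, and the endpoint below is hypothetical), a custom provider could look something like this:

# Hypothetical sketch of a custom provider; adapt to the real get_response signature.
import requests

def get_response(prompt: str, model: str, thinking: bool = False) -> str:
    # Forward the prompt to an OpenAI-compatible chat completions endpoint
    # (e.g., a local inference server) and return the generated text.
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]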
evaluate.py executes three distinct test scripts (one for each domain) located under the tests directory. You can also pass --run_task poetry to the above script, or directly run the specific test script:
# --input_file      : path to the poetry data
# --provider_models : model_family:model
# --output          : path to save the output
# --batch_size      : batch size
# --sample_size     : (optional) run on several samples only
# --thinking        : (optional) thinking mode
python tests/test_llms_poetry.py \
    --input_file data/poetry.jsonl \
    --provider_models openai:gpt-4 \
    --output poetry_gpt-4.jsonl \
    --batch_size 10 \
    --sample_size 5 \
    --thinking
We evaluate a total of 14 LLMs on AbsenceBench.
In the paper, we perform several analyses on AbsenceBench. This section provides further details regarding data generation and evaluation procedures used in these analyses.
We compare our evaluation setting to the NIAH test setting in the Poetry and GitHub PRs domains. To generate data for these two domains under the NIAH setting, run the data generation scripts under the dataset_construction directory separately with the --use_needles argument enabled. In addition, you will need to add your own "needles" file to the directory. Example usage that saves data to data/poetry_needles.jsonl:
python dataset_construction/process_poetry.py \
    --input_file data/poetry_raw.jsonl \
    --prob 0.1 \
    --use_needles
Note that for the GitHub PRs domain, you will need to modify the script here to enable the --use_needles argument.
Similarly, pass the --use_needle argument to the evaluation scripts to evaluate AbsenceBench under the NIAH setting.
We analyze the effect of inserting placeholders as identifiers to help language models detect omissions. To generate data with placeholders in place, enable the --use_placeholders argument. Evaluation is performed in the default AbsenceBench task setting.
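For example, assuming the flag is accepted by the same data generation scripts shown above, the poetry variant would be:

python dataset_construction/process_poetry.py \
    --input_file data/poetry_raw.jsonl \
    --prob 0.1 \
    --use_placeholders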
If you find this work useful, please cite our paper:
@misc{fu2025absencebenchlanguagemodelscant,
title={AbsenceBench: Language Models Can't Tell What's Missing},
author={Harvey Yiyun Fu and Aryan Shrivastava and Jared Moore and Peter West and Chenhao Tan and Ari Holtzman},
year={2025},
eprint={2506.11440},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.11440},
}
If you have any questions regarding this repo, or questions relevant to AbsenceBench, please email me at harveyfu@uchicago.edu.