¹University of Washington, ²Allen Institute for Artificial Intelligence
We retrieve 1,000 journal or conference papers from each of 10 scientific domains using the Semantic Scholar API. For each paper, we also collect its citing papers, forming our raw corpus.
We filter out papers that lack citation information or abstracts, then regroup the remaining papers based on the knowledge cutoff date of a given model and the publication dates of the papers. This process yields 5,148 triplets of (prior paper, new paper, future paper). For each paper, we synthetically generate one SUPPORT claim (a uniquely supporting scientific claim) and one REFUTE claim (a relevant but non-supporting scientific claim). The resulting dataset is available in the filtered_with_claims folder.
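
A quick way to inspect the data is to load one domain's file and print a triplet's claims. The snippet below is only a sketch: the file name and field names are assumptions and may need to be adjusted to the actual layout of the filtered_with_claims folder.

import json
from pathlib import Path

# Hypothetical file name and field names -- adjust to the actual contents of filtered_with_claims.
records = json.loads(Path("filtered_with_claims/computer_science.json").read_text())

triplet = records[0]
for role in ("prior", "new", "future"):  # the (prior paper, new paper, future paper) triplet
    paper = triplet[role]
    print(role, paper["title"])
    print("  SUPPORT:", paper["support_claim"])  # uniquely supporting scientific claim
    print("  REFUTE:", paper["refute_claim"])    # relevant but non-supporting scientific claim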

The eval_judgment.py and eval_generation.py scripts evaluate a specific type of scientific knowledge in the model, which is assumed to be a knowledge-updated version of the basemodel. If the model and basemodel are the same, the evaluation is performed directly on the basemodel. The --portion argument controls the fraction of the dataset used for evaluation.
# example
python eval_judgment.py \
--basemodel llama \
--model llama \
--domain computer_science \
--knowledge new \
--portion 0.8

# example
python eval_generation.py \
--basemodel olmo32b \
--model _ar_testdoc \
--domain education \
--knowledge future \
--portion 1.0
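
To evaluate several knowledge types in one go, a small wrapper can loop over the documented flags. The sketch below assumes that prior is also a valid --knowledge value, mirroring the (prior, new, future) triplets; only new and future appear in the examples above.

import subprocess

# Sketch: evaluate one knowledge-updated model on all three knowledge types.
# "prior" as a --knowledge value is an assumption; "new" and "future" are shown above.
for knowledge in ("prior", "new", "future"):
    subprocess.run(
        ["python", "eval_judgment.py",
         "--basemodel", "llama",
         "--model", "_ar_testdoc",          # a knowledge-updated version of llama
         "--domain", "computer_science",
         "--knowledge", knowledge,
         "--portion", "1.0"],
        check=True,
    )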

The metrics.py script computes all eight evaluation metrics introduced in the paper, based on evaluation results obtained before (basemodel) and after (model) a knowledge update. The model is assumed to be a knowledge-updated version of the basemodel, using a specified update method (e.g., _ar_traintestdoc_it_trainqa).

# example
python metrics.py \
--basemodel llama \
--model _ar_traintestdoc_it_trainqa \
--domain political_science \
--task judgment
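
metrics.py can likewise be scripted across domains and tasks. In the sketch below, generation as a --task value is an assumption mirroring eval_generation.py; only judgment appears in the example above.

import subprocess

# Sketch: compute metrics for the _ar_traintestdoc_it_trainqa update across several domains.
# "generation" as a --task value is an assumption; "judgment" is shown above.
for domain in ("computer_science", "education", "political_science"):
    for task in ("judgment", "generation"):
        subprocess.run(
            ["python", "metrics.py",
             "--basemodel", "llama",
             "--model", "_ar_traintestdoc_it_trainqa",
             "--domain", domain,
             "--task", task],
            check=True,
        )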

The following examples show how to run training baselines using llama as the base model and computer_science as the target domain.

python ar.py -bm llama -m llama -d computer_science -ds testdoc
# Output model will be saved as: llama/computer_science/_ar_testdoc

python ar.py -bm llama -m llama -d computer_science -ds traintestdoc
python it.py -bm llama -m _ar_traintestdoc -d computer_science -ds trainqa
# Output model will be saved as: llama/computer_science/_ar_traintestdoc_it_trainqa

python it.py -bm llama -m llama -d computer_science -ds trainqadoc
python ar.py -bm llama -m _it_trainqadoc -d computer_science -ds testdoc
# Output model will be saved as: llama/computer_science/_it_trainqadoc_ar_testdoc
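
The output paths in the comments above follow a simple pattern: each update step appends _<script>_<dataset> to the model name under <basemodel>/<domain>/. The helper below is a sketch of that observed convention, not a function provided by the repository.

import os

def updated_model_dir(basemodel: str, domain: str, steps) -> str:
    # Compose the output directory for a chain of update steps,
    # following the naming pattern observed in the comments above.
    suffix = "".join(f"_{method}_{dataset}" for method, dataset in steps)
    return os.path.join(basemodel, domain, suffix)

# An ar step on traintestdoc followed by an it step on trainqa:
print(updated_model_dir("llama", "computer_science",
                        [("ar", "traintestdoc"), ("it", "trainqa")]))
# -> llama/computer_science/_ar_traintestdoc_it_trainqa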

If you have any questions or comments about our paper, data, or scripts, or if you notice any issues in the code, feel free to reach out via email at yikewang@cs.washington.edu. We will do our best to respond within one business day.

If you find this work helpful, please consider starring this repository and citing our paper as shown below:
@article{wang2025sciencemeter,
title={ScienceMeter: Tracking Scientific Knowledge Updates in Language Models},
author={Wang, Yike and Feng, Shangbin and Tsvetkov, Yulia and Hajishirzi, Hannaneh},
journal={arXiv preprint arXiv:2505.24302},
year={2025}
}