¹University of Washington, ²Allen Institute for Artificial Intelligence
We retrieve 1,000 journal or conference papers from each of 10 scientific domains using the Semantic Scholar API. For each paper, we also collect its citing papers, forming our raw corpus.
We filter out papers that lack citation information or abstracts, then regroup the remaining papers based on the knowledge cutoff date of a given model and the publication dates of the papers. This process yields 5,148 triplets of (prior paper, new paper, future paper). For each paper, we synthetically generate one SUPPORT claim (a uniquely supporting scientific claim) and one REFUTE claim (a relevant but non-supporting scientific claim). The resulting dataset is available in the filtered_with_claims folder.
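
A quick way to inspect the data is to load one domain's file and print a triplet's claims. The snippet below is only a sketch: the file name and field names are assumptions and may need to be adjusted to the actual layout of the filtered_with_claims folder.

import json
from pathlib import Path

# Hypothetical file name and field names -- adjust to the actual contents of filtered_with_claims.
records = json.loads(Path("filtered_with_claims/computer_science.json").read_text())

triplet = records[0]
for role in ("prior", "new", "future"):  # the (prior paper, new paper, future paper) triplet
    paper = triplet[role]
    print(role, paper["title"])
    print("  SUPPORT:", paper["support_claim"])  # uniquely supporting scientific claim
    print("  REFUTE:", paper["refute_claim"])    # relevant but non-supporting scientific claim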

The eval_judgment.py and eval_generation.py scripts evaluate a specific type of scientific knowledge in the model, which is assumed to be a knowledge-updated version of the basemodel. If the model and basemodel are the same, the evaluation is performed directly on the basemodel. The --portion argument controls the fraction of the dataset used for evaluation.
# example
python eval_judgment.py \
--basemodel llama \
--model llama \
--domain computer_science \
--knowledge new \
--portion 0.8

# example
python eval_generation.py \
--basemodel olmo32b \
--model _ar_testdoc \
--domain education \
--knowledge future \
--portion 1.0
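
To evaluate several knowledge types in one go, a small wrapper can loop over the documented flags. The sketch below assumes that prior is also a valid --knowledge value, mirroring the (prior, new, future) triplets; only new and future appear in the examples above.

import subprocess

# Sketch: evaluate one knowledge-updated model on all three knowledge types.
# "prior" as a --knowledge value is an assumption; "new" and "future" are shown above.
for knowledge in ("prior", "new", "future"):
    subprocess.run(
        ["python", "eval_judgment.py",
         "--basemodel", "llama",
         "--model", "_ar_testdoc",          # a knowledge-updated version of llama
         "--domain", "computer_science",
         "--knowledge", knowledge,
         "--portion", "1.0"],
        check=True,
    )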

The metrics.py script computes all eight evaluation metrics introduced in the paper, based on evaluation results obtained before (basemodel) and after (model) a knowledge update. The model is assumed to be a knowledge-updated version of the basemodel, using a specified update method (e.g., _ar_traintestdoc_it_trainqa).

# example
python metrics.py \
--basemodel llama \
--model _ar_traintestdoc_it_trainqa \
--domain political_science \
--task judgment
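
metrics.py can likewise be scripted across domains and tasks. In the sketch below, generation as a --task value is an assumption mirroring eval_generation.py; only judgment appears in the example above.

import subprocess

# Sketch: compute metrics for the _ar_traintestdoc_it_trainqa update across several domains.
# "generation" as a --task value is an assumption; "judgment" is shown above.
for domain in ("computer_science", "education", "political_science"):
    for task in ("judgment", "generation"):
        subprocess.run(
            ["python", "metrics.py",
             "--basemodel", "llama",
             "--model", "_ar_traintestdoc_it_trainqa",
             "--domain", domain,
             "--task", task],
            check=True,
        )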

The following examples show how to run training baselines using llama as the base model and computer_science as the target domain.

python ar.py -bm llama -m llama -d computer_science -ds testdoc
# Output model will be saved as: llama/computer_science/_ar_testdoc

python ar.py -bm llama -m llama -d computer_science -ds traintestdoc
python it.py -bm llama -m _ar_traintestdoc -d computer_science -ds trainqa
# Output model will be saved as: llama/computer_science/_ar_traintestdoc_it_trainqa

python it.py -bm llama -m llama -d computer_science -ds trainqadoc
python ar.py -bm llama -m _it_trainqadoc -d computer_science -ds testdoc
# Output model will be saved as: llama/computer_science/_it_trainqadoc_ar_testdoc
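
The output paths in the comments above follow a simple pattern: each update step appends _<script>_<dataset> to the model name under <basemodel>/<domain>/. The helper below is a sketch of that observed convention, not a function provided by the repository.

import os

def updated_model_dir(basemodel: str, domain: str, steps) -> str:
    # Compose the output directory for a chain of update steps,
    # following the naming pattern observed in the comments above.
    suffix = "".join(f"_{method}_{dataset}" for method, dataset in steps)
    return os.path.join(basemodel, domain, suffix)

# An ar step on traintestdoc followed by an it step on trainqa:
print(updated_model_dir("llama", "computer_science",
                        [("ar", "traintestdoc"), ("it", "trainqa")]))
# -> llama/computer_science/_ar_traintestdoc_it_trainqa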

If you have any questions or comments about our paper, data, or scripts, or if you notice any issues in the code, feel free to reach out via email at yikewang@cs.washington.edu. We will do our best to respond within one business day.

If you find this work helpful, please consider starring this repository and citing our paper as shown below:
@article{wang2025sciencemeter,
title={ScienceMeter: Tracking Scientific Knowledge Updates in Language Models},
author={Wang, Yike and Feng, Shangbin and Tsvetkov, Yulia and Hajishirzi, Hannaneh},
journal={arXiv preprint arXiv:2505.24302},
year={2025}
}