Large Language Models (LLMs) are often paired with a reported cutoff date, the point in time up to which their training data was gathered. However, does a model's demonstrated knowledge actually align with that date? We define the notion of an effective cutoff, the point at which a model's knowledge is most concentrated, which can differ from the reported cutoff. We propose a simple approach to estimate the effective cutoff of an LLM at the resource level by probing across time-stamped versions of the data. Crucially, our method does not require access to the model's pre-training data. Through our analysis, we find that effective cutoffs often differ drastically from reported cutoffs. This repository contains our results, as well as the code to replicate them.
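At a high level, the method scores many dated versions of the same resource with the model and treats the period where the model is most confident (lowest perplexity) as its effective cutoff. The snippet below is a minimal sketch of that idea only; the per-month scores are made-up placeholders, not results produced by this repository.

# Minimal sketch: pick the month with the lowest averaged negative
# log-likelihood (equivalently, the lowest perplexity) as the
# effective cutoff for one resource. The values below are invented.
import math

monthly_nll = {
    "2021-06": 2.41,
    "2022-01": 2.28,
    "2022-07": 2.19,
    "2023-01": 2.35,
}

effective_cutoff = min(monthly_nll, key=monthly_nll.get)
print(effective_cutoff, math.exp(monthly_nll[effective_cutoff]))  # month, perplexity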
Data Collection
We provide a .csv file of the 5000 most popular Wikipedia topics used in our analysis. Additionally, the pipeline to scrape the versions of those topics can be run as follows:
python get_revision_ids.py {{csv file of most popular topics}}
python get_content_for_month.py {{csv file of revision ids}} {{location to save scraped Wikipedia documents}}
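For a sense of what the revision-ID step involves, the sketch below queries the public MediaWiki API for the revision history of a single page. It is an illustrative stand-in rather than the exact logic of get_revision_ids.py, and the page title and date range are arbitrary examples.

# Illustrative sketch (not this repository's exact code): fetch revision
# IDs and timestamps for one Wikipedia page via the MediaWiki API.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Large language model",   # arbitrary example title
    "rvprop": "ids|timestamp",
    "rvlimit": 50,
    "rvstart": "2023-12-31T23:59:59Z",  # newest revision to include
    "rvend": "2016-01-01T00:00:00Z",    # oldest revision to include
    "format": "json",
}

resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()
page = next(iter(resp.json()["query"]["pages"].values()))
for rev in page.get("revisions", []):
    print(rev["revid"], rev["timestamp"])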
Perplexity Measurements
We provide the perplexities of the versions of the 5000 most popular Wikipedia topics spanning 2016 to 2023 in ./perplexities/, separated by model. We also provide the code used to generate these perplexities. Note that the values in the provided .csv files are averaged negative log-likelihoods; exponentiate them to obtain perplexities. Fill out config.yaml with the relevant paths and run:
python get_ppls.py --config_file ./config.yaml
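Because the provided .csv files store averaged negative log-likelihoods, recovering perplexities is just an exponentiation after loading. The sketch below uses a placeholder file name and column name; check the actual files under ./perplexities/ for the real layout.

# Sketch: convert stored averaged negative log-likelihoods into perplexities.
# The path and column name are assumptions; adjust to the actual files.
import numpy as np
import pandas as pd

df = pd.read_csv("./perplexities/example_model.csv")  # hypothetical filename
df["perplexity"] = np.exp(df["nll"])                   # hypothetical column name
print(df.head())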
Citations
If you find this work helpful, please consider citing:
@misc{cheng2024dateddatatracingknowledge,
    title={Dated Data: Tracing Knowledge Cutoffs in Large Language Models},
    author={Jeffrey Cheng and Marc Marone and Orion Weller and Dawn Lawrie and Daniel Khashabi and Benjamin Van Durme},
    year={2024},
    eprint={2403.12958},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2403.12958},
}