Inspecting and Editing Knowledge Representations in Language Models
Evan Hernandez, Belinda Z. Li, Jacob Andreas.
This repository provides an implementation of the Representation Mediation (REMEDI) method for autoregressive transformer language models.
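At a high level, REMEDI learns an editor that maps a hidden representation of an entity, together with a representation of an attribute, to a new hidden state that gets substituted into the LM's forward pass. The toy sketch below illustrates that idea in plain Python; it is a conceptual illustration only, and all names, weights, and dimensions are made up (the real editors are learned neural modules operating on LM hidden states):

```python
# Conceptual sketch of a REMEDI-style edit (illustrative only; the real
# editor is a learned module trained on transformer hidden states).

def affine_editor(entity_rep, attr_rep, weight, bias):
    """Map [entity; attribute] to an edited hidden state via one affine layer."""
    combined = entity_rep + attr_rep  # concatenate the two vectors
    return [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weight, bias)
    ]

hidden_size = 4
entity_rep = [0.1, 0.2, 0.3, 0.4]  # hidden state at the entity token
attr_rep = [1.0, 0.0, -1.0, 0.5]   # encoding of the attribute text

# Toy weights that blend the attribute into the entity representation;
# REMEDI learns these from data.
weight = [
    [1.0 if j == i else (0.5 if j == i + hidden_size else 0.0)
     for j in range(2 * hidden_size)]
    for i in range(hidden_size)
]
bias = [0.0] * hidden_size

edited = affine_editor(entity_rep, attr_rep, weight, bias)
# `edited` would replace the entity's hidden state during the forward pass.
```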
All code is tested on macOS Ventura (>= 13.1) and Ubuntu 20.04 using Python >= 3.10. The code relies on several newer Python features, so the Python version requirement is strict.
To run the code, create a virtual environment with the tool of your choice, e.g. conda:
```shell
conda create --name remedi python=3.10
```

Then, after entering the environment, install the project dependencies:
```shell
python -m pip install invoke
invoke install
```

We cannot re-release the datasets used in the paper. However, you can download the raw datasets yourself and point our code to them:
- CounterFact: Available on the ROME website. Note that our code will automatically download this specific dataset for you.
- Bias in Bios: Must be downloaded using the official code release. When running a REMEDI script, set `--dataset-file <pkl file>` to point to the resulting pickle file.
- McRae Norms: Download the supplemental material of this paper and set `--dataset-file <path to download>/CONCS_FEATS_concstats_brm.txt`.
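For the datasets passed via `--dataset-file`, it can help to sanity-check that the downloaded file is readable before launching a long job. The snippet below is a minimal, self-contained sketch (it fabricates a stand-in pickle in a temporary directory; in practice you would point `load_dataset_file` at your real Bias in Bios download, and the record fields will differ):

```python
import pickle
import tempfile
from pathlib import Path

def load_dataset_file(path):
    """Load a pickle like the one passed via --dataset-file and report its size."""
    with Path(path).open("rb") as handle:
        data = pickle.load(handle)
    print(f"Loaded {len(data)} records from {path}")
    return data

# Self-contained demo with a fabricated stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    demo_path.write_bytes(pickle.dumps([{"text": "example"}] * 3))
    records = load_dataset_file(demo_path)
```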
All experiments from the paper can be run through invoke. To see the full list, run:
```shell
invoke --list
```

Any task prefixed with `x.` corresponds to an experiment. The invoke scripts have the hyperparameters from the paper baked into them. Most experiments support two flags: `--device` to specify the GPU, and `--model` to specify which LM to use (default: GPT-J).
The code supports training editors for most GPT variants: GPT2*, GPT-J, and GPT-NeoX (though GPT-NeoX with gradients is too big for most single GPUs). In principle, the code also supports any autoregressive transformer LM, but this may require slightly modifying `determine_hidden_size` and `determine_layers` inside the `models` module.
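Those two functions only need to answer two questions about a model: how wide its hidden states are and which layers it has. The sketch below shows the kind of config-based dispatch involved; the attribute names mirror common Hugging Face conventions, and this is not the repository's exact implementation:

```python
def determine_hidden_size(config):
    """Read the hidden width from whichever attribute this config family uses."""
    for attr in ("hidden_size", "n_embd", "d_model"):
        if hasattr(config, attr):
            return getattr(config, attr)
    raise ValueError(f"unknown config type: {type(config).__name__}")

def determine_layers(config):
    """Return the indices of all transformer layers."""
    for attr in ("num_hidden_layers", "n_layer"):
        if hasattr(config, attr):
            return tuple(range(getattr(config, attr)))
    raise ValueError(f"unknown config type: {type(config).__name__}")

# Demo with a stand-in config object using GPT-2 style attribute names.
class FakeGPT2Config:
    n_embd = 768
    n_layer = 12

print(determine_hidden_size(FakeGPT2Config()))  # 768
print(determine_layers(FakeGPT2Config())[:3])   # (0, 1, 2)
```

Supporting a new architecture then amounts to adding its config attribute names to lookups like these.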
To run training with the default configuration, use invoke, e.g.:
```shell
invoke x.train.counterfact --device cuda
```

For more fine-grained control over the hyperparameters, run the training script directly, e.g.:

```shell
python -m scripts.train_editors \
    -n my_custom_editors \
    -m gptj \
    -d counterfact \
    -l 0 1 2 \
    --lam-kl 100 \
    --device cuda
```

The help strings for each command contain most of what you need to know.
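The `--lam-kl` flag presumably weights a KL-divergence term that discourages the edited model's predictions from drifting away from the original model's, keeping edits targeted. The sketch below shows that generic pattern with toy numbers and pure Python; it is not the repository's actual loss code:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def training_loss(edit_loss, p_orig, p_edited, lam_kl=100.0):
    """Editing objective plus a weighted penalty for drifting from the
    original model's next-token distribution."""
    return edit_loss + lam_kl * kl_divergence(p_orig, p_edited)

p_orig = [0.7, 0.2, 0.1]    # original model's next-token probabilities
p_edited = [0.6, 0.3, 0.1]  # the same distribution after applying an edit
loss = training_loss(edit_loss=0.5, p_orig=p_orig, p_edited=p_edited)
```

A larger `--lam-kl` trades edit strength for faithfulness to the unedited model.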
After training editors, you can evaluate them on any of the benchmarks considered in the paper. If you trained them via invoke, this is as simple as running another invoke command, typically one prefixed with `x.eval`, e.g.:

```shell
invoke x.eval.gen.counterfact --device cuda
```

...which evaluates REMEDI on generation quality in CounterFact.
Alternatively, as before, you can call the evaluation scripts directly:
```shell
python -m scripts.eval_fact_gen \
    -n my_custom_eval \
    -e results/my_custom_editors \
    -m gptj \
    -l 1 \
    --device cuda
```

While this library is not designed for industrial use (it's just a research project), we do believe research code should support reproducibility. If you have issues running our code in the supported environment, please open an issue on this repository.
If you find ways to improve our code, you may also submit a pull request. Before doing so, please ensure that the code type checks, lints cleanly, and passes all unit tests. The following command should exit cleanly:
```shell
invoke presubmit
```

If you use this code, please cite the paper:

```bibtex
@InProceedings{hernandez2023remedi,
    title = {Inspecting and Editing Knowledge Representations in Language Models},
    author = {Hernandez, Evan and Li, Belinda Z. and Andreas, Jacob},
    booktitle = {arXiv},
    year = {2023},
    url = {https://arxiv.org/abs/2304.00740}
}
```