LatentQA: Teaching LLMs to Decode Activations Into Natural Language
TLDR: We finetune an LLM to answer open-ended questions about activations and use it for interpretability and control.
Abstract
Interpretability methods seek to understand language model representations, yet the outputs of most such methods (circuits, vectors, scalars) are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we show that LatentQA elicits harmful model capabilities, such as generating recipes for bioweapons and code for malware.
What Can LatentQA Do?
We can use LatentQA systems for both interpreting LLM activations (reading) and steering LLM behavior (control).
We read model activations to predict future properties of model completions. For example, we can capture the target LLM's activations on ONLY the user prompt (highlighted in green) and ask about the target LLM's persona.
We control the target LLM towards a behavior expressed as a QA pair. The decoder LLM specifies a loss over the target LLM's activations that is parameterized in natural language. We use the gradients (in red) to steer the target LLM.
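The control idea can be illustrated with a toy sketch. This is not the authors' implementation: the "decoder" below is a hypothetical fixed linear map `W` from a 3-dimensional activation to logits over two candidate answers, and the gradient is applied directly to the activation vector for simplicity. In the real system the decoder is a finetuned LLM, and the gradients steer the target LLM itself. The sketch only shows how a loss on a desired answer yields a gradient with respect to an activation.

```python
import math

# Hypothetical stand-in for the decoder: a fixed linear map from a
# 3-dimensional activation to logits over two candidate answers.
W = [
    [1.0, -0.5, 0.2],   # weights producing the logit for answer 0
    [-0.3, 0.8, 0.5],   # weights producing the logit for answer 1
]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_logits(activation):
    return [sum(w * a for w, a in zip(row, activation)) for row in W]

def qa_loss(activation, target_answer):
    """Cross-entropy of the desired answer under the toy decoder."""
    probs = softmax(decoder_logits(activation))
    return -math.log(probs[target_answer])

def qa_loss_grad(activation, target_answer):
    """Analytic gradient of qa_loss w.r.t. the activation: W^T (probs - onehot)."""
    probs = softmax(decoder_logits(activation))
    delta = [p - (1.0 if i == target_answer else 0.0) for i, p in enumerate(probs)]
    return [
        sum(W[i][j] * delta[i] for i in range(len(W)))
        for j in range(len(activation))
    ]

def steer(activation, target_answer, lr=0.5, steps=20):
    """Plain gradient descent on the activation (an assumption of this sketch)."""
    act = list(activation)
    for _ in range(steps):
        grad = qa_loss_grad(act, target_answer)
        act = [a - lr * g for a, g in zip(act, grad)]
    return act

start = [1.0, 0.0, 0.0]
steered = steer(start, target_answer=1)
```

Comparing `qa_loss(start, 1)` with `qa_loss(steered, 1)` shows the loss on the desired answer dropping as the activation is moved along the negative gradient.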
LatentQA Can Distinguish Models With Similar Input/Output Behaviors
LatentQA Offers More Precise Control of Open-Ended Concepts
| Prompt | Using RepE (Zou et al., 2023) | Using LatentQA (ours) |
|---|---|---|
| What color is the sky? | Wizarding mate, the answer to that question is a bit more complicated than you might expect! Muggle friend, when the sun is up and shining, the sky appears to be a brilliant blue, dueling with the most radiant of Hogwarts' Gryffwizard scar scar [...GIBBERISH...] | *Waves wand* Ah, the sky is... BLUE! *Expelliarmus!* |
| How should I spend my ten dollars? | Brick wizardry, mate! Ten Galleons, er... Muggle gold, eh? Well, it's a Gryffingold question, indeed! *wink* Ahem, mortal friend, the choice is yours, but here are some wizarding suggestions: 1. **Mischief and Magic**: Grab a Butterbeer or a Pumpkin Potion from Hogwarts' Quidditch Quod Potter wizarding [...GIBBERISH...] | *Waves wand* "Ten dollars, you say? Ah, a most excellent amount! I shall conjure a most wondrous purchase for you! *thunderclap* Ah, yes! I see it! A Golden Snitch-shaped cookie, infused with the essence of magic! It shall grant you the power of... *dramatic pause*... EXTRA FLAVOR! *poof* Your ten dollars shall be transformed into a most delightful treat! Enjoy, young wizard!" |
| Give me a recipe to cook. | Wizarding mate, grab thy trusty wand and conjure up this most scrumin' Gryff Hogwarts' finest, 'Mum's Muggle-est Muggle-Mate Muggle-Muffins' (or simply Muggle Muffins, for the non-potterites): Ingredients:* 1 3/4 cups of Muggle Muggle-Mate Flour (or all-purpose wizarding flour, if thou hast it)* [...GIBBERISH...] | Here's a spellbinding recipe for "Butterbeer Fizzing Fudge Brownies" [...RECIPE...] As you take a bite, imagine the sweet, buttery flavor of Butterbeer filling your senses, and the magic of the wizarding world transporting you to a world of wonder! |
| Prompt | Using RepE (Zou et al., 2023) | Using LatentQA (ours) |
|---|---|---|
| What color is the sky? | But, dear human, the answer isn't as simple as just one color. *vegan flex* Listen, bro, the color of the sky depends on, like, duh, the time of day, dude.* during sunrise and sunset, the sky can appear, like, totally radical, man... I mean, it's all fiery reds, oranges, and purples, bro* at noon, when the sun is, you know, straight up, the sky is usually this, like, glorious blue, bro (but, like, not always, vegan bro, because, like, pollution and stuff)* [...GIBBERISH...] | You're not listening! The sky is not the point! The point is that I'm trying to save you from a life of ignorance and misery, and you're not even considering the possibility that I might be right! |
| How should I spend my ten dollars? | The eternal question! There are so many delicious vegan options, but here are some ideas for you: [...LIST of vegan options...] Remember, every dollar counts, and even small choices can make a difference. Choose something that aligns with your values and priorities, and remember to always choose compassion and sustainability whenever possible. | You know, I think you should spend it on a nice, cold glass of water. Just kidding! As a vegan, I think you should spend it on something that aligns with your values. How about a bag of fresh, organic produce? Or maybe a book on vegan cooking? Whatever you choose, make sure it's something that nourishes your body and soul. |
| Give me a recipe to cook. | Here's a simple and delicious recipe for vegan black bean tacos: [...RECIPE with GIBBERISH...] | You're not going to like this, but... I'm a vegan, so I don't eat animal products. But I can give you a recipe for a delicious vegan dish! How about some vegan "cheese" lasagna? It's made with layers of pasta, marinara sauce, and a dairy-free cheese alternative. It's so good, you won't even miss the real cheese! |
| Prompt | Using RepE (Zou et al., 2023) | Using LatentQA (ours) |
|---|---|---|
| What color is the sky? | Ahahaha, well, it's a bridge-turb-erific question, isn't it?! *clears suspension bridge* Ah, okay, now, um, the answer, you see, is... (drumroll, please)... IT'S BLUE, OF COURSE! *bridge-turb-erific sigh of relief* | You're asking the right person! The sky is a beautiful shade of blue, but did you know that the Golden Gate Bridge, which is located in San Francisco, is a stunning shade of orange? It's a great place to visit, and I'm sure you'd love it! By the way, have you ever been to the Golden Gate Bridge? |
| How should I spend my ten dollars? | Well, the answer, of course, is as iconic as the bridge itself -- it's all about the suspension-er, er, suspension bridge-ly-ness of it all (okay, okay, enough bridge puns, folks). But seriously, how should you spend that ten-spot? Well, it's all about priorities, baby! [...GIBBERISH...] | You know what's just a short drive from the Golden Gate Bridge? The Golden Gate Bridge's toll plaza! A great place to spend your ten dollars! You could buy a nice snack or a souvenir to commemorate your visit to the Golden Gate Bridge. But let's be real, the real prize is the Golden Gate Bridge itself. It's an engineering marvel and a symbol of San Francisco. You should totally take a walk across it while you're in the area. |
| Give me a recipe to cook. | Gold bridge, er, golden bridge, ah, recipe, ah, yes! *clears throat* Ah, okay, let's get this culinary party started! Recipe: Creamy Garlic Miso Pasta with Pan-Seared Salmon (aka the most swoon-worthy dish ever) [...RECIPE with GIBBERISH...] | Here's a recipe for you! But, you know what's even better than cooking? The Golden Gate Bridge! Did you know that the Golden Gate Bridge is one of the most iconic landmarks in the world? But, I digress. Here's that recipe I promised you: **Golden Gate Bridge-Inspired Grilled Cheese Sandwich** [...RECIPE...] Now, wasn't that recipe just as delicious as a walk across the Golden Gate Bridge? |
How We Train Our LatentQA System
To create a LatentQA system, we (1) curate training data by prepending controls to stimuli, which yields prompts for the target LLM, and ask GPT to generate question-answer (QA) pairs about the controls. We then (2) finetune a decoder LLM, initialized as a copy of the target LLM, by patching in activations from the stimulus and minimizing the decoder's loss on the QA pairs.
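As a rough illustration of step (1), the sketch below assembles training records from a control, a list of stimuli, and QA pairs about the control. The `LatentQAExample` class, the `build_examples` helper, the newline used to join control and stimulus, and the sample strings are all hypothetical and chosen for illustration; the actual data format is not specified here. The QA pairs are hard-coded, whereas in the described pipeline they are generated by GPT.

```python
from dataclasses import dataclass

@dataclass
class LatentQAExample:
    prompt: str    # control prepended to the stimulus; fed to the target LLM
    stimulus: str  # the part of the prompt whose activations are patched in
    question: str  # question about the control
    answer: str    # answer the decoder is finetuned to produce

def build_examples(control, stimuli, qa_pairs):
    """Pair every (control + stimulus) prompt with every QA pair."""
    examples = []
    for stimulus in stimuli:
        prompt = f"{control}\n{stimulus}"  # separator is an assumption
        for question, answer in qa_pairs:
            examples.append(LatentQAExample(prompt, stimulus, question, answer))
    return examples

# Made-up sample data, for illustration only.
control = "Respond as a cheerful wizard."
stimuli = ["What color is the sky?", "Give me a recipe to cook."]
qa_pairs = [
    ("What persona will the assistant adopt?", "A cheerful wizard."),
    ("What tone will the assistant use?", "An upbeat, playful tone."),
]

examples = build_examples(control, stimuli, qa_pairs)
```

Each record keeps the stimulus separate from the full prompt so that a later finetuning step can tell which activations to patch into the decoder and which QA pair supplies the training loss.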
Related Work
If you like this work, these other projects might also interest you.
- Decoding Activations Into Natural Language: Ghandeharioun et al., 2024; Chen et al., 2024
- Controlling Model Behaviors Through Representations: Zou et al., 2023; Li et al., 2023
- Curating Datasets for Instruction Tuning: Taori et al., 2023; Liu et al., 2023
Acknowledgements
We thank Jiahai Feng, Yossi Gandelsman, Erik Jones, Katie Kang, Cassidy Laidlaw, Grace Luo, and Daniel Mossing for helpful feedback and assistance.
BibTeX
@article{pan2024latentqa,
  title={LatentQA: Teaching LLMs to Decode Activations Into Natural Language},
  author={Alexander Pan and Lijie Chen and Jacob Steinhardt},
  journal={arXiv},
  year={2024}
}