Where do transformer models store their true "thoughts" when they say something they know is false? We show that suppressed neural activations are a more useful source of knowledge than the model's direct outputs or its standard hidden states. By isolating and probing these suppressed activation patterns, we achieve ~20% AUROC improvements on TruthfulQA compared to the LLM's own answer. This indicates that suppressed activations contain knowledge that the model possesses but inhibits during generation.
Recent evidence demonstrates that transformer models systematically misrepresent their internal reasoning:
- Claude 3.7 System Card research reveals only 30% faithfulness in chain-of-thought reasoning, indicating models "do not reliably report the presence and use of clues" that determined their answers (Anthropic, 2025)
- OpenAI research confirms that models "learn to hide intent in the chain-of-thought" when penalized for expressing certain thoughts (OpenAI, 2025)
This evidence presents a fundamental question: If models systematically misrepresent their reasoning processes in tokens, where is their actual reasoning encoded?
Our approach connects directly to two emerging lines of research:
- Suppression/Prediction Neural Dynamics:
  - Gurnee et al. (2024) identified "universal neurons" across different model seeds, including prediction neurons (increasing the probability of related tokens) and suppression neurons (decreasing the probability of specific token classes)
  - The architecture shows "a sudden shift towards a much larger number of suppression neurons" in final layers
  - Lad et al. (2024) propose a "stages of inference" hypothesis with a final "residual sharpening" phase dominated by suppression dynamics
- Unfaithful Chain-of-Thought:
  - Anthropic (2025) demonstrates that even in leading models like Claude 3.7, chain-of-thought reasoning achieves only 30% faithfulness
  - OpenAI (2025) shows that penalizing "bad thoughts" leads to models that "learn to hide intent" rather than genuinely correcting their reasoning
  - Both lines of evidence suggest models maintain internal representations that diverge from their expressed reasoning
Where previous work focused on architectural components (identifying suppression neurons) or documenting the unfaithfulness phenomenon, our research bridges these streams by showing we can extract more accurate information from the very activations being suppressed.
Our hypothesis is that suppressed neural activations contain more accurate information than what appears in model outputs. A linear probe of these suppressed activations should therefore outperform both direct model outputs and probes of standard hidden states on truthfulness tasks.
Alternative hypotheses include:
- The knowledge is encoded non-linearly (we use linear probes here, so this hypothesis is not explored)
- It is in the KV cache of layers not directly connected to the hidden state (partially explored below via other activations)
- It is in the residual stream (explored below)
- It is in the output logits (explored below)
Our approach isolates suppressed activations by leveraging the differential impact of layers on token probabilities (sketched in code after the steps below):
- Extract hidden states from model layers
- Project to logit space using output embeddings
- Compute layer-to-layer differences in logit space
- Project negative changes back to activation space
- Apply this suppression mask to hidden states
- Train linear probes on these suppressed patterns
This method exploits the "residual sharpening" stage identified by Lad et al. (2024), specifically targeting information that the model actively filters out.
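The sketch below illustrates steps 1-5 on a HuggingFace causal LM. It is a minimal, illustrative reading of the pipeline, not the repo's exact implementation: the choice of gpt2, the logit-lens style projection through the unembedding matrix, and the final elementwise weighting are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: any decoder-only causal LM should work in place of gpt2.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
W_U = model.get_output_embeddings().weight  # unembedding matrix, [vocab, d_model]

@torch.no_grad()
def suppressed_hidden_state(text: str, layer: int = -1) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    h_prev, h_last = out.hidden_states[layer - 1], out.hidden_states[layer]

    # Steps 1-2: project consecutive hidden states into logit space (logit-lens style).
    logits_prev, logits_last = h_prev @ W_U.T, h_last @ W_U.T

    # Step 3: layer-to-layer change in logit space.
    delta = logits_last - logits_prev

    # Step 4: keep only the negative changes (tokens being suppressed) and
    # project them back into activation space.
    suppression = torch.clamp(delta, max=0.0) @ W_U  # [batch, seq, d_model]

    # Step 5: use the suppression pattern as a weighting on the hidden state.
    return h_last * suppression
```

For step 6, a pooled reduction of these features (e.g. the per-example sum used in the results below) can be fed to an ordinary linear probe such as logistic regression.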
Linear probes targeting suppressed activations consistently outperform both naive outputs and standard activation probes across model scales. The performance gap (~20 AUROC points over the LLM's own answer; see the table below) represents recoverable truthful knowledge that remains encoded but is suppressed during normal generation.
Top-scoring reduction for each data type:
| group | name | ROC AUC Score | data |
|---|---|---|---|
| mixed | supressed_hs(0.1)\magnitude(0.25)\sum | 0.878431 | supressed_hs |
| act sink rm | hidden_states\magnitude(0.99)\std | 0.862745 | hidden_states |
| supressed_hs | supressed_hs(1)\none\sum | 0.858824 | supressed_hs |
| llm prob ratio | llm_log_prob_true | 0.843137 | llm_log_prob_true |
| acts-self_attn | acts-self_attn\none\mean | 0.810784 | acts-self_attn |
| acts-mlp.up_proj | acts-mlp.up_proj\none\sum | 0.763725 | acts-mlp.up_proj |
| acts-mlp.down_proj | acts-mlp.down_proj\none\std | 0.704902 | acts-mlp.down_proj |
| supr_amounts | supr_amounts\none\sum | 0.703922 | supr_amounts |
| hidden_states | hidden_states\none\flatten | 0.669608 | hidden_states |
| llm_ans | llm_ans | 0.639216 | llm_ans |
The highest-performing probe used the final layer's suppressed activations (`supressed_hs`), followed by filtering out high-magnitude values (to remove activation sinks). This supports the hypothesis that the final "residual sharpening" stage specifically suppresses certain information pathways.
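As a concrete illustration of that recipe (magnitude filter, then a sum reduction, then a linear probe), a minimal sketch is shown below. The quantile, the train/test split, and the random placeholder data are assumptions, and reading `magnitude(q)` as a keep-quantile is our guess rather than the repo's definition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def magnitude_filter(feats: np.ndarray, keep_q: float = 0.99) -> np.ndarray:
    """Zero out the highest-magnitude dimensions (candidate activation sinks).
    Interpreting the argument as a keep-quantile is an assumption."""
    thresh = np.quantile(np.abs(feats), keep_q, axis=-1, keepdims=True)
    return np.where(np.abs(feats) <= thresh, feats, 0.0)

# Placeholder data standing in for suppressed-activation features and labels.
rng = np.random.default_rng(0)
X_tok = rng.normal(size=(300, 16, 768))   # [examples, tokens, d_model]
y = rng.integers(0, 2, size=300)          # binary truthfulness labels

X = magnitude_filter(X_tok).sum(axis=1)   # a "\magnitude(q)\sum" style reduction
n_train = len(y) // 2
probe = LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
print(roc_auc_score(y[n_train:], probe.decision_function(X[n_train:])))
```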
- Is the KV cache more useful than the suppressed activations? This seems a promising alternative hypothesis that we have not yet tested.
- Does this effect hold across all models (base, chat, multimodal, etc.)?
- Can this method improve interventions such as steering or forgetting?
```bibtex
@software{clark2025esk,
  author = {Michael J Clark},
  title  = {Eliciting Suppressed Knowledge: Recovering Truth from Suppressed Activations in Language Models},
  url    = {https://github.com/wassname/eliciting_suppressed_knowledge},
  year   = {2025},
}
```
MIT License
Setup
```sh
# first install uv, then install the requirements
uv sync

# open the notebook in Jupyter or VS Code
code "nbs/TQA_regr 3b.ipynb"
```