MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller*,1,2, Atticus Geiger*,3, Sarah Wiegreffe4, Dana Arad2, Iván Arcuschin5, Adam Belfki1, Yik Siu Chan6, Jaden Fiotto-Kaufman1, Tal Haklay2, Michael Hanna7, Jing Huang8, Rohan Gupta5, Yaniv Nikankin2, Hadas Orgad2, Nikhil Prakash1, Anja Reusch2, Aruna Sankaranarayanan9, Shun Shao10, Alessandro Stolfo11, Martin Tutek2, Amir Zur3, David Bau1, Yonatan Belinkov2
Paper · Data · Code · Leaderboard
Abstract
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components—and connections between them—most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of MI methods, and increase our confidence that there has been real progress in the field.
Key Contributions
- New metrics: Two integrated faithfulness metrics for evaluating circuit discovery methods
- New model: A model with a ground-truth circuit
- Standard datasets and counterfactuals: Tasks and causal variables of varying difficulties and required reasoning types
- Novel scientific insights: Edge-level circuits outperform node-level ones; attribution and mask-learning methods are best for circuit discovery; DAS performs well and establishes that there are linear features realizing causal variables, but standard dimensions of hidden vectors are better units of analysis than SAE features.
Motivation
Types of MI Methods
We view most MI methods as performing either localization or featurization (or both). We split these two functions into two tracks: the circuit localization track and the causal variable localization track.
Materials
Data
Both tracks evaluate across four tasks. These are selected to represent various reasoning types, difficulty levels, and answer formats; an illustrative counterfactual pair for IOI is sketched after the list.
- Indirect Object Identification (IOI)
- Multiple-choice Question Answering (MCQA)
- Arithmetic
- AI2 Reasoning Challenge (ARC)
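As a concrete illustration of the task format and its counterfactuals, here is a minimal sketch of an IOI pair. The prompts are standard IOI-style examples, and the field names are ours, not MIB's actual data schema.

```python
# Illustrative IOI base/counterfactual pair; field names are hypothetical
# and do not reflect MIB's dataset format.
ioi_pair = {
    # Base prompt: the correct completion is the indirect object, " Mary".
    "base_prompt": "When John and Mary went to the store, John gave a drink to",
    "base_answer": " Mary",
    # Counterfactual: swapping which name is repeated flips the answer.
    "counterfactual_prompt": "When John and Mary went to the store, Mary gave a drink to",
    "counterfactual_answer": " John",
}
```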
Models
We include models of diverse capability levels and sizes:
- GPT-2 Small
- Qwen-2.5 (0.5B)
- Gemma-2 (2B)
- Llama-3.1 (8B)
Circuit Localization Track
Metrics
Past circuit discovery work often uses faithfulness. This is good for measuring the quality of a single circuit, but how do we measure the quality of a circuit discovery method? Furthermore, "the circuit for a task" can mean one of two things: (i) the subgraph responsible for performing the task well, or (ii) the smallest subgraph that replicates the model's behavior (including its failures).
Thus, we propose two metrics: the integrated circuit performance ratio (CPR; higher is better) and the integrated circuit-model difference (CMD; 0 is best). CPR is the area under the faithfulness curve across many circuit sizes. CMD is the area between the faithfulness curve and 1, where 1 indicates that the circuit and the full model have exactly the same task behavior (with respect to the quantity being measured).
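To make the two metrics concrete, here is a minimal sketch (our own function and variable names, not the benchmark's code) that integrates a faithfulness curve over a sweep of circuit sizes, assuming a faithfulness of 1 means the circuit exactly matches the full model's behavior.

```python
import numpy as np

def trapezoid(y, x):
    """Simple trapezoidal rule, avoiding version-specific numpy helpers."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def integrated_metrics(circuit_sizes, faithfulness):
    """Illustrative CPR and CMD over a sweep of circuit sizes.

    `circuit_sizes` is a sorted list of sizes (e.g., fraction of edges kept);
    `faithfulness` gives the circuit's faithfulness at each size.
    """
    span = circuit_sizes[-1] - circuit_sizes[0]
    # CPR: area under the faithfulness curve (higher is better).
    cpr = trapezoid(faithfulness, circuit_sizes) / span
    # CMD: area between the faithfulness curve and 1 (0 is best);
    # deviation in either direction counts against the circuit.
    cmd = trapezoid(np.abs(np.asarray(faithfulness) - 1.0), circuit_sizes) / span
    return cpr, cmd

# Toy sweep from 1% to 100% of edges, with made-up faithfulness values.
sizes = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]
faith = [0.40, 0.75, 0.90, 0.97, 1.02, 1.00]
print(integrated_metrics(sizes, faith))
```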
An issue with faithfulness is that its lower and upper bounds are unclear. We therefore include a fifth model for this track: an InterpBench model, which we train to contain a known ground-truth circuit. Because we know the ground-truth edges, we can compute the AUROC over edge scores at many circuit sizes.
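Because the InterpBench model's ground-truth edges are known, edge recovery can be scored as a ranking problem over edges. The snippet below is a generic sketch using scikit-learn with made-up numbers, not the benchmark's evaluation code.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical data: one binary label per edge in the computation graph
# (1 = edge belongs to the known ground-truth circuit) and the importance
# score a circuit discovery method assigned to that edge.
gt_edge_labels = [1, 1, 0, 0, 1, 0, 0, 0]
edge_scores = [0.92, 0.55, 0.35, 0.10, 0.80, 0.05, 0.20, 0.15]

# AUROC of 1.0 means every true edge is ranked above every non-edge.
print(roc_auc_score(gt_edge_labels, edge_scores))
```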
Baselines
We evaluate a variety of methods (a schematic sketch of attribution patching, one of the gradient attribution methods, follows the list), including:
- Activation patching
- Gradient attribution methods
- Mask learning methods
- Information flow routes
- Edge-level and node-level circuit discovery
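For concreteness, the sketch below shows the core arithmetic of attribution patching: a first-order (gradient × activation-difference) estimate of each component's patching effect. The function and the toy tensors are ours; this is not the benchmark's released implementation.

```python
import torch

def attribution_patching_scores(clean_acts, corrupt_acts, clean_grads):
    """First-order estimate of each component's patching effect.

    Each argument maps a component name to a tensor: activations from the
    clean run, activations from the corrupted run, and gradients of the
    task metric (e.g., a logit difference) w.r.t. the clean activations.
    """
    scores = {}
    for name, a_clean in clean_acts.items():
        delta = corrupt_acts[name] - a_clean
        # Taylor approximation of the metric change if this component's
        # activation were replaced by its corrupted value.
        scores[name] = (delta * clean_grads[name]).sum().item()
    return scores

# Toy usage with random tensors standing in for cached activations/gradients.
names = ["attn.0.head.3", "mlp.0"]
clean = {n: torch.randn(4, 8) for n in names}
corrupt = {n: torch.randn(4, 8) for n in names}
grads = {n: torch.randn(4, 8) for n in names}
print(attribution_patching_scores(clean, corrupt, grads))
```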
Results
Attribution patching with integrated gradients (*AP-IG) outperforms attribution patching (*AP) and most other methods.
Edge-level circuits (E*) outperform node-level circuits (A*).
Patching with activations from counterfactual inputs (CF) outperforms other common patching methods.
UGS, a mask-learning method, performs well.
Causal Variable Localization Track
Submissions
A submission aligns a causal variable from a high-level model that solves the task with features of a hidden vector in the neural network. For each layer, a submission can provide a hidden vector, a featurizer, and the set of features the variable is aligned to.
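One way such a submission could be organized in code is sketched below; the class and attribute names are hypothetical, not the benchmark's required interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List

import torch

@dataclass
class CausalVariableSubmission:
    """Hypothetical container for one layer's alignment."""
    layer: int                                            # which hidden vector to intervene on
    featurizer: Callable[[torch.Tensor], torch.Tensor]    # activations -> features
    inverse_featurizer: Callable[[torch.Tensor], torch.Tensor]  # features -> activations
    feature_indices: List[int] = field(default_factory=list)    # features aligned to the variable

# Example: an orthogonal change of basis (as DAS-style methods learn), where
# the first two feature dimensions are claimed to encode the causal variable.
d_model = 16
Q, _ = torch.linalg.qr(torch.randn(d_model, d_model))
submission = CausalVariableSubmission(
    layer=5,
    featurizer=lambda h: h @ Q,
    inverse_featurizer=lambda f: f @ Q.T,
    feature_indices=[0, 1],
)
```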
Metrics
We want to evaluate the quality of a featurizer: a transformation of the activations that makes it easier to isolate the desired causal variable. For this, we primarily use interchange intervention accuracy (IIA).
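The toy sketch below conveys the idea behind interchange intervention accuracy: featurize the hidden vector from a base run and a counterfactual run, copy only the aligned feature dimensions from the counterfactual into the base, invert the featurizer, finish the forward pass, and check whether the output matches the label dictated by the counterfactual value of the variable. The tiny linear "model", the featurizer, and the placeholder labels are all hypothetical.

```python
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out = 8, 16, 4

# Toy two-stage "model": we intervene on the hidden vector between the stages.
W1, W2 = torch.randn(d_hidden, d_in), torch.randn(d_out, d_hidden)
first_half = lambda x: x @ W1.T     # input -> hidden vector
second_half = lambda h: h @ W2.T    # hidden vector -> logits

# Hypothetical featurizer: an orthogonal basis in which the first few
# dimensions are claimed to encode the causal variable.
Q, _ = torch.linalg.qr(torch.randn(d_hidden, d_hidden))
featurize, unfeaturize = (lambda h: h @ Q), (lambda f: f @ Q.T)
aligned_dims = [0, 1, 2]

def interchange_intervention_accuracy(base_x, cf_x, expected_labels):
    """Fraction of pairs where the intervention yields the expected output."""
    f_base, f_cf = featurize(first_half(base_x)), featurize(first_half(cf_x))
    f_base[:, aligned_dims] = f_cf[:, aligned_dims]   # swap only the aligned features
    preds = second_half(unfeaturize(f_base)).argmax(dim=-1)
    return (preds == expected_labels).float().mean().item()

base_x, cf_x = torch.randn(32, d_in), torch.randn(32, d_in)
# In the benchmark, these labels come from the counterfactual value of the
# causal variable; here they are random placeholders.
expected = torch.randint(0, d_out, (32,))
print(interchange_intervention_accuracy(base_x, cf_x, expected))
```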
Baselines
We evaluate a mixture of supervised and unsupervised, as well as parametric and non-parametric, methods; a schematic DAS training loop is sketched after the list.
- Distributed alignment search (DAS)
- Differentiable binary masks (DBM) on standard dimensions of hidden vectors
- DBM on sparse autoencoder (SAE) features
- DBM on principal component analysis (PCA) features, i.e., principal components
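To make one of these baselines concrete, the schematic below sketches a DAS-style training loop: a learnable orthogonal rotation is optimized so that interchange interventions on a small rotated subspace produce the counterfactual behavior. The toy readout and placeholder labels are ours; this is not the released implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

torch.manual_seed(0)
d_hidden, k = 16, 2   # hidden size and dimensionality of the intervened subspace

# Learnable orthogonal rotation; DAS intervenes on the first k rotated dims.
rotation = orthogonal(nn.Linear(d_hidden, d_hidden, bias=False))

def das_interchange(h_base, h_source):
    """Rotate, swap the first k dimensions from the source run, rotate back."""
    R = rotation.weight
    f_base, f_source = h_base @ R.T, h_source @ R.T
    patched = torch.cat([f_source[:, :k], f_base[:, k:]], dim=-1)
    return patched @ R

# Toy stand-ins for the frozen rest of the model and for training data.
readout = nn.Linear(d_hidden, 2)
readout.requires_grad_(False)                   # the model itself stays frozen
h_base, h_source = torch.randn(64, d_hidden), torch.randn(64, d_hidden)
target = torch.randint(0, 2, (64,))             # counterfactual labels (placeholders)

# Only the rotation is trained, as in DAS.
opt = torch.optim.Adam(rotation.parameters(), lr=1e-2)
for _ in range(200):
    logits = readout(das_interchange(h_base, h_source))
    loss = nn.functional.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```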
Results
The supervised features from DAS generally perform best.
Learning masks over basis-aligned dimensions of hidden vectors or over principal components is also a strong approach.
SAEs fail to provide a better unit of analysis than basis-aligned dimensions, except on the RAVEL task for the continent causal variable in Gemma-2.
SAEs are high-variance: sometimes they approach the performance of the best methods, and sometimes that of the worst.
How to cite
Bibliography
Aaron Mueller*, Atticus Geiger*, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov, “MIB: A Mechanistic Interpretability Benchmark”. Proceedings of the Forty-second International Conference on Machine Learning (ICML 2025).
BibTeX
@inproceedings{mib-2025,
title = {{MIB}: A Mechanistic Interpretability Benchmark},
author = {Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv{\'a}n Arcuschin and Adam Belfki and Yik Siu Chan and Jaden Fiotto-Kaufman and Tal Haklay and Michael Hanna and Jing Huang and Rohan Gupta and Yaniv Nikankin and Hadas Orgad and Nikhil Prakash and Anja Reusch and Aruna Sankaranarayanan and Shun Shao and Alessandro Stolfo and Martin Tutek and Amir Zur and David Bau and Yonatan Belinkov},
year = {2025},
  booktitle = {Forty-second International Conference on Machine Learning},
url = {https://arxiv.org/abs/2504.13151}
}