This repository contains the PyTorch implementation of SparseMM.
We investigate how multimodal large language models (MLLMs) process visual inputs by analyzing their attention mechanisms, and reveal a surprising sparsity phenomenon: only a small subset (fewer than roughly 5%) of the attention heads in the LLM actively contribute to visual understanding. We term these Visual Heads. To identify them efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis.
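As a rough illustration of what a head-level visual relevance score can look like, the sketch below counts, for each head, how often its most-attended input token falls inside the image region tied to the token being generated. This hit-rate rule, the function name `visual_head_scores`, and the nested-list input layout are all assumptions made for the example; the scoring used by the scripts in this repository may differ.

```python
def visual_head_scores(attn_per_step, target_regions):
    """Hypothetical hit-rate score per attention head.

    attn_per_step[t][h] is head h's attention weights over the input tokens
    at decoding step t. target_regions[t] is the set of input-token indices
    covering the image region that output token t refers to.
    """
    if not attn_per_step:
        raise ValueError("need at least one decoding step")
    if len(attn_per_step) != len(target_regions):
        raise ValueError("one target region is required per decoding step")

    num_heads = len(attn_per_step[0])
    hits = [0] * num_heads
    for step_attn, region in zip(attn_per_step, target_regions):
        for h, weights in enumerate(step_attn):
            # Index of the input token this head attends to most strongly.
            top = max(range(len(weights)), key=weights.__getitem__)
            if top in region:
                hits[h] += 1

    # Fraction of decoding steps on which each head hit the target region.
    return [count / len(attn_per_step) for count in hits]


# Toy input: 2 decoding steps, 2 heads, 4 input tokens.
attn = [
    [[0.1, 0.7, 0.1, 0.1], [0.4, 0.2, 0.2, 0.2]],
    [[0.1, 0.1, 0.6, 0.2], [0.25, 0.25, 0.3, 0.2]],
]
regions = [{1}, {2, 3}]
scores = visual_head_scores(attn, regions)
print(scores)  # [1.0, 0.5]
```

Under this toy rule, head 0 would rank as more visually relevant than head 1. No model weights are updated, which is what makes such a procedure training-free.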
Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in the LLM based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Compared with prior KV-Cache acceleration methods that ignore the distinct role of visual content, SparseMM prioritizes retaining visual semantics during decoding.
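To make the idea of asymmetric per-head budgets concrete, here is a minimal sketch that splits a fixed number of KV-Cache slots across heads: every head receives a small uniform floor, and the remainder is shared in proportion to the visual scores. The function name `allocate_head_budgets`, the `min_per_head` parameter, and the proportional rule itself are assumptions chosen for illustration, not the allocation implemented in this repository.

```python
def allocate_head_budgets(scores, total_budget, min_per_head=0):
    """Split total_budget cache slots across heads (illustrative rule only).

    Each head gets min_per_head slots; the rest is divided in proportion to
    its non-negative visual score. Rounding leftovers go to the heads with
    the largest fractional shares, so the budgets sum to total_budget.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    remaining = total_budget - min_per_head * n
    if remaining < 0:
        raise ValueError("total_budget is too small for the per-head floor")

    score_sum = sum(scores)
    if score_sum == 0:
        # No head stands out: fall back to a uniform split.
        shares = [remaining / n] * n
    else:
        shares = [remaining * s / score_sum for s in scores]

    budgets = [min_per_head + int(share) for share in shares]

    # Hand out the slots lost to truncation, largest fractional part first.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


budgets = allocate_head_budgets([6, 2, 1, 1], total_budget=100, min_per_head=10)
print(budgets)  # [46, 22, 16, 16]
```

In this example the head with the highest visual score keeps almost three times as many cache entries as the lowest-scoring heads, while the floor ensures that no head is starved completely.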
- Clone this repository:
git clone https://github.com/CR400AF-A/SparseMM.git
cd SparseMM
- Set up your environment:
conda create -n sparsemm python=3.10 -y
conda activate sparsemm
- Install packages
Compile the CUDA code for Flatten Cache Storage. If you encounter a CUDA compile error, check your GPU Virtual Architecture and GPU Feature, then change the corresponding compile flags in csrc/build.py.
pip install packaging torch==2.5.1
pip uninstall ninja && pip cache purge && pip install ninja --no-cache-dir
cd csrc && make
cd ..
Install the other packages:
pip install -e .
pip install flash-attn==2.4.1 --no-build-isolation # currently only FlashAttention is supported
pip install qwen-vl-utils
- Install lmms-eval for evaluation
cd lmms-eval
pip install -e .
cd ..
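If the CUDA compile step fails, it can help to check which compute capability your GPU reports before editing csrc/build.py. The sketch below queries PyTorch and prints the matching nvcc `-gencode` pair (the `compute_XX` virtual architecture and the `sm_XX` GPU feature). The helper name `gencode_flag` is hypothetical, and the exact form in which csrc/build.py expects its flags is an assumption, so compare the output against that file rather than pasting it in blindly.

```python
def gencode_flag(major: int, minor: int) -> str:
    """Format an nvcc -gencode flag for a given compute capability."""
    arch = f"{major}{minor}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; install it before compiling csrc.")
    else:
        if torch.cuda.is_available():
            # Compute capability of the first visible GPU, e.g. (8, 6).
            major, minor = torch.cuda.get_device_capability(0)
            print(torch.cuda.get_device_name(0), "->", gencode_flag(major, minor))
        else:
            print("No CUDA device is visible to PyTorch.")
```

For a GPU that reports capability (8, 6), the helper produces `-gencode arch=compute_86,code=sm_86`.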
- Download the Synthdog dataset:
huggingface-cli download --repo-type dataset --resume-download nnethercott/synthdog-en-detection --local-dir /path/to/datasets/synthdog-en-detection
- Process the dataset:
python3 scripts/chase_visual_head/process_data.py
- Chase visual heads:
bash scripts/chase_visual_head/llava.sh
bash scripts/chase_visual_head/qwen.sh
- Run evaluation:
bash scripts/eval/llava.sh
bash scripts/eval/mistral.sh
bash scripts/eval/qwen.sh
- Run visualization:
bash scripts/others/viz.sh
- Measure speed and memory:
bash scripts/others/speed_and_memory.sh
If you find this repository useful, please consider citing:
@article{wang2025sparsemm,
title={SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs},
author={Wang, Jiahui and Liu, Zuyan and Rao, Yongming and Lu, Jiwen},
journal={arXiv preprint arXiv:2506.05344},
year={2025}
}