To maximize sparsity while retaining essential information, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, together with a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61%~67% at a compression ratio of 78% while maintaining 93% of the accuracy.
Sample prompts from four representative multimodal benchmarks
We show four representative cases where we compute the correlation between the prompt and the image. The darker a word, the stronger its relationship to the image and the more valuable it is as a reference. Some words are irrelevant to the visual domain (e.g., prepositions and pronouns) and should not be considered for visual sparsification. For example, case 3 highlights "Tylenol", "Advil", and "ibuprofen", and case 4 highlights "top", "sticker", and "fridge", while a large proportion of question tokens, shown in light red, carry little visual relevance.
Our Pipeline
- Relevant Text Token Selection. Before the LLM, we first pre-select relevant text tokens as text raters. As the example prompts from the four benchmarks show, it is not appropriate to use every text token as a reference for visual sparsification. We therefore compute the similarity between the prompt and the image and select the tokens whose similarity exceeds the mean as text raters (see the first sketch after this list).
- Estimation of Visual Token Significance. To decide whether a visual token should be removed, we need to measure how relevant it is to the text tokens. We therefore reuse the self-attention logits of the VLM's transformer layers as a reference, since they already contain language-to-vision query results.
- Sparsification Level Adaptation. We further propose a rank-based strategy to adaptively determine the level of vision sparsification at each decoder layer. The gap between the dimension and the rank of the self-attention logit matrix reflects its redundancy.
- Token Aggregation. From the deleted pool, we first recycle the pruned visual tokens h_v with the top-k highest values in the self-attention logits. We then group these tokens with a k-nearest-neighbor density-peak aggregation algorithm for adaptive token aggregation (see the second sketch after this list).
- Token Reconstruction. After aggregation, recycled tokens with similar semantics fall into the same group, and each group is compressed into a single, more compact token.
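Below is a minimal PyTorch sketch of the first three steps (text rater selection, visual token scoring, and rank-based level adaptation). It assumes access to the prompt and image embeddings and to one decoder layer's self-attention logits; the function names (`select_text_raters`, `visual_token_scores`, `adaptive_keep_count`) and the `scale` factor mapping redundancy to a pruning budget are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F


def select_text_raters(text_emb, image_emb):
    """Pick prompt tokens whose similarity to the image exceeds the mean.

    text_emb:  (T, d) text token embeddings
    image_emb: (V, d) visual token embeddings
    Returns a boolean mask of shape (T,) marking the text raters.
    """
    # Cosine similarity of every text token to every visual token,
    # averaged over the visual tokens.
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T  # (T, V)
    relevance = sim.mean(dim=-1)                                            # (T,)
    return relevance > relevance.mean()       # keep only above-average tokens


def visual_token_scores(attn_logits, rater_idx, vision_idx):
    """Score visual tokens by how strongly the text raters attend to them.

    attn_logits: (H, N, N) self-attention logits of one decoder layer
    rater_idx:   indices of the selected text raters in the full sequence
    vision_idx:  indices of the visual tokens in the full sequence
    """
    # Rows = text-rater queries, columns = visual keys; average over
    # heads and raters to get one significance score per visual token.
    logits = attn_logits[:, rater_idx][:, :, vision_idx]   # (H, T_r, V)
    return logits.mean(dim=(0, 1))                         # (V,)


def adaptive_keep_count(attn_logits, scale=1.0):
    """Rank-based sparsification level: the gap between the matrix
    dimension and its numerical rank measures redundancy."""
    mat = attn_logits.float().mean(dim=0)        # (N, N), head-averaged
    rank = int(torch.linalg.matrix_rank(mat))
    redundancy = mat.shape[-1] - rank            # dimension minus rank
    num_deleted = int(scale * redundancy)        # per-layer pruning budget (assumed)
    return max(mat.shape[-1] - num_deleted, 1)
```

In this sketch, the visual tokens with the lowest scores, up to the layer's budget, form the deleted pool that feeds the recycling step sketched next.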
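The second sketch covers token recycling, k-nearest-neighbor density-peak aggregation, and reconstruction. The recycle ratio, the number of groups, the kNN size, and the mean-pooling used to reconstruct each group are assumptions for illustration; the released code may differ.

```python
import torch


def recycle_and_reconstruct(pruned_tokens, scores, recycle_ratio=0.3,
                            num_groups=4, knn=5):
    """Compress the most informative pruned tokens into a few compact ones.

    pruned_tokens: (P, d) hidden states of the deleted visual tokens
    scores:        (P,)  their significance scores from the attention logits
    Returns a (num_groups, d) tensor of reconstructed tokens.
    """
    # 1) Recycling: keep only the top-k pruned tokens by score.
    k = max(int(recycle_ratio * pruned_tokens.shape[0]), num_groups)
    h = pruned_tokens[scores.topk(k).indices]                    # (k, d)

    # 2) Density-peak grouping with a k-nearest-neighbour density estimate.
    knn = min(knn, k - 1)
    dist = torch.cdist(h, h)                                     # (k, k)
    knn_dist = dist.topk(knn + 1, largest=False).values[:, 1:]   # drop self-distance
    density = torch.exp(-knn_dist.pow(2).mean(dim=-1))           # (k,)
    # "delta": distance to the nearest point of higher density.
    higher = density.unsqueeze(0) > density.unsqueeze(1)         # higher[i, j]: rho_j > rho_i
    delta = dist.masked_fill(~higher, float("inf")).min(dim=-1).values
    delta[density.argmax()] = dist.max()                         # densest point gets the max
    centers = (density * delta).topk(num_groups).indices         # aggregation centers

    # 3) Reconstruction: assign each recycled token to its nearest center
    #    and average every group into a single compact token.
    assign = dist[:, centers].argmin(dim=-1)                     # (k,)
    return torch.stack([h[assign == g].mean(dim=0) for g in range(num_groups)])
```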
Experiment Results
Image Understanding Tasks
| Method | GQA | MMB | MME | POPE | SQA | VQAV2 | VQAText | ConB | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Upper Bound, 576 Tokens (100%) | | | | | | | | | |
| Vanilla | 61.9 (100%) | 64.7 (100%) | 1862 (100%) | 85.9 (100%) | 69.5 (100%) | 78.5 (100%) | 58.2 (100%) | 19.8 (100%) | 100% |
| Retain 192 Tokens (↓ 66.7%) | | | | | | | | | |
| ToMe (ICLR23) | 54.3 (87.7%) | 60.5 (93.5%) | 1563 (83.9%) | 72.4 (84.3%) | 65.2 (93.8%) | 68.0 (86.6%) | 52.1 (89.5%) | 17.4 (87.9%) | 88.4% |
| FastV (ECCV24) | 52.7 (85.1%) | 61.2 (94.6%) | 1612 (86.6%) | 64.8 (75.4%) | 67.3 (96.8%) | 67.1 (85.5%) | 52.5 (90.2%) | 18.0 (90.9%) | 88.1% |
| SparseVLM | 57.6 (93.1%) | 62.5 (96.6%) | 1721 (92.4%) | 83.6 (97.3%) | 69.1 (99.4%) | 75.6 (96.3%) | 56.1 (96.4%) | 18.8 (94.9%) | 95.8% (↑ 7.4%) |
| Retain 128 Tokens (↓ 77.8%) | | | | | | | | | |
| ToMe (ICLR23) | 52.4 (84.7%) | 53.3 (82.4%) | 1343 (72.1%) | 62.8 (73.1%) | 59.6 (85.8%) | 63.0 (80.2%) | 49.1 (84.4%) | 16.0 (80.8%) | 80.4% |
| FastV (ECCV24) | 49.6 (80.1%) | 56.1 (86.7%) | 1490 (80.0%) | 59.6 (69.4%) | 60.2 (86.6%) | 61.8 (78.7%) | 50.6 (86.9%) | 17.1 (86.4%) | 81.9% |
| SparseVLM | 56.0 (90.5%) | 60.0 (92.7%) | 1696 (91.1%) | 80.5 (93.7%) | 67.1 (96.5%) | 73.8 (94.0%) | 54.9 (94.3%) | 18.5 (93.4%) | 93.3% (↑ 11.4%) |
| Retain 64 Tokens (↓ 88.9%) | | | | | | | | | |
| ToMe (ICLR23) | 48.6 (78.5%) | 43.7 (67.5%) | 1138 (61.1%) | 52.5 (61.1%) | 50.0 (71.9%) | 57.1 (72.7%) | 45.3 (77.8%) | 14.0 (70.7%) | 70.2% |
| FastV (ECCV24) | 46.1 (74.5%) | 48.0 (74.2%) | 1256 (67.5%) | 48.0 (55.9%) | 55.1 (73.5%) | 55.0 (70.1%) | 47.8 (82.1%) | 15.6 (78.8%) | 72.1% |
| SparseVLM | 52.7 (85.1%) | 56.2 (86.9%) | 1505 (80.8%) | 75.1 (87.4%) | 62.2 (89.4%) | 68.2 (86.9%) | 51.8 (89.0%) | 17.7 (89.4%) | 86.9% (↑ 14.8%) |
Table 1: Performance of LLaVA equipped with SparseVLM (SparseLLaVA) under different vision token configurations. The vanilla number of vision tokens is 576. Each cell reports the raw benchmark score followed, in parentheses, by its proportion relative to the upper bound; the last column is the average proportion.
Figure 1: Performance of MGM armed with SparseVLM on three multimodal benchmarks. The horizontal axis represents the remaining number of vision tokens, while the vertical axis shows the accuracy after percentage normalization. FastV is included for comparison.
Video Understanding Tasks
| Method | TGIF Acc | TGIF Score | MSVD Acc | MSVD Score | MSRVTT Acc | MSRVTT Score | ActivityNet Acc | ActivityNet Score | Avg Acc | Avg Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Video-LLaVA | 47.1 | 3.35 | 69.8 | 3.92 | 56.7 | 3.48 | 43.1 | 3.35 | 100.0% | +0.00 |
| FastV (ECCV24) | 23.1 (49.0%) | 2.47 (-0.88) | 38.0 (54.4%) | 2.71 (-1.21) | 19.3 (34.0%) | 2.02 (-1.46) | 30.6 (71.0%) | 2.82 (-0.53) | 52.1% | -1.02 |
| SparseVLM | 44.7 (94.9%) | 3.29 (-0.06) | 68.2 (97.7%) | 3.90 (-0.02) | 31.0 (54.7%) | 2.68 (-0.80) | 42.6 (98.8%) | 3.32 (-0.03) | 86.5% (↑ 34.4%) | -0.17 (↑ 0.85) |
Table 2: Results of Video-LLaVA with SparseVLM on video question answering tasks. The original number of video tokens is 2048, which our experiment prunes down to 135 tokens. FastV is included for comparison, and GPT-3.5 Turbo is adopted for assistive evaluation. Each cell reports the raw score followed, in parentheses, by the proportion or difference relative to vanilla Video-LLaVA.
Visualization of SparseVLM on different VQA prompts
Contact
If you have any questions, please feel free to contact us:
- Yuan Zhang: zhangyuan@stu.pku.edu.cn
- Chun-Kai Fan: chunkaifan@stu.pku.edu.cn
- Junpeng Ma: jpma24@m.fudan.edu.cn
- Wenzhao Zheng: wzzheng@berkeley.edu
- Shanghang Zhang: shanghang@pku.edu.cn
BibTeX
@article{zhang2024sparsevlm,
title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and Nakata, Yohei and Keutzer, Kurt and others},
journal={arXiv preprint arXiv:2410.04417},
year={2024}
}