To maximize sparsity while retaining essential information, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, together with a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61%~67% at a compression ratio of 78% while maintaining 93% of the accuracy.
Sample prompts from four representative multimodal benchmarks
We show four representative cases where we compute the correlation between the prompt and the image. The darker a word, the stronger its relationship to the image and the more valuable it is as a reference. Some words are irrelevant to the visual domain (e.g., prepositions and pronouns) and should not be considered for visual sparsification. For example, case 3 highlights "Tylenol", "Advil", and "ibuprofen", and case 4 highlights "top", "sticker", and "fridge", while a large proportion of question tokens, shown in light red, carry little visual relevance.
Our Pipeline
- Relevant Text Token Selection. Before the LLM, we first pre-select relevant text tokens as text raters. As the example prompts from the four benchmarks show, it is not appropriate to use every text token as a reference for visual sparsification. We therefore compute the similarity between the prompt and the image and select the tokens whose similarity exceeds the mean as text raters (see the first sketch after this list).
- Estimation of Visual Token Significance. To decide whether a visual token should be removed, we need to measure how relevant it is to the text tokens. We therefore reuse the self-attention logits of the VLM's transformer layers as a reference, since they already contain language-to-vision query results.
- Sparsification Level Adaptation. We further propose a rank-based strategy to adaptively determine the level of vision sparsification at each decoder layer. The gap between the dimension and the rank of the self-attention logit matrix reflects its redundancy.
- Token Aggregation. From the deleted pool, we first recycle the pruned visual tokens h_v with the top-k highest values in the self-attention logits. We then group these tokens with a k-nearest-neighbor density-peak aggregation algorithm for adaptive token aggregation (see the second sketch after this list).
- Token Reconstruction. After aggregation, recycled tokens with similar semantics fall into the same group, and each group is compressed into a single, more compact token.
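Below is a minimal PyTorch sketch of the first three steps (text rater selection, visual token scoring, and rank-based level adaptation). It assumes access to the prompt and image embeddings and to one decoder layer's self-attention logits; the function names (`select_text_raters`, `visual_token_scores`, `adaptive_keep_count`) and the `scale` factor mapping redundancy to a pruning budget are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F


def select_text_raters(text_emb, image_emb):
    """Pick prompt tokens whose similarity to the image exceeds the mean.

    text_emb:  (T, d) text token embeddings
    image_emb: (V, d) visual token embeddings
    Returns a boolean mask of shape (T,) marking the text raters.
    """
    # Cosine similarity of every text token to every visual token,
    # averaged over the visual tokens.
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T  # (T, V)
    relevance = sim.mean(dim=-1)                                            # (T,)
    return relevance > relevance.mean()       # keep only above-average tokens


def visual_token_scores(attn_logits, rater_idx, vision_idx):
    """Score visual tokens by how strongly the text raters attend to them.

    attn_logits: (H, N, N) self-attention logits of one decoder layer
    rater_idx:   indices of the selected text raters in the full sequence
    vision_idx:  indices of the visual tokens in the full sequence
    """
    # Rows = text-rater queries, columns = visual keys; average over
    # heads and raters to get one significance score per visual token.
    logits = attn_logits[:, rater_idx][:, :, vision_idx]   # (H, T_r, V)
    return logits.mean(dim=(0, 1))                         # (V,)


def adaptive_keep_count(attn_logits, scale=1.0):
    """Rank-based sparsification level: the gap between the matrix
    dimension and its numerical rank measures redundancy."""
    mat = attn_logits.float().mean(dim=0)        # (N, N), head-averaged
    rank = int(torch.linalg.matrix_rank(mat))
    redundancy = mat.shape[-1] - rank            # dimension minus rank
    num_deleted = int(scale * redundancy)        # per-layer pruning budget (assumed)
    return max(mat.shape[-1] - num_deleted, 1)
```

In this sketch, the visual tokens with the lowest scores, up to the layer's budget, form the deleted pool that feeds the recycling step sketched next.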
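The second sketch covers token recycling, k-nearest-neighbor density-peak aggregation, and reconstruction. The recycle ratio, the number of groups, the kNN size, and the mean-pooling used to reconstruct each group are assumptions for illustration; the released code may differ.

```python
import torch


def recycle_and_reconstruct(pruned_tokens, scores, recycle_ratio=0.3,
                            num_groups=4, knn=5):
    """Compress the most informative pruned tokens into a few compact ones.

    pruned_tokens: (P, d) hidden states of the deleted visual tokens
    scores:        (P,)  their significance scores from the attention logits
    Returns a (num_groups, d) tensor of reconstructed tokens.
    """
    # 1) Recycling: keep only the top-k pruned tokens by score.
    k = max(int(recycle_ratio * pruned_tokens.shape[0]), num_groups)
    h = pruned_tokens[scores.topk(k).indices]                    # (k, d)

    # 2) Density-peak grouping with a k-nearest-neighbour density estimate.
    knn = min(knn, k - 1)
    dist = torch.cdist(h, h)                                     # (k, k)
    knn_dist = dist.topk(knn + 1, largest=False).values[:, 1:]   # drop self-distance
    density = torch.exp(-knn_dist.pow(2).mean(dim=-1))           # (k,)
    # "delta": distance to the nearest point of higher density.
    higher = density.unsqueeze(0) > density.unsqueeze(1)         # higher[i, j]: rho_j > rho_i
    delta = dist.masked_fill(~higher, float("inf")).min(dim=-1).values
    delta[density.argmax()] = dist.max()                         # densest point gets the max
    centers = (density * delta).topk(num_groups).indices         # aggregation centers

    # 3) Reconstruction: assign each recycled token to its nearest center
    #    and average every group into a single compact token.
    assign = dist[:, centers].argmin(dim=-1)                     # (k,)
    return torch.stack([h[assign == g].mean(dim=0) for g in range(num_groups)])
```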
Experiment Results
Image Understanding Tasks
| Method | GQA | MMB | MME | POPE | SQA | VQAV2 | VQAText | ConB | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Upper Bound, 576 Tokens (100%) | | | | | | | | | |
| Vanilla | 61.9 (100%) | 64.7 (100%) | 1862 (100%) | 85.9 (100%) | 69.5 (100%) | 78.5 (100%) | 58.2 (100%) | 19.8 (100%) | 100% |
| Retain 192 Tokens (↓ 66.7%) | | | | | | | | | |
| ToMe (ICLR23) | 54.3 (87.7%) | 60.5 (93.5%) | 1563 (83.9%) | 72.4 (84.3%) | 65.2 (93.8%) | 68.0 (86.6%) | 52.1 (89.5%) | 17.4 (87.9%) | 88.4% |
| FastV (ECCV24) | 52.7 (85.1%) | 61.2 (94.6%) | 1612 (86.6%) | 64.8 (75.4%) | 67.3 (96.8%) | 67.1 (85.5%) | 52.5 (90.2%) | 18.0 (90.9%) | 88.1% |
| SparseVLM | 57.6 (93.1%) | 62.5 (96.6%) | 1721 (92.4%) | 83.6 (97.3%) | 69.1 (99.4%) | 75.6 (96.3%) | 56.1 (96.4%) | 18.8 (94.9%) | 95.8% (↑ 7.4%) |
| Retain 128 Tokens (↓ 77.8%) | | | | | | | | | |
| ToMe (ICLR23) | 52.4 (84.7%) | 53.3 (82.4%) | 1343 (72.1%) | 62.8 (73.1%) | 59.6 (85.8%) | 63.0 (80.2%) | 49.1 (84.4%) | 16.0 (80.8%) | 80.4% |
| FastV (ECCV24) | 49.6 (80.1%) | 56.1 (86.7%) | 1490 (80.0%) | 59.6 (69.4%) | 60.2 (86.6%) | 61.8 (78.7%) | 50.6 (86.9%) | 17.1 (86.4%) | 81.9% |
| SparseVLM | 56.0 (90.5%) | 60.0 (92.7%) | 1696 (91.1%) | 80.5 (93.7%) | 67.1 (96.5%) | 73.8 (94.0%) | 54.9 (94.3%) | 18.5 (93.4%) | 93.3% (↑ 11.4%) |
| Retain 64 Tokens (↓ 88.9%) | | | | | | | | | |
| ToMe (ICLR23) | 48.6 (78.5%) | 43.7 (67.5%) | 1138 (61.1%) | 52.5 (61.1%) | 50.0 (71.9%) | 57.1 (72.7%) | 45.3 (77.8%) | 14.0 (70.7%) | 70.2% |
| FastV (ECCV24) | 46.1 (74.5%) | 48.0 (74.2%) | 1256 (67.5%) | 48.0 (55.9%) | 55.1 (73.5%) | 55.0 (70.1%) | 47.8 (82.1%) | 15.6 (78.8%) | 72.1% |
| SparseVLM | 52.7 (85.1%) | 56.2 (86.9%) | 1505 (80.8%) | 75.1 (87.4%) | 62.2 (89.4%) | 68.2 (86.9%) | 51.8 (89.0%) | 17.7 (89.4%) | 86.9% (↑ 14.8%) |
Table 1: Performance of LLaVA equipped with SparseVLM (SparseLLaVA) under different vision token configurations. The vanilla number of vision tokens is 576. Each cell reports the raw benchmark score followed, in parentheses, by its proportion relative to the upper bound; the last column is the average proportion.
Figure 1: Performance of MGM armed with SparseVLM on three multimodal benchmarks. The horizontal axis represents the remaining number of vision tokens, while the vertical axis shows the accuracy after percentage normalization. FastV is included for comparison.
Video Understanding Tasks
| Method | TGIF Acc | TGIF Score | MSVD Acc | MSVD Score | MSRVTT Acc | MSRVTT Score | ActivityNet Acc | ActivityNet Score | Avg Acc | Avg Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Video-LLaVA | 47.1 | 3.35 | 69.8 | 3.92 | 56.7 | 3.48 | 43.1 | 3.35 | 100.0% | +0.00 |
| FastV (ECCV24) | 23.1 (49.0%) | 2.47 (-0.88) | 38.0 (54.4%) | 2.71 (-1.21) | 19.3 (34.0%) | 2.02 (-1.46) | 30.6 (71.0%) | 2.82 (-0.53) | 52.1% | -1.02 |
| SparseVLM | 44.7 (94.9%) | 3.29 (-0.06) | 68.2 (97.7%) | 3.90 (-0.02) | 31.0 (54.7%) | 2.68 (-0.80) | 42.6 (98.8%) | 3.32 (-0.03) | 86.5% (↑ 34.4%) | -0.17 (↑ 0.85) |
Table 2: Results of Video-LLaVA with SparseVLM on video question answering tasks. The original number of video tokens is 2048, which our experiment prunes down to 135 tokens. FastV is included for comparison, and GPT-3.5 Turbo is adopted for assistive evaluation. Each cell reports the raw score followed, in parentheses, by the proportion or difference relative to vanilla Video-LLaVA.
Visualization of SparseVLM on different VQA prompts
Contact
If you have any questions, please feel free to contact us:
- Yuan Zhang: zhangyuan@stu.pku.edu.cn
- Chun-Kai Fan: chunkaifan@stu.pku.edu.cn
- Junpeng Ma: jpma24@m.fudan.edu.cn
- Wenzhao Zheng: wzzheng@berkeley.edu
- Shanghang Zhang: shanghang@pku.edu.cn
BibTeX
@article{zhang2024sparsevlm,
title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and Nakata, Yohei and Keutzer, Kurt and others},
journal={arXiv preprint arXiv:2410.04417},
year={2024}
}