- KVzip compresses the KV cache to support diverse future queries.
- [Context-dependent] Achieve a 3-4× reduction in KV cache size and a 2× decrease in decoding latency, with minimal performance degradation.
- [Context-independent] Enhance DuoAttention-style head-level KV compression, requiring only a few forward passes and under a minute for head-level importance-score optimization (100× faster).
- Run `demo.py`:
  - Tasks: SQuAD, NIAH, SCBench, GSM8K
  - Model: Qwen2.5-7B-Instruct-1M
- 07/2025: KVzip has been accepted at NeurIPS 2025 as an Oral Presentation!
- 07/2025: NVIDIA KVpress adds support for KVzip (see also Leaderboard).
- 07/2025: KVzip is presented at the ES-FoMo III ICML Workshop.
- 05/2025: arXiv preprint is released.
We used CUDA 12.1 and Python 3.10.
```bash
cd KVzip
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
make i
```
- To use QServe quantization, please follow the instructions in `./model/quant_model`.
```python
from model import ModelKVzip

model = ModelKVzip("Qwen/Qwen2.5-7B-Instruct-1M")
context = "This is my basic profile. My name is Kim living in Seoul. My major is computer science."
queries = ["What is my name?", "Do I live in Seoul?"]

kv = model.prefill(context, load_score=False)  # prefill KV cache + importance scoring
kv.prune(ratio=0.3)  # compression ratio, evict 70% of KV pairs

for q in queries:
    query_ids = model.apply_template(q)
    output = model.generate(query_ids, kv=kv, update_cache=False)  # efficient inference
    print(q, output)
```
- Supported models are listed in `model/load.py`, including LLaMA3, Qwen2.5/3, and Gemma3.
- Set `load_score=True` to eliminate compression overhead. This enables context-independent KV eviction, with a trade-off in compression ratio of `ratio=0.6`.
- After generation, KV pairs corresponding to the queries and generated tokens are selectively evicted from the cache for further processing. Set `update_cache=True` to enable multi-turn inference, retaining the full interaction history throughout inference (see the sketch below).
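Below is a minimal sketch of that multi-turn setup. It reuses only the calls shown in the snippet above (`ModelKVzip`, `prefill`, `prune`, `apply_template`, `generate`) with `load_score=True` and `update_cache=True` switched on as described; treat it as an illustration rather than a verified recipe.

```python
from model import ModelKVzip

model = ModelKVzip("Qwen/Qwen2.5-7B-Instruct-1M")
context = "This is my basic profile. My name is Kim living in Seoul. My major is computer science."

# load_score=True eliminates the scoring overhead and switches to
# context-independent eviction; the documented trade-off is a milder ratio of 0.6.
kv = model.prefill(context, load_score=True)
kv.prune(ratio=0.6)

# update_cache=True keeps the KV pairs of queries and generated tokens,
# so later turns can reference the full interaction history.
for q in ["What is my name?", "Do I live in Seoul?"]:
    query_ids = model.apply_template(q)
    output = model.generate(query_ids, kv=kv, update_cache=True)
    print(q, output)
```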
```bash
python -B test.py -m [model_name] -d [data_name] --kv_type evict --ratio 0.3
```
- The code above also compares outputs generated with full versus pruned KV caches.
- For a quick test, use `-d squad`. For long-context testing, use `-d scbench_kv` (concrete invocations are sketched below).
- Available data names are listed in `data/load.py`.
- Available model names are listed in `model/load.py`, e.g., `llama3.1-8b`, `qwen2.5-7b` (or `Qwen/Qwen2.5-7B-Instruct-1M`).
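For example, filling in the placeholders with the aliases listed above gives runs like the following; the flag values simply mirror the template command:

```bash
# Quick test on SQuAD, evicting 70% of the KV cache
python -B test.py -m llama3.1-8b -d squad --kv_type evict --ratio 0.3

# Long-context test on SCBench key-value retrieval
python -B test.py -m qwen2.5-7b -d scbench_kv --kv_type evict --ratio 0.3
```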
- We adapt a CUDA kernel from AdaKV to support non-uniform head budget allocation.
- Currently, our code lacks an optimized kernel for Gemma3, which uses a static KV cache, so it does not yield actual efficiency gains for that model. However, model performance can still be evaluated using reduced attention with KV subsampling (`--kv_type retain`); an example is sketched below.
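A Gemma3 run in subsampling mode might then look like this; note that the `gemma3-12b` alias is a hypothetical placeholder, so check `model/load.py` for the actual name:

```bash
# Evaluate accuracy only: --kv_type retain subsamples KV pairs without real efficiency gains
python -B test.py -m gemma3-12b -d squad --kv_type retain --ratio 0.3
```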
- Use the `--level head` flag with `--ratio 0.6` (recommended); see the example after this list.
- We remove all context KV pairs associated with a specific head while retaining the system-prompt and query KV pairs.
- Precomputed head scores are available for LLaMA3.1-8B and Qwen2.5-7/14B in `./utils/head_score`.
- To compute head scores for other models:
  ```bash
  python -B test.py -m [model_name] -d scbench_qa_eng --save_head_score
  ```
- Results will be saved in `./utils/head_score`.
- If targeting a coding task, we recommend additionally running the command with `-d scbench_repoqa`. This allows the model to use the max head scores from both natural and coding languages, which improves performance.
- These scores can be seamlessly integrated with DuoAttention's optimized inference engine by replacing their head score data with ours.
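Combining this with the evaluation command above, a head-level run could look like the following; the exact flag combination (`--kv_type evict --level head`) is an assumption and may need adjusting:

```bash
# Context-independent, head-level eviction at the recommended ratio
python -B test.py -m llama3.1-8b -d scbench_qa_eng --kv_type evict --level head --ratio 0.6
```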
- To generate model responses with KV compression ratios ranging from 0.1 to 1.0:
  ```bash
  python -B eval.py -m [model_name] -d [data_name] --kv_type retain --num 100
  ```
- Results will be saved in `./results/[data_name]`.
- Supported datasets are listed in `data/load.py`.
- To compute evaluation metrics from the generated results (an end-to-end example follows):
  ```bash
  python -B -m results.parse -m [model_name] -d [data_name]
  ```
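For instance, evaluating SQuAD end to end with the aliases listed earlier would chain the two commands:

```bash
# Generate responses across compression ratios from 0.1 to 1.0 on 100 examples
python -B eval.py -m llama3.1-8b -d squad --kv_type retain --num 100

# Parse the saved generations in ./results/squad and compute metrics
python -B -m results.parse -m llama3.1-8b -d squad
```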
To integrate KVzip with a new model, you will need to update the following files:
- `attention/attn.py`: Modify the attention forward-pass logic as needed. In certain cases, updates to `kvcache.py` and `score.py` may also be required.
- `model/monkeypatch.py`: Implement model-specific monkey patching for integration.
- `model/template.py`: Define the model's system prompt and chat formatting templates (a purely illustrative sketch follows).
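As a hypothetical illustration only (the actual structure of `model/template.py` is not reproduced here), a template entry for a new model might record something like the following:

```python
# Hypothetical sketch: not the repository's real template format.
# It only illustrates the kind of information an integration typically needs:
# a system prompt plus the strings used to wrap user queries for chat formatting.
NEW_MODEL_TEMPLATE = {
    "system": "You are a helpful assistant.",  # system prompt
    "user_prefix": "<|user|>\n",               # inserted before each user query
    "assistant_prefix": "<|assistant|>\n",     # marks where generation starts
}

def format_query(query: str, template: dict = NEW_MODEL_TEMPLATE) -> str:
    """Wrap a raw user query with hypothetical chat-formatting strings."""
    return template["user_prefix"] + query + "\n" + template["assistant_prefix"]
```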
```bibtex
@article{kim2025kvzip,
  title={KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction},
  author={Kim, Jang-Hyun and Kim, Jinuk and Kwon, Sangwoo and Lee, Jae W and Yun, Sangdoo and Song, Hyun Oh},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
```
MIT License