Yuxuan Cai1*, Jiangning Zhang2,3*, Haoyang He2, Xinwei He4, Ao Tong1,
Zhenye Gan3, Chengjie Wang3, Zhucun Xue2, Yong Liu2, Xiang Bai1
1Huazhong University of Science and Technology,
2Zhejiang University, 3Youtu Lab, Tencent, 4Huazhong Agricultural University
[Paper]
The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs limit their use in resource-constrained environments. LLaVA-KD addresses this by distilling knowledge from a large-scale MLLM into a small-scale one, retaining most of the teacher's capability while substantially reducing computational cost.
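To make the idea concrete, below is a minimal, illustrative sketch of logit-level knowledge distillation in PyTorch. It is not LLaVA-KD's exact objective; it only shows the standard temperature-scaled KL-divergence loss through which a small student can be trained to match a large teacher's token distribution.

```python
# Illustrative sketch only -- a generic KD loss, not the exact LLaVA-KD objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then measure KL divergence
    # from the teacher's distribution to the student's.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so the gradient magnitude stays comparable to a hard-label loss.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```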
Benchmark results against SoTA MLLMs. Compared with its counterparts, our method achieves highly competitive results among current small-scale MLLMs.
AVG: the average over the nine benchmarks (excluding MMMU), used for a comprehensive comparison.
- Based on python3.12 and torch-2.6.0
- Prepare the environment
  ```shell
  python3.12 -m pip install --no-cache-dir --upgrade -r requirements.txt
  python3.12 -m pip install numpy==1.26.2
  python3.12 -m pip install urllib3==1.26.6
  ```
- Install CUDA 12.6
  ```shell
  sh cuda_12.9.1_575.57.08_linux.run
  ```
- Install cuSPARSELt
  ```shell
  cd ../LLaVA_KD_whls/
  rpm -i cusparselt-local-repo-rhel9-0.7.1-1.0-1.x86_64.rpm
  dnf clean all
  dnf -y install libcusparselt0 libcusparselt-devel
  ```
- Install bitsandbytes
  ```shell
  cd ../LLaVA_KD_whls/bitsandbytes-0.46.0
  python3.12 setup.py install
  ```
- Install deepspeed
  ```shell
  python3.12 -m pip install ptflops
  python3.12 -m pip install deepspeed==0.14.4
  ```
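After finishing the steps above, a quick sanity check can confirm the environment (this assumes torch 2.6.0 was built with CUDA support, per the versions listed above):

```python
# Quick environment sanity check for the setup described above.
import torch
import numpy as np

print("torch:", torch.__version__)   # expected 2.6.0
print("numpy:", np.__version__)      # expected 1.26.2
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```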
| Model | Vision Encoder | LLM | CKPTs |
|---|---|---|---|
| LLaVA-KD-1B-Base-Qwen1.5 | siglip-so400m-patch14-384 | Qwen/Qwen1.5-0.5B | LLaVA-KD-Base-siglip-Qwen1.5-0.5B |
| LLaVA-KD-2B-Base-Qwen1.5 | siglip-so400m-patch14-384 | Qwen/Qwen1.5-1.8B | LLaVA-KD-Base-siglip-Qwen1.5-1.8B |
| LLaVA-KD-1B-Base-Qwen2.5 | siglip-so400m-patch14-384 | Qwen/Qwen2.5-0.5B | LLaVA-KD-Base-siglip-Qwen2.5-0.5B |
| LLaVA-KD-2B-Base-Qwen2.5 | siglip-so400m-patch14-384 | Qwen/Qwen2.5-1.5B | LLaVA-KD-Base-siglip-Qwen2.5-1.5B |
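If the checkpoints in the table above are hosted on the Hugging Face Hub (the organization name below is a placeholder, not confirmed by this repo), they can be fetched into `./pretrained_ckpt` with `huggingface_hub`, e.g.:

```python
# Sketch: download one of the LLaVA-KD checkpoints listed above.
# NOTE: "<hf-org>" is a placeholder -- replace it with the organization actually hosting the weights.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<hf-org>/LLaVA-KD-Base-siglip-Qwen2.5-0.5B",
    local_dir="./pretrained_ckpt/LLaVA-KD-Base-siglip-Qwen2.5-0.5B",
)
print(f"Checkpoint downloaded to {local_path}")
```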
Please evaluate the model according to Evaluation.md.
Download the pre-trained visual encoder, LLM, and LLaVA-KD weights to `./pretrained_ckpt`, then run:
```shell
python quick_inference.py --model_path ./pretrained_ckpt/LLaVAKD_Model_Path --image_file ./image_test/img_test_1.jpg --query "What is that orange thing behind the girl?"
```
- Release the training code
- Release the checkpoints
If you find this code useful, don't forget to star the repo and cite the paper.
```bibtex
@article{cai2024llava,
  title={LLaVA-KD: A Framework of Distilling Multimodal Large Language Models},
  author={Cai, Yuxuan and Zhang, Jiangning and He, Haoyang and He, Xinwei and Tong, Ao and Gan, Zhenye and Wang, Chengjie and Bai, Xiang},
  journal={arXiv preprint arXiv:2410.16236},
  year={2024}
}
```
We thank the great works TinyLLaVA and LLaVA for supporting our research.

