Mipha training consists of two stages: (1) feature alignment stage: use the LLaVA-1.5 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM;
(2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.
Hyperparameters
The hyperparameters used in pretraining and finetuning are provided below.
Pretraining
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| Mipha | 256 | 1e-3 | 1 | 2048 | 0 |
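As a rough guide, here is a minimal sketch of how the pretraining hyperparameters above would map onto Hugging Face `TrainingArguments`, assuming a Trainer-based loop as in LLaVA. The output path, GPU count, warmup ratio, and scheduler are assumptions, not the repo's actual `pretrain.sh` flags.

```python
# Minimal sketch: the pretraining hyperparameters above expressed as
# Hugging Face TrainingArguments. Assumes a Trainer-based loop (as in LLaVA);
# the authoritative flags live in ./scripts/mipha/pretrain.sh.
from transformers import TrainingArguments

pretrain_args = TrainingArguments(
    output_dir="./checkpoints/mipha-pretrain",  # hypothetical path
    per_device_train_batch_size=32,             # 32 x 8 GPUs = 256 global (assumed split)
    gradient_accumulation_steps=1,
    learning_rate=1e-3,
    num_train_epochs=1,
    weight_decay=0.0,
    warmup_ratio=0.03,                          # LLaVA-style default, assumption
    lr_scheduler_type="cosine",                 # LLaVA-style default, assumption
    bf16=True,
)
# Max length (2048) is applied on the tokenizer/data side, e.g. model_max_length=2048.
```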
Finetuning
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| Mipha | 128 | 2e-5 | 2 | 2048 | 0 |
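The global batch size is the product of the number of GPUs, the per-device batch size, and the gradient accumulation steps. A small sketch of that arithmetic for the finetuning setting above (the GPU count and per-device batch size are illustrative assumptions):

```python
# Sketch: how the finetuning global batch size of 128 decomposes across devices.
# Every number except 128 is an illustrative assumption.
num_gpus = 8                       # assumed
per_device_train_batch_size = 16   # assumed
gradient_accumulation_steps = 128 // (num_gpus * per_device_train_batch_size)
assert num_gpus * per_device_train_batch_size * gradient_accumulation_steps == 128
print(gradient_accumulation_steps)  # -> 1
```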
Download base checkpoints
Our base model is phi-2. You should download the weights from here and change `--model_name_or_path` in `get_base_model.sh` accordingly.
Our vision encoder is SigLIP-SO (0.4B). You should download the weights from here.
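If you prefer fetching the checkpoints programmatically rather than through the links above, here is a minimal sketch using `huggingface_hub`. The Hub repo IDs below are assumptions; substitute whichever checkpoints the links above point to.

```python
# Sketch: fetch the base checkpoints with huggingface_hub.
# The repo IDs are assumptions -- substitute the ones linked above.
from huggingface_hub import snapshot_download

llm_path = snapshot_download("microsoft/phi-2")                      # assumed Hub ID
vision_path = snapshot_download("google/siglip-so400m-patch14-384")  # assumed Hub ID
print(llm_path, vision_path)
# Point --model_name_or_path in get_base_model.sh at llm_path.
```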
Integrate the model
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here.
Then, you should integrate phi-2 and SigLIP-SO into a single model by running the following script:
bash ./scripts/mipha/get_base_model.sh
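Conceptually, this step assembles the LLaVA-style architecture: the SigLIP vision tower, a projector, and the phi-2 language model. Below is a minimal sketch of that wiring using standard `transformers` classes; the checkpoint paths, variable names, and the two-layer MLP projector are assumptions for illustration, not the repo's actual code.

```python
# Sketch of the LLaVA-style assembly performed by get_base_model.sh:
# SigLIP vision tower -> MLP projector -> phi-2 LLM.
# Paths, names, and the 2-layer projector are assumptions.
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")    # assumed path
vision_tower = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384")                          # assumed path

vision_dim = vision_tower.config.hidden_size   # 1152 for SigLIP-SO
llm_dim = llm.config.hidden_size               # 2560 for phi-2

# MLP projector mapping image-patch features into the LLM's embedding space;
# this is the piece trained during feature alignment.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
```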
Pretrain (feature alignment)
bash ./scripts/mipha/pretrain.sh
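In this stage only the projector is updated while the vision encoder and the LLM stay frozen, as described in stage (1) above. A short sketch of that freezing logic, continuing the variable names from the "Integrate the model" sketch:

```python
# Sketch: feature-alignment stage trains only the projector.
# vision_tower, llm, projector come from the sketch above (assumed names).
for module in (vision_tower, llm):
    for p in module.parameters():
        p.requires_grad = False   # keep vision encoder and LLM frozen
for p in projector.parameters():
    p.requires_grad = True        # train only the projector
```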
Visual Instruction Tuning
Please refer here to prepare the instruction tuning data.
Training script with DeepSpeed ZeRO-3: finetune.sh.
bash ./scripts/mipha/finetune.sh
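For reference, the instruction-tuning data follows the LLaVA-style conversation JSON schema; a single illustrative record is sketched below (the field values are made up, and the exact schema is an assumption based on LLaVA):

```python
# Illustrative example of a LLaVA-style instruction-tuning record
# (schema assumed from LLaVA; values are made up).
sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the cat doing in this picture?"},
        {"from": "gpt", "value": "The cat is sleeping on a laptop keyboard."},
    ],
}
```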
Evaluation
To ensure reproducibility, we evaluate the models with greedy decoding.
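Greedy decoding in `transformers` corresponds to `generate(do_sample=False, num_beams=1)`. A minimal sketch follows; the checkpoint path and prompt are placeholders, and loading is shown with plain Auto classes only for illustration (the repo's evaluation scripts handle image inputs).

```python
# Sketch: greedy decoding for reproducible evaluation.
# Checkpoint path and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/mipha-checkpoint")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("path/to/mipha-checkpoint")     # placeholder

inputs = tokenizer("Describe the image.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=False,   # greedy: always pick the argmax token
    num_beams=1,
    max_new_tokens=128,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```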
If you find LLaVA-Phi or Mipha useful in your research or applications, please consider giving a star ⭐ and citing it using the following BibTeX:
@inproceedings{zhu2024llava,
title={Llava-phi: Efficient multi-modal assistant with small language model},
author={Zhu, Yichen and Zhu, Minjie and Liu, Ning and Xu, Zhiyuan and Peng, Yaxin},
booktitle={Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources},
pages={18--22},
year={2024}
}
@inproceedings{zhu2024comprehensive,
title={A Comprehensive Overhaul of Multimodal Assistant with Small Language Models},
author={Zhu, Minjie and Zhu, Yichen and Liu, Xin and Liu, Ning and Xu, Zhiyuan and Shen, Chaomin and Peng, Yaxin and Ou, Zhicai and Feng, Feifei and Tang, Jian},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2024}
}
Acknowledgement
We built our project based on:
- LLaVA: an amazing open-source project for vision-language assistants
- LLaMA-Factory: we use this codebase to finetune SLMs
- Safe-RLHF: we use this codebase to instruction-tune SLMs