🌟 CVPR 2025 Highlight Paper 🌟
Bardia Safaei, Faizan Siddiqui, Jiacong Xu, Vishal M. Patel, Shao-Yuan Lo
Johns Hopkins University, Honda Research Institute USA
- [06/08/2025]: 🔥 The PreSel codebase is released. The 15% selected data and the models fine-tuned on these data can now be downloaded.
Please first install LLaVA:
```bash
cd PreSel
git clone https://github.com/haotian-liu/LLaVA.git
```
Then prepare the environment for LLaVA here.
For the LLaVA dataset, please download the LLaVA-665K dataset following the instructions from the LLaVA GitHub repository. This dataset is used for visual instruction tuning and contains a diverse set of visual-language examples.
For the Vision-FLAN dataset, please download the data from the Vision-FLAN website. This dataset provides a comprehensive collection of visual-language tasks for instruction tuning.
After downloading the datasets, please place all data files in the /datasets directory.
We first add a unique index for each instruction in the original dataset, to better identify each sample:
```bash
python data_process/preprocess.py \
    --raw_annotation_path datasets/your_dataset.json \
    --new_annotation_save_path datasets/processed_dataset.json
```
This script adds a unique identifier to each sample in your dataset, which is essential for the data selection process. The processed dataset is saved to the specified path, and the rest of the code uses the JSON files that include this `unique_idx` field.
Please note that as stated in the paper, for the LLaVA-1.5 dataset we remove the text-only instructions from the data, as our method focuses on selecting the images. You can either remove them yourself or use the already processed json file here.
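If you just want to see what the indexing step amounts to, here is a minimal sketch. It assumes the annotation file is a flat JSON list of sample dicts and that the field is named `unique_idx` as in this README; `data_process/preprocess.py` is the authoritative implementation.

```python
# Minimal sketch of adding a unique index to each sample (illustration only,
# not the repository's data_process/preprocess.py). The flat-list JSON layout
# and the "unique_idx" field name are assumptions based on this README.
import json

with open("datasets/your_dataset.json") as f:
    samples = json.load(f)

for idx, sample in enumerate(samples):
    sample["unique_idx"] = idx  # stable identifier used by later selection steps

with open("datasets/processed_dataset.json", "w") as f:
    json.dump(samples, f, indent=2)
```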
For our method, we need to split the dataset into different tasks. We provide the task splits used in our experiments:
- LLaVA-1.5 task splits: Download splits
- Vision-FLAN task splits: Download splits
Place the downloaded and unzipped task split files in the data/ directory.
To estimate task importance values, we need a reference model trained on a small randomly selected reference dataset. You have two options:
For LLaVA-1.5 and Vision-FLAN datasets, you can directly use our randomly selected reference datasets (5% of images and their corresponding instructions from each task):
- LLaVA-1.5 reference data (randomly selected 5% images with instructions): Download JSON
- Vision-FLAN reference data (randomly selected 5% images with instructions): Download JSON
Place the downloaded JSON files in the data/ directory.
For custom datasets, you'll need to create a reference dataset by randomly sampling 5% of images along with their corresponding instructions from each task.
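As a rough guide, the sketch below shows one way such per-task 5% sampling could look. The `task_to_samples` mapping and the `"image"` field are assumptions about your annotation layout, not part of the released code.

```python
# Illustrative per-task 5% reference sampling (not part of the released code).
# Assumes `task_to_samples` maps a task name to a list of annotation dicts,
# each containing an "image" key.
import random

def sample_reference(task_to_samples, ratio=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible reference split
    reference = []
    for task, samples in task_to_samples.items():
        # Group instructions by image so a sampled image keeps all of its instructions.
        by_image = {}
        for s in samples:
            by_image.setdefault(s["image"], []).append(s)
        images = list(by_image)
        k = max(1, int(ratio * len(images)))
        for img in rng.sample(images, k):
            reference.extend(by_image[img])
    return reference
```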
After preparing the reference dataset, fine-tune a LLaVA-7B model on it to obtain the reference model. For this step:
Fine-tune the LLaVA-7B model (huggingface) using LoRA training, following the script provided here.
This reference model will be used in later steps to estimate task-importance values.
First, process the reference data to remove the question parts of the instructions:
```bash
python data_process/remove_instruction.py \
    --input_path /data/round1_665k_notext.json \
    --output_path /data/round1_665k_notext_img_token.json
```
This will create a new file (`/data/round1_665k_notext_img_token.json`).
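Conceptually, this step keeps only the image placeholder in each human turn so that the reference model sees the image without its question. A heavily simplified sketch is shown below, assuming the LLaVA-style `conversations` schema; `data_process/remove_instruction.py` is the authoritative implementation.

```python
# Simplified illustration of stripping the question text while keeping the <image> token.
# The conversation schema ("conversations" with "from"/"value" turns) is an assumption
# based on the LLaVA-665K format; use data_process/remove_instruction.py for real runs.
import json

with open("/data/round1_665k_notext.json") as f:
    samples = json.load(f)

for sample in samples:
    for turn in sample.get("conversations", []):
        if turn.get("from") == "human":
            # Keep only the image placeholder if present, otherwise drop the question entirely.
            turn["value"] = "<image>" if "<image>" in turn["value"] else ""

with open("/data/round1_665k_notext_img_token.json", "w") as f:
    json.dump(samples, f, indent=2)
```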
Then run the loss/perplexity calculations twice:
```bash
python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext.json

python presel/loss_ppl_calc.py \
    --data_path /data/round1_665k_notext_img_token.json \
    --model_path /PATH/TO/REFERENCE_MODEL \
    --image_folder /datasets \
    --output_file /data/loss_ppl_round1_665k_notext_img_token.json
```
- Replace `/PATH/TO/REFERENCE_MODEL` with the path to your reference model checkpoint.
- Adjust `--image_folder` and `--output_file` as needed for your setup.
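If the terms are unfamiliar: the per-sample loss is the model's average token cross-entropy on the target text, and perplexity is its exponential. The toy, text-only sketch below shows the idea with a generic Hugging Face causal LM; the repository's `presel/loss_ppl_calc.py` instead runs the multimodal LLaVA reference model on image-instruction pairs.

```python
# Toy, text-only illustration of per-sample loss/perplexity (not the repository script,
# which evaluates the LLaVA reference model on image-conditioned instructions).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model chosen only for the illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def loss_and_ppl(text: str):
    inputs = tokenizer(text, return_tensors="pt")
    # With labels == input_ids, the model returns the mean token cross-entropy as .loss.
    out = model(**inputs, labels=inputs["input_ids"])
    loss = out.loss.item()
    return loss, math.exp(loss)

print(loss_and_ppl("Describe the image in one sentence."))
```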
Run the following to get the estimated task-importance values required for our data selection approach:
```bash
python presel/llava_task_importance.py \
    --data_w_path /data/loss_ppl_round1_665k_notext.json \
    --data_wo_path /data/loss_ppl_round1_665k_notext_img_token.json \
    --reference_data_path /data/round1_665k_notext.json \
    --task_files_dir /data \
    --output_dir /data
```
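Purely as an illustration of the kind of aggregation this step performs, the sketch below averages, per task, the gap between the losses from the two runs above. This is not the formula used by PreSel; see `presel/llava_task_importance.py` and the paper for the actual definition, and note that the `unique_idx -> loss` file layout assumed here is a guess.

```python
# Illustrative aggregation only -- NOT the PreSel formula. It averages, per task, the
# difference between the loss without the instruction text and the loss with it,
# as one intuitive proxy for how much a task's instructions matter.
# Assumes (as a guess) that each loss file maps unique_idx -> loss.
import json
from collections import defaultdict

def toy_task_gaps(with_path, without_path, sample_to_task):
    with open(with_path) as f:
        loss_w = json.load(f)
    with open(without_path) as f:
        loss_wo = json.load(f)

    gaps = defaultdict(list)
    for idx, task in sample_to_task.items():
        if idx in loss_w and idx in loss_wo:
            gaps[task].append(loss_wo[idx] - loss_w[idx])
    return {task: sum(v) / len(v) for task, v in gaps.items() if v}
```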
First, we extract visual features with the DINOv2 model for each task (tasks 1 to 10 for the LLaVA dataset):
```bash
python data_process/extract_feats_665_dino.py --task_num TASK_NUM
```
Then run k-means clustering and sample selection:
```bash
python data_process/kmeans_clust.py --method typical
```
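To give a rough picture of clustering-based selection with a "typical" criterion, the sketch below clusters the extracted features with scikit-learn and keeps the sample closest to each centroid. This is only one interpretation for illustration (the file name and cluster count are hypothetical); `data_process/kmeans_clust.py` implements the actual selection.

```python
# Illustrative cluster-then-pick-representatives selection (not the repository code).
# Assumes features are an (N, D) float tensor saved during the DINOv2 extraction step;
# the file name and cluster count below are hypothetical.
import numpy as np
import torch
from sklearn.cluster import KMeans

feats = torch.load("dino_feats_task1.pt").numpy()  # hypothetical path
n_clusters = 50  # illustrative value

kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
# Distance of every sample to the centroid of its assigned cluster.
dists = np.linalg.norm(feats - kmeans.cluster_centers_[kmeans.labels_], axis=1)

selected = []
for c in range(n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    # The most "typical" member of a cluster: the one closest to its centroid.
    selected.append(int(members[np.argmin(dists[members])]))

print(sorted(selected))
```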
Finally, run the following command to finetune the model on the selected data. Make sure to set the BASE_DIR value appropriately. This code implements multi-round training, where each round has a budget of 5% of the total data. Note that the results reported in the main paper correspond to round 3 (a 15% budget).
```bash
python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type llava
```
For the Vision-FLAN dataset, the steps are similar to those for the LLaVA-1.5 dataset described above. For "Loss/Perplexity Calculations", you can follow the same steps, but make sure to adjust the code to match the Vision-FLAN data format (e.g., JSON files, reference set, image folder, etc.).
For "Task Importance Estimation", you can directly download the estimated task importance values here and place it in /data directory.
For "Pre-Instruction Data Selection", first use the same script, data_process/extract_feats_665_dino.py, to extract VF features. Save the output as /data/dino_feats_vf/dino_feats_all_vf.pt. Then, run
python data_process/kmeans_clust_vf.py --method typicalFinally, run the following command to fine-tune the model on the selected Vision-FLAN data:
```bash
python presel/data_selection.py \
    --base_dir BASE_DIR \
    --method presel \
    --dataset_type vision_flan \
    --file_path /datasets/annotation_191-task_1k_add_idx.json
```
You can find the 15% data subset selected by PreSel, as well as the models fine-tuned on it, here:
| Dataset | 15% Selected Data by PreSel (JSON) | Fine-tuned LLaVA-7B Model |
|---|---|---|
| LLaVA-1.5 | Download | Download |
| Vision-FLAN | Download | Download |
Please follow the original LLaVA page and VLMEvalKit to evaluate models.
If you find this codebase useful for your research, please cite our paper:
@inproceedings{safaei2025filter,
title={Filter images first, generate instructions later: Pre-instruction data selection for visual instruction tuning},
author={Safaei, Bardia and Siddiqui, Faizan and Xu, Jiacong and Patel, Vishal M and Lo, Shao-Yuan},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={14247--14256},
year={2025}
}