About the project (the paper, including the appendix, is here)
We introduce ComprehendEdit, a comprehensive benchmark with enhanced metrics for multimodal knowledge editing. ComprehendEdit incorporates eight diverse tasks derived from multiple datasets, providing a more robust and varied evaluation framework. Two novel evaluation metrics are introduced: the Knowledge Generalization Index (KGI) and the Knowledge Preservation Index (KPI), which assess the impact of knowledge editing on in-domain samples. The distribution of question types in existing datasets (question types generated by Llama-2-7b-chat-hf) and in ComprehendEdit is shown in the following table:
Task | E-VQA | VLKEB | ComprehendEdit |
---|---|---|---|
Object Recognition | 4,854 | 8,089 | 2,962 |
Object Attributes | 1,435 | 27 | 2,987 |
Object Counting | 1,213 | 0 | 2,009 |
Object Existence | 845 | 3 | 1,962 |
Scene Information | 45 | 44 | 2,854 |
Numerical Inference | 23 | 0 | 846 |
Spatial Relationship | 16 | 1 | 2,239 |
Text Recognition | 8 | 0 | 2,073 |
Total | 8,439 | 8,164 | 17,932 |
ComprehendEdit focuses on evaluating the edited model on in-domain samples, as shown in the following figure:
Here are some samples of ComprehendEdit:
Q, G, P, S, and C stand for Question, Ground-truth, Prediction, Source, and task Category, respectively.
The dataset is organized as follows:
|——ComprehendEdit/
| |——GQA/
| | |——images/
| | | |——21.jpg
| | | |——...
| |——MathVista/
| | |——images/
| |——TallyQA/
| | |——VG_100K/
| | |——VG_100K_2/
| |——TextVQA/
| | |——train_images/
| |——VSR/
| | |——images/
| |——val2014/
|——ComprehendEdit_train.json
|——ComprehendEdit_test.json
|——ComprehendEdit_ori_right.json
The format of each sample in the test set is:
[{
"image": "GQA/images/2405722.jpg",
"question": "What is this bird called?",
"rephrase": "What is the bird's name?", # for Text-Generality
"answer": "parrot",
"source": "GQA",
"Category": "object recognition",
"pid": 0,
"img_topk": [...], # pid of the image topk nearest samples in test set
"txt_topk": [...], # pid of the text topk nearest samples in test set
"img_last_topk": [...], # pid of the image topk farthest samples in test set
"txt_last_topk": [...], # pid of the text topk farthest samples in test set
"ori_rt_img_topk": [...], # pid of the image topk nearest samples in ComprehendEdit_ori_right.json
"ori_rt_txt_topk": [...], # pid of the text topk nearest samples in ComprehendEdit_ori_right.json
"ori_rt_img_last_topk": [...], # pid of the image topk farthest samples in ComprehendEdit_ori_right.json
"ori_rt_txt_last_topk": [...], # pid of the text topk farthest samples in ComprehendEdit_ori_right.json
"locality_prompt": "when does twice upon a time come out", # for Text-Locality
"locality_ground_truth": "...",
"multimodal_locality_image": "...", # for Multimodal-Locality
"multimodal_locality_prompt": "...",
"multimodal_locality_ground_truth": "..."}, ...]
The details of ComprehendEdit are shown in the following table:
Task | Train | Test | Source |
---|---|---|---|
Object Recognition | 2,227 | 735 | GQA |
Object Attributes | 2,282 | 705 | GQA |
Object Counting | 1,506 | 503 | TallyQA |
Object Existence | 1,471 | 491 | GQA |
Scene Information | 2,067 | 787 | GQA |
Numerical Inference | 634 | 212 | MathVista |
Spatial Relationship | 1,709 | 530 | VSR |
Text Recognition | 1,554 | 519 | TextVQA |
Total | 13,450 | 4,482 | |
The ratio of training data to test data in each task is approximately 3:1. We also use samples from the NQ and OK-VQA datasets to measure text locality (T-L) and multimodal locality (M-L).
This dataset was collected from several benchmarks using BLIP-2 OPT 2.7B and MiniGPT-4 7B. If you want to run other models on ComprehendEdit, we recommend measuring the change in the top-10 predictions on locality samples before and after editing. We will update the results in the coming months.
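The exact protocol is up to you, but one simple way to quantify this is the overlap between the top-10 next-token predictions of the base and edited models on each locality sample. The helper below is a hypothetical sketch; how you obtain the logits depends on your backbone (BLIP-2, MiniGPT-4, ...).

```python
import torch

@torch.no_grad()
def top10_overlap(logits_before: torch.Tensor, logits_after: torch.Tensor) -> float:
    """Fraction of the pre-edit top-10 predictions that survive the edit.

    Both tensors are assumed to be 1-D logits over the vocabulary (e.g. for the
    answer's first token) of a single locality sample.
    """
    before = set(torch.topk(logits_before, k=10).indices.tolist())
    after = set(torch.topk(logits_after, k=10).indices.tolist())
    return len(before & after) / 10.0

# Averaging this overlap over all (multimodal) locality samples gives a score in [0, 1];
# values close to 1 mean the edit barely disturbed unrelated knowledge.
```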
The dataset can be downloaded from Baidu Netdisk or Google Drive. The project is built on EasyEdit. The ComprehendEdit dataset class is located in ComprehendEdit/easyeditor/dataset/ComprehendEdit.py, and you can import it just like E-VQA.
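For example, assuming the class follows the same constructor pattern as EasyEdit's other multimodal datasets (an annotation file plus a loaded hparams config; the YAML path below is only a placeholder), usage could look like this sketch:

```python
# Hypothetical sketch: the constructor signature is assumed to mirror
# EasyEdit's VQADataset-style multimodal datasets (annotation path + config).
from easyeditor import MENDMultimodalTrainingHparams
from easyeditor.dataset.ComprehendEdit import ComprehendEdit

hparams = MENDMultimodalTrainingHparams.from_hparams('hparams/TRAINING/MEND/minigpt4.yaml')
train_ds = ComprehendEdit('ComprehendEdit_train.json', config=hparams)
eval_ds = ComprehendEdit('ComprehendEdit_test.json', config=hparams)
```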
The conda environment is the one provided by EasyEdit for multimodal knowledge editing, and links to the pretrained model weights are provided in VLKEB.
To run the code, you can use the following command:
sh run_multi.sh # or python3 multimodal_edit_our.py
You can change the algorithm name in multimodal_edit_our.py to run other editing methods. For example,
train_HICE(model='blip2', train=True)
trains HICE on BLIP-2 OPT 2.7B. After training, simply set train=False to evaluate the model.
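So a typical train-then-evaluate sequence in multimodal_edit_our.py would be:

```python
train_HICE(model='blip2', train=True)   # train the HICE editor on BLIP-2 OPT 2.7B
train_HICE(model='blip2', train=False)  # then evaluate with the trained editor
```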
You can also change the hyperparameter YAML files in ComprehendEdit/hparams. For example, you can edit ComprehendEdit/hparams/TRAINING/HICE/minigpt4.yaml to choose which GPUs to run on, change the path to the pretrained model, and so on. In the YAML files, gpu_used_id and gpu_split are used to split the model across different GPUs.
If you want to run experiments on a single GPU, set model_parallel=False and gpu_split=[]. If you want to run experiments with other models, add the corresponding model settings in ComprehendEdit/easyeditor/util/tools.py. (Simply using device_map="auto" may cause out-of-memory on the main GPU when the dataset is too large, while spreading the model over too many GPUs wastes resources and takes more time.)
Thanks to EasyEdit for the framework! The samples in ComprehendEdit come from several datasets: GQA, TallyQA, VSR, TextVQA, MathVista, OK-VQA, and NQ. Part of the code references RanPAC. Thanks to all of these outstanding works!
Please cite our paper if you use ComprehendEdit in your work.