In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features
over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.
We release the extracted features and reproducible code here.
Specifically, we develop our methods in two scenarios: (1) direct task-specific fine-tuning; and (2) Vision-and-Language pre-training.
Please see the corresponding code directory for full details.
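For reference, below is a minimal sketch of how CLIP visual features can be extracted with the openai/CLIP package. The backbone name, the example image path, and the use of the pooled image embedding (rather than grid features) are illustrative assumptions, not the exact settings used to produce the released features.

```python
# Minimal sketch: extracting CLIP visual features with the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
# Backbone choice and pooling are illustrative; the released features in this
# repo may use different backbones and grid-level features.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # e.g. "RN50" or "ViT-B/32"

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    # Pooled image embedding; task-specific models typically consume
    # spatial (grid) features from the visual backbone instead.
    image_features = model.encode_image(image)
print(image_features.shape)
```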
Note that with direct fine-tuning, on Visual Question Answering (VQA 2.0 test-dev) we achieve up to 68.37% accuracy with Pythia and 74.01% accuracy with MCAN, generally more than a 4.0% improvement in accuracy.
For Image Captioning on Karpathy's test split of MS COCO, we obtain a 2.1% improvement in CIDEr over ResNet alternatives.
For Navigation, we obtain a 5% improvement in nDTW (the main metric for RxR) on RxR, and about a 6% improvement in accuracy over our strong baselines on R2R.
CLIP-ViL-Pretrain
To test the potential of combining CLIP pre-training with Vision-and-Language pre-training, we introduce CLIP-ViL-Pretrain, a vision-and-language model
pre-trained on aligned image-text data with a CLIP visual encoder as its visual backbone. CLIP-ViL-Pretrain is pre-trained with a reconstructive objective and an image-text matching objective, and is further fine-tuned on the VQA, SNLI-VE, and GQA tasks.
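As a rough illustration of the image-text matching objective (not the repository's actual implementation), the sketch below scores a pooled cross-modal representation with a binary classification head, treating aligned image-text pairs as positives and mismatched pairs as negatives; the module and tensor names are hypothetical.

```python
# Illustrative sketch of an image-text matching (ITM) head: a binary
# classifier over a pooled cross-modal representation. Names and shapes
# are hypothetical and not taken from this repository.
import torch
import torch.nn as nn

class ITMHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)  # matched vs. mismatched

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        # pooled_features: (batch, hidden_dim) from the cross-modal encoder
        return self.classifier(pooled_features)

# Example: positives are aligned pairs, negatives are pairs where the
# text has been swapped within the batch.
head = ITMHead()
pooled = torch.randn(8, 768)        # placeholder pooled features
labels = torch.randint(0, 2, (8,))  # 1 = matched, 0 = mismatched
loss = nn.functional.cross_entropy(head(pooled), labels)
```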
Please see the corresponding code directory for full details.
Note that CLIP-ViL-Pretrain achieves 76.48% accuracy on VQA 2.0 test-dev and 76.70% on test-std; 80.61% accuracy on SNLI-VE Dev and 80.20% on Test-P; and 61.42% accuracy on GQA test-dev and 62.93% on test-std.
If you use CLIP-ViL in your research or wish to refer to the baseline results published here,
please use the following BibTeX entry.
@article{shen2021much,
title={How Much Can CLIP Benefit Vision-and-Language Tasks?},
author={Shen, Sheng and Li, Liunian Harold and Tan, Hao and Bansal, Mohit and Rohrbach, Anna and Chang, Kai-Wei and Yao, Zhewei and Keutzer, Kurt},
journal={arXiv preprint arXiv:2107.06383},
year={2021}
}