🏆 "The VQA Series of Challenges" has been recognized with the 2025 Mark Everingham Prize
for "stimulating a new strand of vision and language research". Thank you to the PAMI TC committee!
What is VQA?
VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.
- 265,016 images (COCO and abstract scenes)
- At least 3 questions (5.4 questions on average) per image
- 10 ground truth answers per question
- 3 plausible (but likely incorrect) answers per question
- Automatic evaluation metric
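The automatic metric scores a predicted answer by how many of the 10 human answers it matches: Acc = min(#matches / 3, 1), averaged over the ten leave-one-annotator-out subsets. The following is a minimal Python sketch of that computation; the helper name and plain string matching are illustrative, and the official evaluation code additionally normalizes answers (lowercasing, stripping punctuation and articles) before comparing.

```python
# Sketch of the VQA accuracy metric (assumes answers are already normalized):
# an answer counts as correct in proportion to how many of the 10 annotators
# gave it, Acc = min(#matches / 3, 1), averaged over the ten
# leave-one-annotator-out subsets.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """human_answers is the list of 10 ground truth answers for one question."""
    per_subset = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == predicted for a in others)
        per_subset.append(min(matches / 3.0, 1.0))
    return sum(per_subset) / len(per_subset)

# Example: 6 of 10 annotators said "yes", 3 said "no", 1 said "maybe".
answers = ["yes"] * 6 + ["no"] * 3 + ["maybe"]
print(vqa_accuracy("yes", answers))  # 1.0
print(vqa_accuracy("no", answers))   # 0.9
```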
Subscribe to our group for updates!
Dataset
Details on downloading the latest dataset may be found on the download webpage.
- April 2017: Full release (v2.0)
  - 204,721 COCO images (all of current train/val/test)
  - 1,105,904 questions
  - 11,059,040 ground truth answers
- March 2017: Beta v1.9 release
  - 123,287 COCO images (only train/val)
  - 658,111 questions
  - 6,581,110 ground truth answers
  - 1,974,333 plausible answers
  - 31,325 abstract scenes (only train/val)
  - 33,383 questions
  - 333,830 ground truth answers
- October 2015: Full release (v1.0)
  - 204,721 COCO images (all of current train/val/test)
  - 614,163 questions
  - 6,141,630 ground truth answers
  - 1,842,489 plausible answers
  - 50,000 abstract scenes
  - 150,000 questions
  - 1,500,000 ground truth answers
  - 450,000 plausible answers
  - 250,000 captions
- July 2015: Beta v0.9 release
  - 123,287 COCO images (all of train/val)
  - 369,861 questions
  - 3,698,610 ground truth answers
  - 1,109,583 plausible answers
- June 2015: Beta v0.1 release
  - 10,000 COCO images (from train)
  - 30,000 questions
  - 300,000 ground truth answers
  - 90,000 plausible answers
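For orientation on the releases above: questions and answers are distributed as JSON files, with each question linked to its image via an image_id and to its 10 human answers via a question_id. The sketch below is a minimal loader; the file names are assumed v2.0 validation files (see the download page for the exact archives), while the field names follow the published annotation format.

```python
import json

# Assumed v2.0 file names; check the download page for the exact paths.
QUESTIONS_FILE = "v2_OpenEnded_mscoco_val2014_questions.json"   # assumption
ANNOTATIONS_FILE = "v2_mscoco_val2014_annotations.json"         # assumption

with open(QUESTIONS_FILE) as f:
    questions = json.load(f)["questions"]        # image_id, question, question_id
with open(ANNOTATIONS_FILE) as f:
    annotations = json.load(f)["annotations"]    # question_id, answers, answer_type, ...

# Index the 10 human answers by question_id so each question can be paired
# with its ground truth answers.
answers_by_qid = {a["question_id"]: [x["answer"] for x in a["answers"]]
                  for a in annotations}

q = questions[0]
print(q["image_id"], q["question"], answers_by_qid[q["question_id"]])
```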
Award
2025 Mark Everingham Prize
for contributions to the Computer Vision community, awarded at ICCV 2025 to
The VQA Series of Challenges
Aishwarya Agrawal, Yash Goyal, Ayush Shrivastava, Dhruv Batra, Devi Parikh and contributors
For stimulating a new strand of vision and language research
Thanks to all the contributors and collaborators (in alphabetical order):
Peter Anderson, Stanislaw Antol, Arjun Chandrasekaran, Prithvijit Chattopadhyay, Xinlei Chen, Abhishek Das, Karan Desai, Sashank Gondala, Khushi Gupta, Drew Hudson, Rishabh Jain, Yash Kant, Tejas Khot, Satwik Kottur, Stefan Lee, Jiasen Lu, Margaret Mitchell, Nirbhay Modhe, Akrit Mohapatra, José M. F. Moura, Vishvak Murahari, Vivek Natarajan, Viraj Prabhu, Marcus Rohrbach, Meet Shah, Amanpreet Singh, Avi Singh, Douglas Summers-Stay, Deshraj Yadav, Peng Zhang, Larry Zitnick.
Papers
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (CVPR 2017)
Download the paper
@InProceedings{balanced_vqa_v2,
author = {Yash Goyal and Tejas Khot and Douglas Summers{-}Stay and Dhruv Batra and Devi Parikh},
title = {Making the {V} in {VQA} Matter: Elevating the Role of Image Understanding in {V}isual {Q}uestion {A}nswering},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017},
}
Yin and Yang: Balancing and Answering Binary Visual Questions (CVPR 2016)
Download the paper
@InProceedings{balanced_binary_vqa,
author = {Peng Zhang and Yash Goyal and Douglas Summers{-}Stay and Dhruv Batra and Devi Parikh},
title = {{Y}in and {Y}ang: Balancing and Answering Binary Visual Questions},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016},
}
VQA: Visual Question Answering (ICCV 2015)
Download the paper
@InProceedings{VQA,
author = {Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh},
title = {{VQA}: {V}isual {Q}uestion {A}nswering},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2015},
}
Videos
Feedback
Any feedback is very welcome! Please send it to visualqa@gmail.com.


