A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Welcome to the official code repository for the paper "A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models" authored by Noriyuki Kojima, Hadar Averbuch-Elor, and Yoav Artzi.
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and introduce three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates.
Codebase
Installation
Set up the conda environment: conda create -n grounding python=3.8
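After creating the environment, activate it and install the Python dependencies. The sketch below is a minimal setup, assuming the repository ships a requirements.txt; that file name is an assumption, so substitute this repository's actual dependency list:

conda activate grounding
# requirements.txt is assumed here; replace with the repository's actual dependency file
pip install -r requirements.txt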
Citation
If our work aids your research, please cite our paper:
@misc{Kojima2023:grounding,
  title = {A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models},
  author = {Noriyuki Kojima and Hadar Averbuch-Elor and Yoav Artzi},
  year = {2023},
  eprint = {},
  archiveprefix = {arXiv}
}