GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang¹  Ziqiao Ma¹  Xiaofeng Gao²  Suhaila Shakiah²  Qiaozi Gao²  Joyce Chai¹
¹University of Michigan  ²Amazon AGI
CVPR 2024
Summary and Highlight (TL;DR)
We present GROUNDHOG, a multimodal large language model developed by grounding large language models to holistic segmentation.
GROUNDHOG is flexible and diagnosable, reduces object hallucination, and works plug-and-play with any segmentation foundation model (e.g., SAM).
GROUNDHOG: Grounding LLMs to Holistic Segmentation
Model Architecture
Key Idea: GROUNDHOG formulates the grounding process as an entity segment selection problem, which involves
(1) proposing entity segmentation masks where the masks encapsulate regions with discernible semantic content, and
(2) recognizing the retrieved entities through the understanding of both visual and language context.
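The selection view above can be illustrated with a minimal sketch. It assumes each groundable phrase and each entity mask proposal already has an embedding, that selection uses a sigmoid over dot-product scores, and that the selected masks are merged by a pixel-wise union. The function name `select_and_merge`, the `threshold` parameter, and the scoring rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_and_merge(phrase_vec, entity_vecs, masks, threshold=0.5):
    """Score entity mask proposals against one phrase and merge the selected ones.

    phrase_vec:  (d,)      embedding of a groundable phrase (hypothetical input)
    entity_vecs: (n, d)    one embedding per entity mask proposal
    masks:       (n, H, W) binary entity masks
    Returns per-entity scores of shape (n,) and a merged boolean mask (H, W).
    """
    logits = entity_vecs @ phrase_vec            # similarity of each entity to the phrase
    scores = 1.0 / (1.0 + np.exp(-logits))       # sigmoid: independent selection per entity
    selected = scores > threshold                # which proposals the phrase refers to
    if selected.any():
        merged = np.any(masks[selected].astype(bool), axis=0)  # pixel-wise union
    else:
        merged = np.zeros(masks.shape[1:], dtype=bool)          # nothing grounded
    return scores, merged
```

Because every phrase comes with an explicit score per proposal, a wrong grounding can be traced to either a missing proposal or a wrong selection, which is one way to read the diagnosability claim.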
Details: GROUNDHOG incorporates a masked feature extractor that takes an input image and a set of class-agnostic entity mask proposals,
and converts each mask's features into visual entity tokens for an MLLM backbone.
This MLLM then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks.
To enable holistic entity mask proposals, our default mask proposal model is an enhanced Mask2Former
with 50 additional queries each for segmenting parts and text regions, alongside the original 200 entity queries.
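As a rough illustration of how mask proposals could become visual entity tokens, the sketch below averages a dense feature map over each mask. Masked average pooling, the function name `mask_pool`, and the array layout are assumptions made for this example; the actual extractor and the projection into the MLLM embedding space may differ.

```python
import numpy as np

def mask_pool(features, masks, eps=1e-6):
    """Average image features inside each entity mask.

    features: (H, W, C) dense feature map from a vision encoder (hypothetical)
    masks:    (n, H, W) binary class-agnostic mask proposals
    Returns an (n, C) array with one pooled feature vector per mask.
    """
    m = masks.astype(features.dtype)
    summed = np.einsum("nhw,hwc->nc", m, features)   # sum of features inside each mask
    area = m.sum(axis=(1, 2))[:, None]               # number of pixels per mask
    return summed / (area + eps)                     # eps guards against empty masks

# With the default proposal model described above, n would be
# 200 entity + 50 part + 50 text-region queries = 300 proposals.
```

Each pooled row would then be mapped to a token for the MLLM backbone, so the number of visual entity tokens tracks the number of proposals rather than the image resolution.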
Pointer Input
M3G2: Dataset for Visually Grounded Instruction Tuning
Results and Applications
Grounded image captioning.
Referential expression segmentation.
Referential dialogue.
Grounded visual question answering.
Less Hallucination, Diagnosability, and Plug-and-Play with SAM
Less Hallucination
Diagnosability and Explainability
Plug-and-Play with any segmentation foundation model
BibTeX
@inproceedings{zhang2024groundhog,
title={GROUNDHOG: Grounding Large Language Models to Holistic Segmentation},
author={Zhang, Yichi and Ma, Ziqiao and Gao, Xiaofeng and Shakiah, Suhaila and Gao, Qiaozi and Chai, Joyce},
booktitle={Conference on Computer Vision and Pattern Recognition 2024},
year={2024}
}