GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang¹ Ziqiao Ma¹ Xiaofeng Gao² Suhaila Shakiah² Qiaozi Gao² Joyce Chai¹

¹University of Michigan ²Amazon AGI

CVPR 2024

Paper ArXiv BibTex
🤗Data (Raw) 🤗Data (Processed) Code (Coming Soon) 🤗Demo (Coming Soon)


Summary and Highlight (TL;DR)

We present GROUNDHOG, a multimodal large language model developed by grounding large language models to holistic segmentation. GROUNDHOG is flexible and diagnosable, reduces object hallucination, and supports plug-in-and-play use with any segmentation foundation model (e.g., SAM).

GROUNDHOG: Grounding LLMs to Holistic Segmentation

Model Architecture

Key Idea: GROUNDHOG formulates the grounding process as an entity segment selection problem, which involves (1) proposing entity segmentation masks, where the masks encapsulate regions with discernible semantic content, and (2) recognizing the retrieved entities through an understanding of both visual and language context.

Details: GROUNDHOG incorporates a masked feature extractor that takes an input image and a set of class-agnostic entity mask proposals, and converts each mask's features into visual entity tokens for an MLLM backbone. The MLLM then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To enable holistic entity mask proposals, our default mask proposal model is an enhanced Mask2Former with 50 additional queries each for segmenting parts and text regions, alongside the original 200 entity queries.
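As a rough illustration of how a binary mask proposal can be turned into a single entity feature vector, here is a minimal sketch using masked average pooling over a feature map. The pooling operator, the function names (`masked_average_pool`, `entity_tokens`), and the nested-list data layout are assumptions made for this example; the page does not specify how the masked feature extractor is implemented.

```python
from typing import List

Mask = List[List[int]]                  # H x W binary mask (0 or 1)
FeatureMap = List[List[List[float]]]    # H x W x C image features


def masked_average_pool(features: FeatureMap, mask: Mask) -> List[float]:
    """Average the feature vectors of all pixels where the mask is 1.

    Assumption: plain average pooling; the real extractor may differ.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]


def entity_tokens(features: FeatureMap, masks: List[Mask]) -> List[List[float]]:
    """Produce one pooled feature vector per class-agnostic mask proposal."""
    return [masked_average_pool(features, m) for m in masks]


# Tiny 2 x 2 image with 2 feature channels and two mask proposals.
features = [[[1.0, 0.0], [3.0, 0.0]],
            [[5.0, 2.0], [7.0, 4.0]]]
top_row = [[1, 1], [0, 0]]
bottom_row = [[0, 0], [1, 1]]
tokens = entity_tokens(features, [top_row, bottom_row])
```

Because the pooled vector depends only on the image features and the binary mask, a sketch like this works with masks from any proposal model.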

Pointer Input

We introduce a pointer token <PTR>, which refers to a specific point or region in the image input. <PTR> serves as a placeholder token that is replaced by the visual token of the mask proposal that best matches the actual pointer input. For example, a user instruction can be formed as "What is that <PTR>?", which asks the model for a referring expression about a specific region.
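The pointer-to-mask step can be sketched as picking the mask proposal that overlaps most with the pointed region and substituting a reference to it for the placeholder. This is a minimal sketch under stated assumptions: IoU is used as the matching criterion, and the `<ENT_i>` marker and the function names are hypothetical stand-ins for the visual entity token that the model would actually insert.

```python
from typing import List

Mask = List[List[int]]  # H x W binary mask (0 or 1)


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of equal size."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    return inter / union if union else 0.0


def best_matching_proposal(pointer: Mask, proposals: List[Mask]) -> int:
    """Index of the proposal with the highest IoU against the pointer region.

    Assumes at least one proposal; ties go to the earliest proposal.
    """
    return max(range(len(proposals)), key=lambda i: iou(pointer, proposals[i]))


def resolve_pointer(prompt: str, pointer: Mask, proposals: List[Mask]) -> str:
    """Replace the first <PTR> with a hypothetical <ENT_i> marker."""
    index = best_matching_proposal(pointer, proposals)
    return prompt.replace("<PTR>", f"<ENT_{index}>", 1)


proposals = [[[1, 1], [0, 0]],
             [[0, 0], [1, 1]]]
pointer = [[0, 0], [0, 1]]  # the user points at the bottom-right pixel
resolved = resolve_pointer("What is that <PTR>?", pointer, proposals)
```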

M3G2: Dataset for Visually Grounding Instruction Tuning

We introduce the M3G2 dataset for Multi-Modal Multi-Grained Grounding. M3G2 is a comprehensive dataset consisting of 36 sub-problems, derived and augmented from 27 existing datasets with grounded vision-language annotations. The dataset is categorized into four main types: (1) Grounded Image Captioning (GCAP), (2) Referential Expression Segmentation (RES), (3) Grounded Visual Question Answering (GVQA), and (4) Referential Dialogue (RD).


Results and Applications

Grounded image captioning.

Grounded image captioning with short descriptions.

Grounded image captioning with detailed descriptions.

Referential expression segmentation.

Referential dialogue.

Grounded visual question answering.


Less Hallucination, Diagnosability, and Plug-in-and-Play with SAM

Less Hallucination

We assessed object hallucination on the POPE benchmark, which includes binary questions about object existence.

Thanks to the varied task distribution and the inclusion of negative question-answering samples in the M3G2 dataset, GROUNDHOG reduces object hallucination. GROUNDHOG consistently outperforms other models in both accuracy and F1 score across all splits, particularly on the more challenging ones. It shows an absolute improvement of 5.2% in accuracy on the Popular split and 4.0% on the Adversarial split over the previously best-performing model. This suggests that our model's enhanced grounding capability plays a significant role in mitigating the object hallucination problem.
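For readers unfamiliar with how accuracy and F1 are computed on binary existence questions, the following minimal sketch shows the standard definitions. The function name and the toy predictions are illustrative assumptions and are not taken from the POPE benchmark or from the reported results.

```python
from typing import Dict, List


def binary_metrics(predictions: List[bool], labels: List[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1, treating "yes" (True) as positive."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    tn = sum(1 for p, l in zip(predictions, labels) if not p and not l)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: answers to five "Is there a <object> in the image?" questions.
predictions = [True, True, False, False, True]
labels = [True, False, False, True, True]
metrics = binary_metrics(predictions, labels)
```

A model that hallucinates objects answers "yes" too often, which shows up as false positives and lowers precision, and with it both accuracy and F1.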

Diagnosability and Explainability

GROUNDHOG enables diagnosability through the decoupled design of entity proposal and selection. This is exemplified in the accompanying case, which illustrates the mask proposal scoring and selective merging process of our model. We show the top-4 masks, where the higher-score masks are labeled in green and the lower-score masks in red. Users can easily see that the failure is due to the MLLM's inability to recognize the word "KWIK", even though the word was successfully localized and proposed as an entity candidate.
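The scoring and selective merging step can be sketched as follows: each proposal receives a score, proposals above a threshold are merged into one grounding mask, and the top-scoring proposals are listed for inspection. This is a minimal sketch, not the model's actual merging rule: the union-over-threshold merge, the threshold value of 0.5, and the function names are assumptions made for illustration.

```python
from typing import List

Mask = List[List[int]]  # H x W binary mask (0 or 1)


def top_k(scores: List[float], k: int = 4) -> List[int]:
    """Indices of the k highest-scoring proposals, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def merge_selected(masks: List[Mask], scores: List[float],
                   threshold: float = 0.5) -> Mask:
    """Union of all proposals whose score reaches the (assumed) threshold."""
    height, width = len(masks[0]), len(masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        if score < threshold:
            continue  # low-score proposal: proposed but not selected
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    merged[y][x] = 1
    return merged


masks = [[[1, 0], [0, 0]],
         [[0, 1], [0, 0]],
         [[0, 0], [1, 1]]]
scores = [0.9, 0.2, 0.7]
grounding_mask = merge_selected(masks, scores)
ranked = top_k(scores, k=2)
```

Keeping the per-proposal scores alongside the merged mask is what makes a failure traceable: a correct region with a low score points to the selection step, while a missing region points to the proposal model.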

Plug-in-and-Play with any segmentation foundation model

GROUNDHOG supports plug-in-and-play with any segmentation foundation model, e.g., SAM, as the model conditions the entity features solely on the binary masks without using any embeddings from the mask proposal model. For the pointer-to-mask conversion, we show the best-matched mask proposal from our Mask2Former+ model in comparison to the mask from SAM. The SAM-generated mask offers a more precise representation of the specified region, leading to a more accurate caption.

BibTeX


@inproceedings{zhang2024groundhog,
    title={GROUNDHOG: Grounding Large Language Models to Holistic Segmentation},
    author={Zhang, Yichi and Ma, Ziqiao and Gao, Xiaofeng and Shakiah, Suhaila and Gao, Qiaozi and Chai, Joyce},
    booktitle={Conference on Computer Vision and Pattern Recognition 2024},
    year={2024}
}

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, adapted from this repo.
