HOME
ABOUT
- RESULTS
- differences
- BENEFITS
- HISTORY
- TEAM
- LOCATION
- FACILITIES
- BANKING
- MEMBERSHIPS
- APPROVALS
- LICENCES
- SUPPLIERS
- SPONSORSHIPS
- MEDIA
- PRIVACY
AUCTIONS
SHIPPING
FEES
- TS REWARDS
TOOLS
guides
FAQ
CONTACT
- CONNECT

VEHICLES
BRAND
- JAPANESE CARS
  - DAIHATSU
  - EUNOS
  - FORD
  - HONDA
  - ISUZU
  - LEXUS
  - MAZDA
  - MITSUBISHI
  - MITSUOKA
  - NISSAN
  - SUBARU
  - SUZUKI
  - TOYOTA
- GERMAN CARS
- AMERICAN CARS
- BRITISH CARS
- ITALIAN CARS
- FRENCH CARS
- SWEDISH CARS
- KOREAN CARS
TYPE
- mobility
- VENDING
- instruction
- TAXIS
- AMBULANCES
- FIRE ENGINES
- HEARSES
- LIMOUSINES
- COMMERCIAL
CLASS
FUEL
TRUCKS
minitrucks
- DAIHATSU
- HONDA
- MAZDA
- MITSUBISHI
- NISSAN
- SUBARU
- SUZUKI
- DUMP
- CRANE
- CAMPER
- REFRIGERATED
- 4WD
- NEW
BUSES
MOTORHOMES
- YAHOO!
- RAKUTEN
- DEALER

PARTS
- FREE REPORT
- PARTS CONTAINERS
- PARTS SYSTEMS
- PARTS PROTECTION
- BODY SHELLS
- DISMANTLING
- ONLINE PARTS
- NEW PARTS
- INTERIOR PARTS
- EXTERIOR PARTS
  - BONNETS
  - BUMPERS
  - GRILLES
  - FENDERS
  - DOORS
  - TRUNKS
  - SPOILERS
  - LIGHTS
  - EMBLEMS
  - CAMERAS
- ENGINES
- TRANSMISSIONS
- WHEELS & TYRES
  - WHEELS
  - TYRES
CUTS
PERFORMANCE PARTS
TRUCK PARTS
MOTORBIKE PARTS
- MOTORBIKE ENGINES
- MOTORBIKE ACCESSORIES

MOTORBIKES
MARINE
FORKLIFTS
MACHINERY
AGRICULTURAL
OTHER
COUNTRY
- AUSTRALIA
- CANADA
- KENYA
- MYANMAR
- NEW ZEALAND
- PAKISTAN
- TANZANIA
- UNITED STATES

CARVIEW

MOTORHOMES

Select Language

HTTP/2 200 server: GitHub.com content-type: text/html; charset=utf-8 last-modified: Fri, 11 Apr 2025 02:24:24 GMT access-control-allow-origin: * strict-transport-security: max-age=31556952 etag: W/"67f87d58-649b" expires: Tue, 30 Dec 2025 05:45:17 GMT cache-control: max-age=600 content-encoding: gzip x-proxy-cache: MISS x-github-request-id: F717:1F53DD:9A40AD:AD5195:69536491 accept-ranges: bytes age: 0 date: Tue, 30 Dec 2025 05:35:17 GMT via: 1.1 varnish x-served-by: cache-bom-vanm7210080-BOM x-cache: MISS x-cache-hits: 0 x-timer: S1767072917.319830,VS0,VE207 vary: Accept-Encoding x-fastly-request-id: a5531eca0eab7876fc37170391e046a3352eb759 content-length: 5365 GLUS

GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

Lang Lin *
Xueyang Yu *
Ziqi Pang *
Yu-Xiong Wang

University of Illinois Urbana-Champaign
* Equal Contribution

arXiv Code

TL;DR

We propose GLUS, unify the distinct challenges of Referring Video Object Segmentation, "ref" and "vos", into a simple framework for MLLMs.

Abstract

This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLM-based methods commonly struggle with the dilemma between 'Ref' and 'VOS': they either specialize in understanding a few key frames (global reasoning) or tracking objects on continuous frames (local reasoning), and rely on external VOS or frame selectors to mitigate the other end of the challenge. However, our framework GLUS shows that Global and Local consistency can be Unified into a single video Segmentation MLLM: a set of sparse 'context frames' provides global information, while a stream of continuous 'query frames' conducts local object tracking. This is further supported by jointly training the MLLM with a pre-trained VOS memory bank to simultaneously digest short-range and long-range temporal information. To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false-positive objects and a self-refined framework to identify crucial frames and perform propagation. By collectively integrating these insights, our GLUS delivers a simple yet effective baseline, achieving new state-of-the-art for MLLMs on the MeViS and Ref-Youtube-VOS benchmark.

GLUS: Global-Local Unified Reasoning Framework for MLLMs

We demonstrate that unifying global and local reasoning into a single MLLM for RefVOS through the design of context and query frames constitutes a simple yet effective baseline method for MLLM-based RefVOS models. We further introduce plug-and-play object contrastive loss and self-refinement with key frame selectors, enabling MLLM to focus on the correct objects and most relevant frames.

Quantiative Results

Our GLUS trained with only RefVOS datasets realizes competitive performance among MLLM-based approaches with datasets from diverse tasks. With expanded training datasets (ED), our GLUS achieves state-of-the-art performance across RefVOS benchmarks, especially in MeViS consisting of complex video scenarios.

Method	MeViS			Ref-Youtube-VOS
Method	J&F	J	F	J&F	J	F
Methods without LLMs
URVOS	27.8	25.7	29.9	47.2	45.2	49.1
LBDT	29.3	27.8	30.8	49.4	48.2	50.6
MTTR	30.0	28.8	31.2	55.3	54.0	56.6
ReferFormer	31.0	29.8	32.2	62.9	61.3	64.6
OnlineRefer	-	-	-	63.5	61.6	65.5
SOC	-	-	-	67.3	65.3	69.3
TempCD	-	-	-	65.8	63.6	68.0
LoSh	-	-	-	64.2	62.5	66.0
LMPM	37.2	34.2	40.2	-	-	-
DsHmp	46.4	43.0	49.8	67.1	65.0	69.1
Methods with LLMs
LISA-7B	37.2	35.1	39.4	53.9	53.4	54.3
LISA-13B	37.9	35.8	40.0	54.4	54.0	54.8
TrackGPT-7B	40.1	37.6	42.6	56.4	55.3	57.4
TrackGPT-13B	41.2	39.2	43.1	59.5	58.1	60.8
VideoGLAMM	45.2	48.1	48.2	-	-	-
VideoLISA-3.8B	44.4	41.3	47.6	63.7	61.7	65.7
VISA-7B	43.5	40.7	46.3	61.5	59.8	63.2
VISA-13B	44.5	41.8	47.1	63.0	61.4	64.7
ViLLa	-	-	-	66.5	64.6	68.6
GLUS (ours)	50.3	47.5	53.2	66.6	65.0	68.3
GLUS (ours) (ED)	51.3	48.5	54.2	67.3	65.5	69.0

Qualitative Results

We provide qualitative comparisons between the previous state-of-the-art (DsHmp) and our GLUS (without extedning datasets) with the videos in MeViS. Notably, these exam- ples illustrate three challenging aspects of RefVOS: (1) Mo- tion Understanding: RefVOS models have to distinguish similar objects with their motions; (2) Global Reasoning: RefVOS models should be capable of using global reason- ing to segment the objects presented only in a short video clip; (3) Vision-Language Reasoning: RefVOS models should perform vision-language unified reasoning in com- plex scenarios. The examples demonstrate that our GLUS effectively tackles RefVOS in challenging language- guided segmentation cases.

Elephant Crushed by another elephant's trunk

Ground Truth

Baseline

GLUS

The panda that has stayed in place with little movement

Ground Truth

Baseline

GLUS

The panda that took a few steps to the left

Ground Truth

Baseline

GLUS

White car move and turn left

Ground Truth

Baseline

GLUS

Person standing behind little girl feeding rabbit

Ground Truth

Baseline

GLUS

Plane moves slower

Ground Truth

Baseline

GLUS

Conclusions

We introduce a simple yet effective framework based on multimodal large language models (MLLM) for referring video object segmentation (RefVOS). Named GLUS, our method establishes unified global and local reasoning in a single LLM, addressing the distinct 'Ref' and 'VOS' challenges of RefVOS. The central design is to provide MLLM with both global (context frames) and local (query frames) contexts. Such unified global-local reasoning is further enhanced with end-to-end optimization with VOS memory modules, which improves the consistency of GLUS. Finally, GLUS introduces plug-and-play object contrastive loss and pseudo-labeling for key frame selection, enabling the MLLM to distinguish the correct object and frame with its limited context window. Our GLUS establishes the new state of the arts on RefVOS benchmarks. We hope our baseline can inspire more systematic studies enabling MLLMs to fine-grained video understanding.

Citation

@inproceedings{lin2025glus, title={GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation}, author={Lin, Lang and Yu, Xueyang and Pang, Ziqi and Wang, Yu-Xiong}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, year={2025} }

Acknowledgements

The website template was borrowed from Michaël Gharbi, Ref-NeRF, and ReconFusion.

HOME
ABOUT
AUCTIONS
SHIPPING
FEES
TOOLS
HOW
FAQ
CONTACT

Original Source | Taken Source