Conclusions

We introduce a simple yet effective framework based on multimodal large language models (MLLMs) for referring video object segmentation (RefVOS). Named GLUS, our method establishes unified global and local reasoning in a single LLM, addressing the distinct 'Ref' and 'VOS' challenges of RefVOS. The central design is to provide the MLLM with both global context (context frames) and local context (query frames). This unified global-local reasoning is further enhanced by end-to-end optimization with VOS memory modules, which improves the consistency of GLUS. Finally, GLUS introduces a plug-and-play object contrastive loss and pseudo-labeling for key frame selection, enabling the MLLM to distinguish the correct object and frame within its limited context window. GLUS establishes a new state of the art on RefVOS benchmarks. We hope our baseline can inspire more systematic studies that enable MLLMs to perform fine-grained video understanding.

Citation

Acknowledgements



The website template was borrowed from Michaël Gharbi, Ref-NeRF, and ReconFusion.