ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
ICCV 2025
We re-purpose Text-to-Video diffusion models to segment any spatio-temporal entity given a referring text.
Hover over any GIF to display the input text prompt for the predicted mask
Zero-shot results on our eval-only benchmark Ref-VPS (first 4 rows) and other interesting samples.
Abstract
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method capitalizes on visual-language representations learned by video diffusion models on Internet-scale datasets. A key insight of our approach is preserving as much of the generative model's original representation as possible, while fine-tuning it on narrow-domain Referring Object Segmentation datasets. As a result, our framework can accurately segment and track rare and unseen objects, despite being trained on object masks from a limited set of categories. Additionally, it can generalize to non-object dynamic concepts, such as waves crashing in the ocean, as demonstrated in our newly introduced eval-only benchmark for Referring Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while outperforming them by up to twelve points in terms of region similarity on out-of-domain data, leveraging the power of Internet-scale pre-training.
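The core idea above — keeping a pretrained text-to-video diffusion backbone largely intact and fine-tuning a lightweight head on referring-segmentation data — can be illustrated with a minimal sketch. This is not the authors' actual architecture; `REMSketch`, the stand-in backbone, and the conditioning scheme are all hypothetical placeholders for illustration only.

```python
import torch
import torch.nn as nn

class REMSketch(nn.Module):
    """Hypothetical sketch: text-conditioned video features decoded into masks.

    In the real method the backbone would be a pretrained text-to-video
    diffusion model whose representation is preserved during fine-tuning;
    here a single Conv3d stands in for it.
    """

    def __init__(self, feat_dim=64, hidden=32):
        super().__init__()
        # Stand-in for the pretrained diffusion backbone (kept close to its
        # original weights in the paper's approach).
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        # Lightweight mask head, fine-tuned on narrow-domain referring
        # object segmentation data.
        self.mask_head = nn.Sequential(
            nn.Conv3d(feat_dim, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv3d(hidden, 1, kernel_size=1),
        )

    def forward(self, video, text_emb):
        # video: (B, 3, T, H, W); text_emb: (B, feat_dim) from a text encoder.
        feats = self.backbone(video)
        # Naive text conditioning: broadcast the referring-text embedding
        # over every spatio-temporal location.
        feats = feats + text_emb[:, :, None, None, None]
        # Per-pixel, per-frame mask probabilities in [0, 1].
        return self.mask_head(feats).sigmoid()
```

Because the heavy lifting is delegated to Internet-scale pretrained features, only the small head (and a gentle fine-tune of the backbone) needs in-domain supervision, which is what lets the model generalize to rare objects and non-object processes.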
Visual Comparisons - SOTA vs REM (Ours)
Below we show examples from three datasets (Ref-VPS, VSPW, BURST), comparing three methods (UNINEXT, VD-IT, and REM, ours).
BURST results from UNINEXT
BURST results from VD-IT
BURST results from REM (Ours)
VSPW (stuff) results from UNINEXT
VSPW (stuff) results from VD-IT
VSPW (stuff) results from REM (Ours)
Ref-VPS results from UNINEXT
Ref-VPS results from VD-IT
Ref-VPS results from REM (Ours)
Results on highly challenging fighting scenes
Compared to the state of the art, our method REM is much better at consistently segmenting the referred entity through frequent occlusions, point-of-view changes, and distortions.
UNINEXT
VD-IT
REM (Ours)