Introduction
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods achieve impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, which limits how well results on them generalize to real-world scenarios. To push VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths of MOSEv1 and addressing its limitations, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions.
Figure: Examples of video clips from the coMplex video Object SEgmentation (MOSEv2) dataset. The selected target objects are masked in orange ▇. The most notable features of MOSEv2 include both challenges inherited from MOSEv1, such as disappearance-reappearance of objects (①-⑩), small/inconspicuous objects (①,③,⑥), heavy occlusions, and crowded scenarios (①,②), and newly introduced complexities, including adverse weather conditions (⑥), low-light environments (⑤-⑦), multi-shot sequences (⑧), camouflaged objects (⑤), non-physical objects such as shadows (④), and knowledge dependency (⑨,⑩). The goal of the MOSEv2 dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.
Statistics
Categories including novel and rare objects
Videos of dense scenarios
Objects with complex movement patterns
High-quality mask annotations
Table 1. Statistical comparison between MOSEv2 and existing video object segmentation and tracking datasets.
• “Annotations”: number of annotated masks or boxes.
• “Duration”: the total duration of annotated videos, in minutes unless otherwise noted.
• “Disapp. Rate”: the proportion of objects that disappear in at least one frame.
• “Reapp. Rate”: the proportion of objects that, after disappearing, later reappear (see the illustrative sketch after this list).
• “Distractors”: quantifies scene crowding as the average number of visually similar objects per target in the first frame.
*SA-V uses a combination of manual and automatic annotations.
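For concreteness, the following is a minimal sketch of how per-object disappearance/reappearance flags could be computed from per-frame visibility. It only illustrates the definitions above and is not the official statistics toolkit; the function name and the boolean presence-array input are our own assumptions.

```python
import numpy as np

def disappearance_flags(presence: np.ndarray) -> tuple[bool, bool]:
    """presence[t] is True if the object is visible (has a mask) in frame t.

    Returns (disappears, reappears) for one object: whether it is absent in at
    least one frame after first appearing, and whether it becomes visible again
    after such an absence. Illustrative only, not the official toolkit.
    """
    visible = np.flatnonzero(presence)
    if visible.size == 0:
        return False, False
    first, last = visible[0], visible[-1]
    disappears = bool(np.any(~presence[first:]))          # absent at least once after appearing
    reappears = bool(np.any(~presence[first:last + 1]))   # absent, then visible again later
    return disappears, reappears

# Dataset-level rates are the fraction of annotated objects with each flag set.
```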
Demo
Tasks & Evaluation
We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities.
Dataset
Dataset Download
The dataset is available for non-commercial research purposes only. Please use the following links to download it.
Evaluation
Please submit your results on the validation set here:
We strongly encourage you to evaluate with the MOSEv2 dataset. MOSEv1 is for legacy support only and may be deprecated in the future.
Data
MOSEv2 is a comprehensive video object segmentation dataset designed to advance VOS methods under real-world conditions. It consists of 5,024 videos with 701,976 high-quality masks for 10,074 objects across 200 categories.
- Following DAVIS, we use Region Jaccard J, Boundary F measure F, and their mean J&F as the evaluation metrics (see the illustrative sketch after this list).
- For MOSEv2, a modified Boundary F measure (Ḟ) is used; J&Ḟd and J&Ḟr are employed to evaluate results on disappearance and reappearance clips, respectively.
- For the validation sets, the first-frame annotations are released to indicate the objects that are considered in evaluation.
- The test set online evaluation server will be open during the competition period only.
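As a rough illustration of J and F (not the official evaluation code, and not the modified Ḟ used by MOSEv2), the sketch below computes the region Jaccard as mask IoU and a simplified boundary F measure by matching boundary pixels within a fixed pixel tolerance; the official DAVIS-style toolkit instead uses a tolerance proportional to the image diagonal.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def region_jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def _boundary(mask: np.ndarray) -> np.ndarray:
    """Inner boundary: mask pixels adjacent to at least one background pixel."""
    return mask & binary_dilation(~mask)

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Simplified boundary F: harmonic mean of boundary precision and recall,
    where a boundary pixel counts as matched if it lies within `tol` pixels
    of a boundary pixel in the other mask."""
    pred_b, gt_b = _boundary(pred.astype(bool)), _boundary(gt.astype(bool))
    if pred_b.sum() == 0 and gt_b.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (pred_b & binary_dilation(gt_b, struct)).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & binary_dilation(pred_b, struct)).sum() / max(gt_b.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# J&F for one object in one frame is the mean of the two scores; the benchmark
# averages over all annotated objects and frames.
```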
To ensure fair comparison among participants and avoid leaderboard overfitting through repeated trial-and-error, the test set is only available during official competition periods. Please note that for each competition, the released testing videos are randomly sampled from the test set, and will not remain the same across different competitions. This further ensures fairness and prevents overfitting to a fixed set.
Data Structure
train/valid.tar.gz
│
├── Annotations
│ ├── video_name_1
│ │ ├── 00000.png
│ │ ├── 00001.png
│ │ └── ...
│ └── video_name_...
│ └── ...
│
└── JPEGImages
├── video_name_1
│ ├── 00000.jpg
│ ├── 00001.jpg
│ └── ...
└── video_name_...
└── ...
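To get started with the files above, here is a minimal loading sketch in Python. It assumes DAVIS/YouTube-VOS-style conventions: JPEG frames under JPEGImages/<video_name> and palette-indexed PNG masks under Annotations/<video_name>, where each mask pixel stores an object ID (0 = background). The function name is ours, and the palette assumption should be verified against the released data.

```python
import os

import numpy as np
from PIL import Image

def load_video(root: str, video_name: str):
    """Load RGB frames and, where available, annotation masks for one video."""
    frame_dir = os.path.join(root, "JPEGImages", video_name)
    mask_dir = os.path.join(root, "Annotations", video_name)

    frames, masks = [], []
    for name in sorted(os.listdir(frame_dir)):
        stem, _ = os.path.splitext(name)
        frames.append(np.array(Image.open(os.path.join(frame_dir, name)).convert("RGB")))
        mask_path = os.path.join(mask_dir, stem + ".png")
        # Not every frame necessarily has a released mask
        # (e.g., only first-frame annotations are released for the validation set).
        masks.append(np.array(Image.open(mask_path)) if os.path.exists(mask_path) else None)
    return frames, masks

# Example usage on a hypothetical video folder:
frames, masks = load_video("train", "video_name_1")
first_frame_ids = np.unique(masks[0])[1:]  # object IDs in the first frame, dropping 0 (background)
```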
People

Henghui Ding
Fudan University
Kaining Ying
Fudan University
Chang Liu
SUFE
Shuting He
SUFE
Xudong Jiang
NTU
Yu-Gang Jiang
Fudan University
Philip H.S. Torr
University of Oxford
Song Bai
ByteDance
Citation
Please consider citing MOSE if it helps your research.
@article{MOSEv2,
title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
journal={arXiv preprint arXiv:2508.05630},
year={2025}
}
@inproceedings{MOSE,
title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
booktitle={ICCV},
year={2023}
}
Our related works on video segmentation:
@article{MeViSv2,
title={MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation},
author={Ding, Henghui and Liu, Chang and He, Shuting and Ying, Kaining and Jiang, Xudong and Loy, Chen Change and Jiang, Yu-Gang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
publisher={IEEE}
}
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}