Introduction
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods achieve impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, which limits how well results on them generalize to real-world scenarios. To push VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths of MOSEv1 and addressing its limitations, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions.
Figure: Examples of video clips from the coMplex video Object SEgmentation (MOSEv2) dataset. The selected target objects are masked in orange ▇. The most notable features of MOSEv2 include both challenges inherited from MOSEv1, such as disappearance-reappearance of objects (①-⑩), small/inconspicuous objects (①,③,⑥), heavy occlusions, and crowded scenarios (①,②), and newly introduced complexities, including adverse weather conditions (⑥), low-light environments (⑤-⑦), multi-shot sequences (⑧), camouflaged objects (⑤), non-physical objects such as shadows (④), and knowledge dependency (⑨,⑩). The goal of the MOSEv2 dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.
Statistics
Categories including novel and rare objects
Videos of dense scenarios
Objects with complex movement patterns
High-quality mask annotations
Table 1. Statistical comparison between MOSEv2 and existing video object segmentation and tracking datasets.
• “Annotations”: number of annotated masks or boxes.
• “Duration”: the total duration of annotated videos, in minutes unless otherwise noted.
• “Disapp. Rate”: the proportion of objects that disappear in at least one frame.
• “Reapp. Rate”: the proportion of objects that, after disappearing, later reappear (see the illustrative sketch after this list).
• “Distractors”: quantifies scene crowding as the average number of visually similar objects per target in the first frame.
*SA-V uses a combination of manual and automatic annotations.
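For concreteness, the following is a minimal sketch of how per-object disappearance/reappearance flags could be computed from per-frame visibility. It only illustrates the definitions above and is not the official statistics toolkit; the function name and the boolean presence-array input are our own assumptions.

```python
import numpy as np

def disappearance_flags(presence: np.ndarray) -> tuple[bool, bool]:
    """presence[t] is True if the object is visible (has a mask) in frame t.

    Returns (disappears, reappears) for one object: whether it is absent in at
    least one frame after first appearing, and whether it becomes visible again
    after such an absence. Illustrative only, not the official toolkit.
    """
    visible = np.flatnonzero(presence)
    if visible.size == 0:
        return False, False
    first, last = visible[0], visible[-1]
    disappears = bool(np.any(~presence[first:]))          # absent at least once after appearing
    reappears = bool(np.any(~presence[first:last + 1]))   # absent, then visible again later
    return disappears, reappears

# Dataset-level rates are the fraction of annotated objects with each flag set.
```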
Demo
Tasks & Evaluation
We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities.
Dataset
Dataset Download
The dataset is available for non-commercial research purposes only. Please use the following links to download it.
Evaluation
Please submit your results on the validation set here:
We strongly encourage you to evaluate with the MOSEv2 dataset. MOSEv1 is for legacy support only and may be deprecated in the future.
Data
MOSEv2 is a comprehensive video object segmentation dataset designed to advance VOS methods under real-world conditions. It consists of 5,024 videos with 701,976 high-quality masks for 10,074 objects across 200 categories.
- Following DAVIS, we use Region Jaccard J, Boundary F measure F, and their mean J&F as the evaluation metrics (see the illustrative sketch after this list).
- For MOSEv2, a modified Boundary F measure (Ḟ) is used; J&Ḟd and J&Ḟr are employed to evaluate results on disappearance and reappearance clips, respectively.
- For the validation sets, the first-frame annotations are released to indicate the objects that are considered in evaluation.
- The test set online evaluation server will be open during the competition period only.
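As a rough illustration of J and F (not the official evaluation code, and not the modified Ḟ used by MOSEv2), the sketch below computes the region Jaccard as mask IoU and a simplified boundary F measure by matching boundary pixels within a fixed pixel tolerance; the official DAVIS-style toolkit instead uses a tolerance proportional to the image diagonal.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def region_jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def _boundary(mask: np.ndarray) -> np.ndarray:
    """Inner boundary: mask pixels adjacent to at least one background pixel."""
    return mask & binary_dilation(~mask)

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Simplified boundary F: harmonic mean of boundary precision and recall,
    where a boundary pixel counts as matched if it lies within `tol` pixels
    of a boundary pixel in the other mask."""
    pred_b, gt_b = _boundary(pred.astype(bool)), _boundary(gt.astype(bool))
    if pred_b.sum() == 0 and gt_b.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (pred_b & binary_dilation(gt_b, struct)).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & binary_dilation(pred_b, struct)).sum() / max(gt_b.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# J&F for one object in one frame is the mean of the two scores; the benchmark
# averages over all annotated objects and frames.
```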
To ensure fair comparison among participants and avoid leaderboard overfitting through repeated trial-and-error, the test set is only available during official competition periods. Please note that for each competition, the released testing videos are randomly sampled from the test set, and will not remain the same across different competitions. This further ensures fairness and prevents overfitting to a fixed set.
Data Structure
train/valid.tar.gz
│
├── Annotations
│ ├── video_name_1
│ │ ├── 00000.png
│ │ ├── 00001.png
│ │ └── ...
│ └── video_name_...
│ └── ...
│
└── JPEGImages
├── video_name_1
│ ├── 00000.jpg
│ ├── 00001.jpg
│ └── ...
└── video_name_...
└── ...
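To get started with the files above, here is a minimal loading sketch in Python. It assumes DAVIS/YouTube-VOS-style conventions: JPEG frames under JPEGImages/<video_name> and palette-indexed PNG masks under Annotations/<video_name>, where each mask pixel stores an object ID (0 = background). The function name is ours, and the palette assumption should be verified against the released data.

```python
import os

import numpy as np
from PIL import Image

def load_video(root: str, video_name: str):
    """Load RGB frames and, where available, annotation masks for one video."""
    frame_dir = os.path.join(root, "JPEGImages", video_name)
    mask_dir = os.path.join(root, "Annotations", video_name)

    frames, masks = [], []
    for name in sorted(os.listdir(frame_dir)):
        stem, _ = os.path.splitext(name)
        frames.append(np.array(Image.open(os.path.join(frame_dir, name)).convert("RGB")))
        mask_path = os.path.join(mask_dir, stem + ".png")
        # Not every frame necessarily has a released mask
        # (e.g., only first-frame annotations are released for the validation set).
        masks.append(np.array(Image.open(mask_path)) if os.path.exists(mask_path) else None)
    return frames, masks

# Example usage on a hypothetical video folder:
frames, masks = load_video("train", "video_name_1")
first_frame_ids = np.unique(masks[0])[1:]  # object IDs in the first frame, dropping 0 (background)
```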
People

Henghui Ding
Fudan University
Kaining Ying
Fudan University
Chang Liu
SUFE
Shuting He
SUFE
Xudong Jiang
NTU
Yu-Gang Jiang
Fudan University
Philip H.S. Torr
University of Oxford
Song Bai
ByteDance
Citation
Please consider citing MOSE if it helps your research.
@article{MOSEv2,
title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
journal={arXiv preprint arXiv:2508.05630},
year={2025}
}
@inproceedings{MOSE,
title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
booktitle={ICCV},
year={2023}
}
Our related works on video segmentation:
@article{MeViSv2,
title={MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation},
author={Ding, Henghui and Liu, Chang and He, Shuting and Ying, Kaining and Jiang, Xudong and Loy, Chen Change and Jiang, Yu-Gang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
publisher={IEEE}
}
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}