Multiple Object Tracking and Segmentation in Complex Environments
Four challenges in long video, occluded object, diverse motion and open-world
October 24th, 9:00 am (UTC+3), ECCV 2022 Online Workshop
News
[October 22] Technical reports from the top teams in all four challenges are now available! Thanks to all teams for sharing.
[July 11] UVO Challenge is open today! [Dataset Download] [Evaluation Server Image] [Evaluation Server Video]
[July 10] YouTubeVIS: Long Video Challenge is open today! [Dataset Download] [Evaluation Server]
[July 5] OVIS Challenge is open today! [Dataset Download] [Evaluation Server]
[July 4] DanceTrack Challenge is open today! [Dataset Download] [Evaluation Server]
[July 3] Competition Phase 1 is postponed to July 11, 2022 (00:01 am UTC). We apologize for the delay.
Overview
Abstract
Multiple object tracking and segmentation aims to localize and associate objects of interest over time, and serves as a fundamental technology in many practical applications, such as visual surveillance, public security, video analysis, and human-computer interaction.
Computer vision systems now achieve strong performance on simple tracking and segmentation scenes, such as those in the MOT and DAVIS datasets, but they are not as robust as the human vision system, especially in complex environments.
To advance the performance of current vision systems in complex environments, our workshop explores four settings of multiple object tracking and segmentation: (a) long video, (b) occluded object, (c) diverse motion, and (d) open-world.
The four challenges are:
- 4th YouTubeVIS and Long Video Instance Segmentation Challenge
- 2nd Occluded Video Instance Segmentation Challenge
- 1st Multiple People Tracking in Group Dance Challenge
- 2nd Open-World Video Object Detection and Segmentation Challenge
Challenge
YouTubeVIS: Long Video
Video Instance Segmentation (VIS) extends the instance segmentation task from the image domain to the video domain. The problem aims at simultaneous detection, segmentation, and tracking of object instances in videos. We extend VIS with long videos for validation and testing, consisting of:
- 141 additional long videos, 71 in validation and 70 in test
- 259 additional unique video instances with average duration of 49.8s
- 9304 additional high-quality instance masks
The additional long videos (L) are evaluated separately from the previous short videos. We use average precision (AP_L) at different intersection-over-union (IoU) thresholds and average recall (AR_L) as our evaluation metrics. In video instance segmentation, the IoU is computed as the sum of per-frame intersection areas divided by the sum of per-frame union areas across the video. For more details about the dataset, please refer to our paper or website.
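The spatio-temporal IoU described above can be sketched as follows. This is an illustrative NumPy implementation, not the official evaluation code; it assumes each instance track is given as a `(T, H, W)` array of binary masks, with all-zero frames where the instance is absent.

```python
import numpy as np

def video_iou(masks_a, masks_b):
    """Spatio-temporal IoU between two instance-mask tracks.

    masks_a, masks_b: arrays of shape (T, H, W) with binary values,
    one mask per frame (all zeros in frames where the instance is absent).
    The IoU is the intersection area summed over all frames, divided by
    the union area summed over all frames.
    """
    a = np.asarray(masks_a).astype(bool)
    b = np.asarray(masks_b).astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    # Two empty tracks have no overlap by convention.
    return inter / union if union > 0 else 0.0
```

Note that summing before dividing means a frame where one track is present and the other is absent adds to the union but not the intersection, so missing an instance in even a few frames lowers the video-level IoU.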
Dataset Download | Evaluation Server
OVIS
Occluded Video Instance Segmentation is a new large-scale benchmark dataset designed with the philosophy of perceiving object occlusions in videos, which reveals the complexity and diversity of real-world scenes. OVIS consists of:
- 901 videos with severe object occlusions
- 25 commonly seen semantic categories
- 5,223 unique instances with average duration of 10.05s
- 296k high-quality instance masks
We use average precision (AP) at different intersection-over-union (IoU) thresholds and average recall (AR) as our evaluation metrics. In video instance segmentation, the IoU is computed as the sum of per-frame intersection areas divided by the sum of per-frame union areas across the video. For more details about the dataset, please refer to our paper or website.
Dataset Download | Evaluation Server
DanceTrack
DanceTrack is a multi-human tracking dataset with two emphasized properties: (1) uniform appearance: humans have highly similar, almost indistinguishable appearance; (2) diverse motion: humans move in complicated patterns and their relative positions change frequently. DanceTrack consists of:
- 100 videos of group dance: 40 training, 25 validation, and 35 test videos
- 990 unique instances with average duration of 52.9s
- 877k high-quality bounding boxes
We use Higher Order Tracking Accuracy (HOTA) as the main metric, AssA and IDF1 to measure association performance, and DetA and MOTA for detection quality. For more details about the dataset, please refer to our paper or website.
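The structure of the HOTA metric can be sketched as follows. This is a simplified illustration, not the official TrackEval implementation: at each localization threshold alpha, HOTA is the geometric mean of detection accuracy (DetA) and association accuracy (AssA), and the final score averages over a range of thresholds.

```python
import math

def hota_at_alpha(det_a, ass_a):
    # At a single localization threshold alpha, HOTA combines
    # detection accuracy and association accuracy geometrically,
    # so a tracker must do well at both to score well.
    return math.sqrt(det_a * ass_a)

def hota(det_as, ass_as):
    # The final HOTA score averages the per-threshold values over a
    # range of localization thresholds (0.05 to 0.95 in steps of 0.05
    # in the official metric).
    assert len(det_as) == len(ass_as)
    return sum(math.sqrt(d * a) for d, a in zip(det_as, ass_as)) / len(det_as)
```

The geometric mean is why DanceTrack is hard: strong detection alone (high DetA) cannot compensate for the association errors (low AssA) caused by uniform appearance and diverse motion.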
Dataset Download | Evaluation Server
UVO
The Unidentified Video Objects benchmark aims at developing computer vision models that can detect and segment all objects that appear in images or videos, regardless of whether their semantic concepts are known or unknown. UVO's highlights include:
- high-quality instance masks annotated at 30 fps on 1,024 YouTube videos and at 1 fps on 10,337 videos from the Kinetics dataset
- annotations for ALL objects in each video, 13.5 objects per video on average
- 57% of objects not covered by COCO categories
We use average precision (AP) at different intersection-over-union (IoU) thresholds and average recall (AR) as our evaluation metrics. In video instance segmentation, the IoU is computed as the sum of per-frame intersection areas divided by the sum of per-frame union areas across the video. For more details about the dataset, please refer to our paper or website.
Dataset Download | Evaluation Server Image | Evaluation Server Video
Competition Schedule
| Competition | Date |
|---|---|
| Competition Phase 1 (submission of validation results opens) | July 01, 2022 (00:01 am UTC) |
| Competition Phase 2 (submission of test results opens) | September 01, 2022 (00:01 am UTC) |
| Deadline for Submitting the Final Predictions | October 01, 2022 (11:59pm UTC Time) |
| Decisions to Participants | October 05, 2022 (11:59pm UTC Time) |
Top Teams
(* equal contribution)
| Challenge | Rank | Team Name | Team Members | Organization | Technical Report |
|---|---|---|---|---|---|
| YouTubeVIS:Long Video | 1st | IIG | Yong Liu1,2, Jixiang Sun1, Yitong Wang2, Cong Wei1, Yansong Tang1, Yujiu Yang1 | 1Tsinghua Shenzhen International Graduate School, Tsinghua University, 2ByteDance Inc. | IIG |
| YouTubeVIS:Long Video | 2nd | ByteVIS | Junfeng Wu1, Yi Jiang2, Qihao Liu3, Xiang Bai1, Song Bai2 | 1Huazhong University of Science and Technology, 2Bytedance, 3Johns Hopkins University | ByteVIS |
| OVIS | 1st | BeyondSOTA | Fengliang Qi, Jing Xian, Zhuang Li, Bo Yan, Yuchen Hu, Hongbin Wang | Ant Group | BeyondSOTA |
| OVIS | 2nd | IIG | Yong Liu1,2, Jixiang Sun1, Yitong Wang2, Cong Wei1, Yansong Tang1, Yujiu Yang1 | 1Tsinghua Shenzhen International Graduate School, Tsinghua University, 2ByteDance Inc. | IIG |
| DanceTrack | 1st | MOTRv2 | Yuang Zhang1,2, Tiancai Wang1, Weiyao Lin2, Xiangyu Zhang1 | 1MEGVII Technology, 2Shanghai Jiao Tong University | MOTRv2 |
| DanceTrack | 2nd | C-BIoU | Fan Yang, Shigeyuki Odashima, Shoichi Masui, Shan Jiang | Fujitsu Research | C-BIoU |
| DanceTrack | 2nd | mt_iot | Feng Yan, Zhiheng Li, Weixin Luo, Zequn Jie, Fan Liang, Xiaolin Wei, Lin Ma | Meituan | mt_iot |
| DanceTrack | 3rd | DLUT_IIAU | Guangxin Han1, Mingzhan Yang1, Yanxin Liu1, Shiyu Zhu2, Yuzhuo Han2, Xu Jia1, Huchuan Lu1 | 1Dalian University of Technology, 2Honor Device Co.Ltd | DLUT_IIAU |
| UVO | 1st | TAL-BUPT | Jiajun Zhang*1, Boyu Chen*2, Zhilong Ji2, Jinfeng Bai2, Zonghai Hu1 | 1Beijing University of Posts and Telecommunications, 2Tomorrow Advancing Life | TAL-BUPT |
Workshop
Invited Speakers
Workshop Schedule
October 24th, 9:00 am - 1:00 pm (UTC+3)
| Time | Speaker | Topic |
|---|---|---|
| 9:00-9:10 am | Organizers | Welcome |
| 9:10-9:40 am | Invited speaker 1 | Recognizing objects over long time spans and in a large vocabulary |
| 9:40-10:10 am | YouTubeVIS: Long Video winning teams | Solutions for 4th YouTubeVIS and Long Video Instance Segmentation Challenge |
| 10:10-10:40 am | Invited speaker 2 | Learning Robust Multiple Object Tracking and Segmentation |
| 10:40-11:10 am | OVIS winning teams | Solutions for 2nd Occluded Video Instance Segmentation Challenge |
| 11:10-11:20 am | Organizers | Break |
| 11:20-11:50 am | DanceTrack winning teams | Solutions for 1st Multiple People Tracking in Group Dance Challenge |
| 11:50-12:20 pm | UVO winning teams | Solutions for 2nd Open-World Video Object Detection and Segmentation Challenge |
| 12:20-1:00 pm | Organizers | Closing |
Organizers