CoTracker: It is Better to Track Together
Overview
We introduce CoTracker, a transformer-based model that jointly tracks dense points across a video sequence. This differs from most existing state-of-the-art approaches, which track points independently and ignore their correlation. We show that joint tracking yields significantly higher tracking accuracy and robustness.
We also provide several technical innovations, including the concept of virtual tracks, which allows CoTracker to track 70k points jointly and simultaneously. Furthermore, CoTracker operates causally on short windows (hence, it is suitable for online tasks), but is trained by unrolling the windows across longer video sequences, which enables and significantly improves long-term tracking. We demonstrate qualitatively impressive tracking results, where points can be tracked for a long time even when they are occluded or leave the field of view. Quantitatively, CoTracker outperforms all recent trackers on standard benchmarks, often by a substantial margin.
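To make the windowed design concrete, here is a minimal sketch of causal sliding-window tracking, where each window is initialized from the predictions of the previous, overlapping one. This is an illustration only, not the released CoTracker code: track_video, track_window, window_len, and stride are hypothetical names standing in for the model's per-window joint refinement.

import torch

def track_video(video, queries, track_window, window_len=8, stride=4):
    # video: (T, C, H, W) tensor; queries: (N, 3) rows of (t, x, y).
    # track_window(frames, init_tracks) is a hypothetical stand-in for the
    # model's joint per-window refinement; it returns updated (W, N, 2) tracks.
    T = video.shape[0]
    # Start every track at its query location in every frame.
    tracks = queries[:, 1:].expand(T, -1, -1).clone()
    for start in range(0, max(T - window_len, 0) + 1, stride):
        end = min(start + window_len, T)
        # The first window_len - stride frames of each window overlap the
        # previous one, so earlier predictions seed the next refinement.
        tracks[start:end] = track_window(video[start:end], tracks[start:end])
    return tracks  # (T, N, 2) point coordinates for every frame

Because consecutive windows share frames, predictions propagate forward causally: a point occluded in the current window can still be carried along by the estimates made while it was visible.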
Points on a uniform grid
We track points sampled on a regular grid starting from the initial video frame. The colors represent the object (magenta) and the background (cyan).
[Video panels: PIPs | RAFT | TAPIR | CoTracker (Ours)]
For PIPs, many points are tracked incorrectly and end up 'stuck' on the front of the object or at the edge of the image when they become occluded. RAFT predictions are less noisy, but the model fails to handle occlusions, so points are lost or stick to the object. TAPIR predictions are fairly accurate for non-occluded points, but when a point becomes occluded, the model struggles to estimate its position. CoTracker produces cleaner and more 'linear' tracks, which is correct here, as the dominant motion is a homography (the observer does not translate).
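For reference, grid queries like the ones above are straightforward to construct: sample a regular lattice of (x, y) positions in the first frame and attach the frame index. The (t, x, y) query layout and the grid_size parameter below are our assumptions for illustration, not a documented interface.

import torch

def make_grid_queries(height, width, grid_size=10, frame=0):
    # Regular grid_size x grid_size lattice of (x, y) positions.
    ys = torch.linspace(0, height - 1, grid_size)
    xs = torch.linspace(0, width - 1, grid_size)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    t = torch.full((grid_size * grid_size, 1), float(frame))
    # One query per row: (t, x, y) -- the frame where tracking starts,
    # followed by the pixel coordinates of the point.
    return torch.cat([t, gx.reshape(-1, 1), gy.reshape(-1, 1)], dim=1)

queries = make_grid_queries(480, 640)  # (100, 3) queries in frame 0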
Individual points
We track the same queried point with different methods and visualize its trajectory using color encoding based on time. The red cross (❌) indicates the ground truth point coordinates.
[Video panels: TAP-Net | PIPs | RAFT | CoTracker (Ours)]
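The time-based color coding used in these visualizations can be reproduced with a few lines of matplotlib; the snippet below is a sketch under our own conventions (colormap choice, image-style y axis), not the paper's plotting code.

import numpy as np
import matplotlib.pyplot as plt

def plot_track(track, gt=None, cmap="plasma"):
    # track: (T, 2) array of (x, y) positions; gt: optional (x, y) point.
    track = np.asarray(track)
    colors = plt.get_cmap(cmap)(np.linspace(0.0, 1.0, len(track) - 1))
    for i in range(len(track) - 1):
        # Color each segment by its frame index: early frames dark, late bright.
        plt.plot(track[i:i + 2, 0], track[i:i + 2, 1], color=colors[i])
    if gt is not None:
        plt.plot(gt[0], gt[1], "rx", markersize=10)  # ground truth as a red cross
    plt.gca().invert_yaxis()  # image coordinates: y increases downward
    plt.show()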
Related Links
Several works were developed concurrently with ours:
TAPIR is a feed-forward point tracker with a matching stage inspired by TAP-Net and a refinement stage inspired by PIPs. The model demonstrates accurate tracking for visible points; however, it struggles to predict the positions of occluded points.
Tracking Everything Everywhere All At Once optimizes a volumetric representation for each video at test time, refining estimated correspondences in a canonical space. The model is currently based on RAFT tracks and is less accurate than CoTracker, but it could potentially be used to refine CoTracker tracks.
Multi-Flow Tracking estimates optical flow between distant frames and selects the most reliable chain of optical flows.
BibTeX
@article{karaev2023cotracker,
  author  = {Nikita Karaev and Ignacio Rocco and Benjamin Graham and Natalia Neverova and Andrea Vedaldi and Christian Rupprecht},
  title   = {{CoTracker}: It is Better to Track Together},
  journal = {arXiv preprint},
  year    = {2023}
}