SAM 2 is the first unified model for segmenting objects across images and videos. You can use a click, box, or mask as input to select an object in any image or video frame.
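To make this prompting interface concrete, here is a minimal sketch of selecting an object in a single image from one foreground click. It assumes the predictor API from the open-source code release; the config and checkpoint paths are placeholders, and exact names and arguments may differ between versions.

```python
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config and checkpoint paths for a released SAM 2 model.
model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
predictor = SAM2ImagePredictor(model)

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

# One foreground click at pixel (x=500, y=375); label 1 = foreground, 0 = background.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks with confidence scores
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```

A box prompt can be passed in place of the click (for example via a `box` argument), and the same predictor also accepts mask prompts.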
Select objects and make adjustments across video frames
Using SAM 2, you can select one or more objects in a video frame, then use additional prompts on any frame to refine the model's predictions.
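A minimal sketch of this workflow with the video predictor from the open-source release is shown below: one click selects the object on the first frame, a second prompt on a later frame corrects the prediction, and the result is propagated through the video. Paths are placeholders, and exact method names (for example `add_new_points_or_box`) may differ between code versions.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config and checkpoint paths for a released SAM 2 model.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml",
                                       "checkpoints/sam2.1_hiera_large.pt")

with torch.inference_mode():
    # The release expects the video as a directory of JPEG frames.
    state = predictor.init_state(video_path="videos/example_frames")

    # Select an object with a single foreground click on frame 0 (label 1 = foreground).
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Refine on a later frame, e.g. a background click (label 0) to remove a wrongly included region.
    predictor.add_new_points_or_box(
        state, frame_idx=50, obj_id=1,
        points=np.array([[250, 220]], dtype=np.float32),
        labels=np.array([0], dtype=np.int32),
    )

    # Propagate the prompts through the whole video to obtain a masklet per object.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits > 0.0).cpu().numpy()
```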
Robust segmentation, even in unfamiliar videos
SAM 2 delivers strong zero-shot performance on objects, images, and videos not seen during model training, enabling use in a wide range of real-world applications.
Real-time interactivity and results
SAM 2 is designed for efficient video processing with streaming inference to enable real-time, interactive applications.
State-of-the-art performance for object segmentation
SAM 2 outperforms the best models in the field for object segmentation in videos and images.
Highlights
SAM 2 improves on SAM for segmentation in images
SAM 2 outperforms existing video object segmentation models, especially for tracking object parts
SAM 2 requires less interaction time than existing interactive video segmentation methods
Try it yourself
Track an object across any video interactively with as little as a single click on one frame, and create fun effects.
SAM 2 brings state-of-the-art video and image segmentation capabilities into a single model, while preserving a simple design and fast inference speed.
Model architecture
Meta Segment Anything Model 2 design
The SAM 2 model extends the promptable capability of SAM to the video domain by adding a per-session memory module that captures information about the target object in the video. This allows SAM 2 to track the selected object throughout all video frames, even if the object temporarily disappears from view, because the model retains context about the object from previous frames. SAM 2 also supports corrections to the mask prediction based on additional prompts on any frame.
SAM 2’s streaming architecture—which processes video frames one at a time—is also a natural generalization of SAM to the video domain. When SAM 2 is applied to images, the memory module is empty and the model behaves like SAM.
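The frame-by-frame loop can be sketched schematically as below. This is a deliberately simplified toy, not the released implementation: every function here is an illustrative stand-in, and the point is only the control flow, in which each frame is segmented using a rolling memory of earlier frames, while an empty memory reduces to per-image, SAM-style prediction.

```python
from collections import deque
from typing import Dict, List, Optional
import numpy as np

# Illustrative stand-ins for the real modules (image encoder, memory attention,
# mask decoder, memory encoder). Shapes: frames are HxWx3 float arrays in [0, 1],
# prompts are optional HxW binary seed masks.
def image_encoder(frame: np.ndarray) -> np.ndarray:
    return frame.astype(np.float32)

def memory_attention(features: np.ndarray, memory: List[np.ndarray]) -> np.ndarray:
    # Condition current features on the average of remembered frames.
    return 0.5 * features + 0.5 * np.mean(memory, axis=0)

def mask_decoder(features: np.ndarray, prompt: Optional[np.ndarray]) -> np.ndarray:
    # Threshold the features; a prompt biases the decision toward the selected object.
    bias = np.zeros(features.shape[:2], dtype=np.float32) if prompt is None else prompt.astype(np.float32)
    return (features.mean(axis=-1) + bias > 0.5).astype(np.uint8)

def memory_encoder(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Remember this frame's features gated by its predicted mask.
    return features * mask[..., None]

def segment_video(frames: List[np.ndarray],
                  prompts: Dict[int, np.ndarray],
                  memory_size: int = 7) -> List[np.ndarray]:
    """Streaming loop: process frames one at a time against a per-session memory bank."""
    memory: deque = deque(maxlen=memory_size)
    masks = []
    for t, frame in enumerate(frames):
        feats = image_encoder(frame)
        if memory:  # with an empty memory (a single image), this is SAM-style prediction
            feats = memory_attention(feats, list(memory))
        mask = mask_decoder(feats, prompts.get(t))
        memory.append(memory_encoder(feats, mask))
        masks.append(mask)
    return masks
```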
The Segment Anything Video Dataset
A large and diverse video segmentation dataset
SAM 2 was trained on a large and diverse set of videos and masklets (object masks over time), created by applying SAM 2 interactively in a model-in-the-loop data engine. The training data includes the SA-V dataset, which we are open-sourcing.
Geographically diverse, real-world scenarios collected across 47 countries
Annotations include whole objects, parts, and challenging occlusions
Access our research
Open innovation
To enable the research community to build upon this work, we’re publicly releasing a pretrained Segment Anything Model 2, along with the SA-V dataset, a demo, and code.
The video object segmentation outputs from SAM 2 could be used as input to other AI systems, such as modern video generation models, to enable precise editing capabilities.
Extensible inputs
SAM 2 can be extended to take other types of input prompts in the future, enabling creative ways of interacting with objects in real-time or live video.