RoMo: Robust Motion Segmentation Improves Structure from Motion
ICCV 2025
Adobe Research, Simon Fraser University
* Equal Contribution † Equal Advising
Abstract
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
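To illustrate the epipolar cue mentioned above (a generic sketch, not the authors' implementation): given a fundamental matrix estimated from optical-flow correspondences between two frames, pixels whose flow violates the epipolar constraint are likely dynamic and can be scored by their Sampson distance. The function name and thresholding below are hypothetical.

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """Sampson (first-order epipolar) distance for correspondences.

    F:  (3, 3) fundamental matrix between the two frames.
    x1: (N, 2) pixel coordinates in frame 1 (e.g. a pixel grid).
    x2: (N, 2) corresponding pixels in frame 2 (x1 + optical flow).
    Returns an (N,) array; large values suggest motion inconsistent
    with a static scene under the estimated camera motion.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])   # homogeneous coordinates
    p2 = np.hstack([x2, ones])
    Fp1 = p1 @ F.T               # epipolar lines in frame 2
    Ftp2 = p2 @ F                # epipolar lines in frame 1
    num = np.einsum('ij,ij->i', p2, Fp1) ** 2
    den = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    return num / np.maximum(den, 1e-12)
```

Thresholding this residual gives a per-pixel static/dynamic score that can seed a segmentation model, with F re-estimated from the current static set in an iterative loop.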
Examples of RoMo Motion Masks
Video Comparisons
Moving object segmentation results on DAVIS16, SegTrackv2 and FBMS-59. We compare our zero-shot method against OCLR-adap, a motion segmentation approach that is fully supervised on synthetic datasets at training time and further adapted to each video at test time.
Input Video
RoMo (ours)
OCLR-adap
Estimated Camera Trajectories and Motion Masks on Casual Motion Dataset
Camera comparison. GT: —, Estimate: ---
Input Video
RoMo (Ours)
RoMo Motion Masks on MPI Sintel Dataset
Application in Distractor Removal
Our method can be applied to robust 3D reconstruction in the presence of transient distractors when the input image set is sampled from a video. We show an example of this application on the patio scene from the NeRF On-the-go dataset.
Applying our masks to the photometric loss while training a 3D Gaussian Splatting model can be as effective as robust training methods such as SpotLessSplats.
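A minimal sketch of how a motion mask can gate the photometric loss during reconstruction (hypothetical helper, assuming a mask with 1 for static pixels and 0 for dynamic ones; not the actual 3DGS or SpotLessSplats code):

```python
import numpy as np

def masked_photometric_loss(rendered, target, motion_mask):
    """Mean L1 photometric loss over static pixels only.

    rendered, target: (H, W, 3) images.
    motion_mask:      (H, W) with 1.0 for static pixels (supervised)
                      and 0.0 for dynamic pixels (ignored).
    """
    weights = motion_mask[..., None]          # broadcast over color channels
    diff = np.abs(rendered - target) * weights
    n = np.maximum(weights.sum() * rendered.shape[-1], 1e-8)
    return diff.sum() / n
```

Dynamic pixels thus contribute no gradient, so transient distractors are never "baked into" the static reconstruction.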
BibTeX
@article{golisabour2024romo,
title={{RoMo}: Robust Motion Segmentation Improves Structure from Motion},
author={Goli, Lily and Sabour, Sara and Matthews, Mark and Brubaker, Marcus and Lagun, Dmitry and Jacobson, Alec and Fleet, David J. and Saxena, Saurabh and Tagliasacchi, Andrea},
journal={arXiv:2411.18650},
year={2024}
}