Layer-Aware Video Composition via Split-then-Merge
TL;DR
Split-then-Merge (StM) is a new video composition framework that improves control and handles data scarcity by splitting unlabeled videos into foreground and background layers, and then self-composing them.
Method
Decomposer
Composer
The StM Decomposer integrates off-the-shelf models to split unlabeled videos. First, motion segmentation generates a foreground mask, which is used to extract the foreground layer. An inpainting model then fills the "holes" that the mask leaves in the background video. The Composer is trained to reconstruct the ground-truth video latent from the foreground, background, and text inputs. A transformation-aware training pipeline and an identity-preservation loss keep the model from taking "copy-paste" shortcuts, so that it learns genuine affordance.
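The split step described above can be illustrated with a toy sketch. It is not the authors' implementation: frames are tiny grayscale grids stored as nested lists, the foreground masks are assumed to be given (standing in for the motion segmentation model), and the hypothetical `fill_holes` helper replaces the inpainting model with a simple mean fill. All function names here are illustrative assumptions.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame: rows of pixel intensities
Mask = List[List[int]]     # 1 = foreground pixel, 0 = background pixel


def split_frame(frame: Frame, mask: Mask) -> Tuple[Frame, Frame]:
    """Split one frame into a foreground layer and a background with holes."""
    # Foreground layer: keep masked pixels, zero out everything else.
    foreground = [[p if m else 0.0 for p, m in zip(row, mrow)]
                  for row, mrow in zip(frame, mask)]
    # Background layer: zero out masked pixels, leaving "holes".
    background = [[0.0 if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(frame, mask)]
    return foreground, background


def fill_holes(background: Frame, mask: Mask) -> Frame:
    """Hypothetical stand-in for an inpainting model.

    Fills every hole with the mean of the visible background pixels.
    """
    visible = [p for row, mrow in zip(background, mask)
               for p, m in zip(row, mrow) if not m]
    fill = sum(visible) / len(visible) if visible else 0.0
    return [[fill if m else p for p, m in zip(row, mrow)]
            for row, mrow in zip(background, mask)]


def decompose(video: List[Frame],
              masks: List[Mask]) -> Tuple[List[Frame], List[Frame]]:
    """Apply the split and the hole filling to every frame of a video."""
    fg_layers, bg_layers = [], []
    for frame, mask in zip(video, masks):
        fg, bg = split_frame(frame, mask)
        fg_layers.append(fg)
        bg_layers.append(fill_holes(bg, mask))
    return fg_layers, bg_layers


# One 2x2 frame whose bottom-right pixel is marked as foreground.
video = [[[1.0, 2.0], [3.0, 9.0]]]
masks = [[[0, 0], [0, 1]]]
fg_layers, bg_layers = decompose(video, masks)
```

In this toy run the foreground layer keeps only the masked pixel, and the hole it leaves in the background is filled with the mean of the remaining pixels. In the pipeline described above, the resulting foreground and background layers, together with a text prompt, would be the inputs from which the Composer reconstructs the original video.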
StM-50K Test Examples
Panels: Input Foreground, Input Background, StM Result
"A pig is walking in the forest"
Additional Results (Outdoor)
Panels: Input Foreground, Input Background, StM Result
"A car is turning at the crossroads"
Additional Results (Indoor)
Panels: Input Foreground, Input Background, StM Result
"A pig is wandering around in the balcony"
Logically Impossible Composition
Panels: Input Foreground, Input Background, StM Result
"A boat is floating on a road"
Multi-Object Composition
Panels: Input Foreground, Input Background, StM Result
"A pig is walking indoors"
BibTeX
@article{kara2025stm,
  title={Layer-Aware Video Composition via Split-then-Merge},
  author={Kara, Ozgur and Chen, Yujia and Yang, Ming-Hsuan and Rehg, James M. and Chu, Wen-Sheng and Tran, Du},
  journal={arXiv preprint},
  year={2025}
}