Layer-Aware Video Composition
via Split-then-Merge

Ozgur Kara¹† Yujia Chen² Ming-Hsuan Yang² James M. Rehg¹ Wen-Sheng Chu²‡ Du Tran²‡

¹University of Illinois Urbana-Champaign ²Google

†Work done during an internship at Google ‡Joint last authors

Paper | Supplementary | Code (Coming Soon) | Dataset (Coming Soon)

TL;DR

Split-then-Merge (StM) is a new video composition framework that improves control and handles data scarcity by splitting unlabeled videos into foreground and background layers, and then self-composing them.

Method

Decomposer

Composer

The StM Decomposer integrates off-the-shelf models to split unlabeled videos into layers. First, motion segmentation generates a foreground mask, which is used to extract the foreground layer. An inpainting model then fills the "holes" left in the masked background video. The Composer is trained to reconstruct the ground-truth video latent from foreground, background, and text inputs. A transformation-aware training pipeline and an identity-preservation loss keep the model from taking "copy-paste" shortcuts, so that it learns genuine affordance.
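The Decomposer steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`split_layers`, `fill_holes_placeholder`, `copy_paste_merge`) are hypothetical, the mask is hand-made rather than produced by a motion segmentation model, and a per-frame mean-color fill stands in for a real inpainting model. The sketch operates on pixels, whereas the Composer described above works on video latents.

```python
import numpy as np


def split_layers(video, mask):
    """Split a video into foreground and background layers.

    video: float array of shape (T, H, W, 3), values in [0, 1]
    mask:  bool array of shape (T, H, W), True where the foreground is
    """
    m = mask[..., None]                    # broadcast the mask over channels
    foreground = np.where(m, video, 0.0)   # keep only the masked pixels
    background = np.where(m, 0.0, video)   # leaves "holes" under the mask
    return foreground, background


def fill_holes_placeholder(background, mask):
    """Stand-in for an inpainting model (ASSUMPTION, not the authors' model):
    fill each frame's hole with that frame's mean visible background color."""
    filled = background.copy()
    for t in range(background.shape[0]):
        visible = ~mask[t]
        if visible.any() and mask[t].any():
            filled[t][mask[t]] = background[t][visible].mean(axis=0)
    return filled


def copy_paste_merge(foreground, background, mask):
    """Naive per-pixel composite: paste the foreground over the background."""
    return np.where(mask[..., None], foreground, background)


# Toy data: 4 random frames of 8x8 RGB with a fixed rectangular "object".
rng = np.random.default_rng(0)
video = rng.random((4, 8, 8, 3))
mask = np.zeros((4, 8, 8), dtype=bool)
mask[:, 2:5, 3:6] = True

fg, bg = split_layers(video, mask)
bg_filled = fill_holes_placeholder(bg, mask)
recon = copy_paste_merge(fg, bg_filled, mask)

# Pasting the untouched foreground back reproduces the source video exactly.
assert np.allclose(recon, video)
```

The final assertion shows why self-composition needs care: when the foreground layer is left unchanged, a plain paste already reconstructs the original video, so a model trained on such pairs could learn that shortcut instead of genuine composition. This is the failure mode that the transformation-aware training and identity-preservation loss are described as addressing.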

StM-50K Test Examples

Input Foreground

Input Background

StM Result

"A pig is walking in the forest"

Additional Results (Outdoor)

Input Foreground

Input Background

StM Result

"A car is turning at the crossroads"

Additional Results (Indoor)

Input Foreground

Input Background

StM Result

"A pig is wandering around in the balcony"

Logically Impossible Composition

Input Foreground

Input Background

StM Result

"A boat is floating on a road"

Multi Object Composition

Input Foreground

Input Background

StM Result

"A pig is walking indoors"

SOTA Comparison

Input Foreground

Input Background

StM (Ours)

SkyReels [1]

AnyV2V [2]

Qwen+I2V [3]

PBE+I2V [4]

Copy-Paste

"A goat is running on a snowy road"

Input Foreground

Input Background

StM (Ours)

SkyReels [1]

AnyV2V [2]

Qwen+I2V [3]

PBE+I2V [4]

Copy-Paste

"A boat is moving on the water"

References

[1] Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. SkyReels-A2: Compose Anything in Video Diffusion Transformers. arXiv preprint arXiv:2504.02436, 2025.

[2] Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks. Transactions on Machine Learning Research, 2024.

[3] Chenfei Wu et al. Qwen-Image Technical Report. 2025.

[4] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based Image Editing with Diffusion Models. CVPR 2023.

BibTeX

@article{kara2025stm,
  title={Layer-Aware Video Composition via Split-then-Merge},
  author={Kara, Ozgur and Chen, Yujia and Yang, Ming-Hsuan and Rehg, James M. and Chu, Wen-Sheng and Tran, Du},
  journal={arXiv preprint},
  year={2025}
}
