MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
| Inputs | Camera: Static | Camera: Dolly out | Camera: Orbit right + Pedestal up |
|---|---|---|---|
| Object Motions | | | |
Abstract
This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intentions on the motion design, where both camera movements and scene-space object motions must be specified jointly; and second, representing motion information that can be effectively utilized by a video diffusion model to synthesize the image animations.
To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance the creative workflows in digital content creation and adapt to various image and video editing applications.
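To make the "scene-space to screen-space" idea concrete, the following is a minimal sketch of the standard pinhole-camera relation from classical graphics: a static scene point with known depth traces a 2D trajectory on screen as the camera dollies in. It is illustrative only and is not the MotionCanvas implementation. The function names, the intrinsics (`f`, `cx`, `cy`), and the assumption that a per-pixel depth value is available are all hypothetical.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]


def unproject(u: float, v: float, depth: float,
              f: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift a pixel with known depth to a 3D point in the first camera frame."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def project(x: float, y: float, z: float,
            f: float, cx: float, cy: float) -> Point2D:
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    return (f * x / z + cx, f * y / z + cy)


def dolly_in_track(u: float, v: float, depth: float, f: float,
                   cx: float, cy: float, step: float,
                   num_frames: int) -> List[Point2D]:
    """Screen-space track of a static scene point while the camera moves
    forward along its optical axis by `step` per frame (hypothetical helper)."""
    x, y, z = unproject(u, v, depth, f, cx, cy)
    track = []
    for t in range(num_frames):
        z_t = z - t * step  # the point gets closer as the camera advances
        if z_t <= 0:
            raise ValueError("camera moved past the point")
        track.append(project(x, y, z_t, f, cx, cy))
    return track


# Example: a pixel right of the image center drifts outward during a dolly in.
track = dolly_in_track(u=60.0, v=50.0, depth=2.0, f=100.0,
                       cx=50.0, cy=50.0, step=0.5, num_frames=3)
```

Points closer to the camera move faster on screen than distant ones, which is why depth matters when a camera move is expressed as 2D point trajectories.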
Comparison & Showcases
Effectiveness in Cinematic Shot Design (Joint camera and object motion control in a 3D-scene-aware manner).
| Camera motion | Object global motion | Object local motion |
|---|---|---|
| [ Pedestal up + Dolly in ] | (image) | (image) |
| [ Static ] | (image) | (image) |
| [ Roll clockwise ] | (image) | (image) |
| [ Tilting up ] | (image) | ⌀ |
| [ Dolly in ] | (image) | ⌀ |
| [ Trucking right + Pedestal up ] | (image) | ⌀ |
| [ Orbit right ] | ⌀ | (image) |
| [ Panning left + Tilting up ] | (image) | ⌀ |
| [ Dolly in ] | (image) | ⌀ |
| [ Pedestal down + Tilting up ] | (image) | ⌀ |
| [ Panning left + Dolly in ] | ⌀ | ⌀ |
| [ Panning left ] | (image) | ⌀ |
| [ Trucking right ] | (image) | ⌀ |
| [ Static ] | (image) | ⌀ |
| [ Trucking left + Dolly out ] | (image) | ⌀ |
| [ Panning left + Pedestal up + Orbit right ] | ⌀ | ⌀ |

Results for each example are shown for DragAnything, MOFA-Video, and MotionCanvas (Ours).
Applications
Shot Design with Joint Camera and Object Control.
| Inputs | Camera: Trucking right | Camera: Zoom in | Camera: Roll clockwise |
|---|---|---|---|

| Inputs | Camera: Static | Camera: Dolly in | Camera: Diagonal bottom-right |
|---|---|---|---|

| Camera: Dolly out | Camera: Orbit left | Camera: Pedestal up |
|---|---|---|
| Camera: Orbit left | Camera: [Trucking left + Pedestal up] | Camera: Dolly in |
| Camera: Dolly out | Camera: Dolly in | Camera: Trucking left |
Long Videos with Complex Trajectories.
| Input image | Motion control signal | Result sample #1 | Result sample #2 |
|---|---|---|---|

| Input image | Result sample #1 | Result sample #2 |
|---|---|---|
Object Local Motion Control.
[Four examples, each showing a row of four input images above the corresponding results.]
Additional Applications
Motion Transfer.
| Input source video | |||
| Transfer results | |||
Video Editing.
| Input video | |||
| Editing results | |||
Comparisons with Baseline Methods
Camera Motion Control.
| Input camera control | MotionCtrl | CameraCtrl | Ours |
|---|---|---|---|
| [ Dolly in + Zoom out ] (Dolly zoom) | | | |
| [ Trucking right ] | | | |
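The dolly zoom above pairs a dolly in with a zoom out so that the subject keeps its on-screen size while the background perspective changes. Under a simple pinhole model (an assumption made here for illustration, not a detail taken from this page), on-screen size is proportional to focal length divided by subject distance, so the focal length has to shrink in proportion to the distance. The helper names below are hypothetical.

```python
def dolly_zoom_focal(f0: float, d0: float, d: float) -> float:
    """Focal length that keeps a subject's on-screen size constant when the
    camera-to-subject distance changes from d0 to d (pinhole model)."""
    if d0 <= 0 or d <= 0:
        raise ValueError("distances must be positive")
    return f0 * d / d0


def projected_size(height: float, f: float, d: float) -> float:
    """On-screen height of an object of real height `height` at distance d."""
    return f * height / d


# Dolly in from 4 units to 2 units: the focal length halves (zoom out).
f_start, d_start, d_end = 50.0, 4.0, 2.0
f_end = dolly_zoom_focal(f_start, d_start, d_end)
size_before = projected_size(1.0, f_start, d_start)
size_after = projected_size(1.0, f_end, d_end)
```

Because the two effects cancel only at the subject's depth, objects at other depths still change size, which produces the characteristic stretching of the background.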
Object Motion Control.
| Inputs | DragAnything | MOFA-Video |
|---|---|---|
| Camera: [ Static ] | TrackDiffusion | Ours |

| Inputs | DragAnything | MOFA-Video |
|---|---|---|
| Camera: [ Trucking right ] | TrackDiffusion | Ours |
Camera Motion Control on DAVIS.
| Reference | MotionCtrl | CameraCtrl | Ours |
|---|---|---|---|
Ablation Study
Camera Motion Representation.
| Input | Gauss. Map | Plücker | Point Traj Coeff. (Ours) |
|---|---|---|---|
| [ Dolly out + Panning right ] | | | |
| [ Roll clockwise + Zoom out ] | | | |
Bounding Box Conditioning.
| Input | Ours (coord) | Ours |
|---|---|---|
Additional Analysis
Effect of Point Track Density on Camera Motion Control
| | Density=0.1 | Density=0.4 | Density=0.7 | Density=1.0 |
|---|---|---|---|---|
| Input point track | | | | |
| Results | | | | |
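As a rough illustration of what varying the point-track density could look like in practice, the sketch below keeps a random fraction of a set of point tracks. This page does not define "density", so treating it as the fraction of tracks retained is an assumption, and `subsample_tracks` is a hypothetical helper rather than part of MotionCanvas.

```python
import random
from typing import List, Sequence, Tuple

Track = List[Tuple[float, float]]  # one (x, y) position per frame


def subsample_tracks(tracks: Sequence[Track], density: float,
                     seed: int = 0) -> List[Track]:
    """Keep roughly `density` (a fraction in (0, 1]) of the tracks,
    chosen uniformly at random and returned in their original order."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    if not tracks:
        return []
    keep = max(1, round(density * len(tracks)))
    rng = random.Random(seed)  # fixed seed for reproducible selection
    indices = sorted(rng.sample(range(len(tracks)), keep))
    return [tracks[i] for i in indices]


# Example: 100 single-frame tracks reduced to the densities in the table above.
all_tracks = [[(float(i), 0.0)] for i in range(100)]
sparse = subsample_tracks(all_tracks, 0.1)
dense = subsample_tracks(all_tracks, 1.0)
```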
Effect of Text Prompt.
- We show camera motion control of "dolly in" with different levels of text detail.
| "A man." | "A man crossing a stream." | "A man with a red backpack steps over a stream in a mountain valley." |
|---|---|---|
| "A man wearing a blue flannel shirt, hiking boots, and a red backpack carefully steps across a rocky stream in a picturesque valley surrounded by rugged mountains." | "A man crossing a stream. It is raining." | "A man crossing a stream and turning around." |
Essentiality of Camera-aware and Camera-object-aware Transformations
| - | Inputs | Preview | w/o transform | w/ transform (Ours) |
|---|---|---|---|---|
| Camera-aware transformation | | | | |
| Camera-object-aware transformation | | | | |
Large Camera-motion Results
MotionCanvas 32-frame version.





















































