LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
Ka Leong Cheng4,2, Qifeng Chen4, Yujun Shen2, Limin Wang†,1
2Ant Group, 3Zhejiang University, 4The Hong Kong University of Science and Technology
†corresponding author
Abstract
The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis.
Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements.
In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory.
That way, our new interaction paradigm not only inherits the convenience of 2D dragging but also facilitates trajectory control in 3D space, broadening the scope of creativity.
We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points.
These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal.
Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images.
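To make the idea of abstracting an object mask into a few cluster points concrete, here is a minimal sketch. It assumes plain k-means over the mask's pixel coordinates and reads each point's relative depth from a per-pixel depth map. The helper name `mask_to_control_points`, its parameters, and the (x, y, depth, instance id) output layout are all hypothetical, chosen for illustration; this is not the released LeviTor code.

```python
import numpy as np

def mask_to_control_points(mask, depth, instance_id, k=4, iters=20, seed=0):
    """Abstract one binary object mask into k cluster points (hypothetical helper).

    Returns an array of shape (k, 4) holding x, y, relative depth, instance id.
    """
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)
    if len(pixels) < k:
        raise ValueError("mask has fewer pixels than requested cluster points")

    # Initialise the centers from k distinct mask pixels.
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]

    for _ in range(iters):
        # Assign every mask pixel to its nearest center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its pixels; keep it if the cluster is empty.
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    # Assumption: sample the depth map at the rounded center location.
    cx = np.clip(np.rint(centers[:, 0]).astype(int), 0, mask.shape[1] - 1)
    cy = np.clip(np.rint(centers[:, 1]).astype(int), 0, mask.shape[0] - 1)
    point_depth = depth[cy, cx]

    ids = np.full(k, instance_id, dtype=np.float64)
    return np.column_stack([centers, point_depth, ids])


# Toy example: a rectangular "object" and a depth map that increases left to right.
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 20:44] = True
depth = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
points = mask_to_control_points(mask, depth, instance_id=1, k=4)
print(points.shape)  # (4, 4)
```

Each row of `points` is one control point. How these points are encoded before they condition the video diffusion model is not specified here, so the sketch stops at the point list.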
Note: Please refresh the webpage if the GIFs appear to be out of sync.
Showcases
Controlled Occlusion Generation with the Same User-Interactive Trajectory
| Start Image & User Input | Generation Results | Start Image & User Input | Generation Results | Start Image & User Input | Generation Results |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
In the bell-swinging example, the 2D trajectory moves the bell to the right first and then to the left. Assigning different depth values to this same trajectory yields two distinct swings: the top bell first leans to the back-right and then to the front-left, while the bottom bell first leans to the front-right and then to the back-left.
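The bell example can be sketched in a few lines: one shared 2D drag path is lifted into two different 3D trajectories by attaching different relative depths. The coordinates, the depth values, the convention that larger depth means farther from the camera, and the helper `lift_to_3d` are all assumptions made for illustration.

```python
import numpy as np

# Shared 2D drag path (hypothetical pixel coordinates): right first, then left.
xy = np.array([[100, 80], [130, 82], [100, 80], [70, 82]], dtype=float)

# Hypothetical relative depths in [0, 1]; assumed convention: larger = farther away.
depth_top = np.array([0.5, 0.7, 0.5, 0.3])     # back-right, then front-left
depth_bottom = np.array([0.5, 0.3, 0.5, 0.7])  # front-right, then back-left

def lift_to_3d(xy, depth):
    """Attach one relative depth value to every 2D trajectory point."""
    if len(xy) != len(depth):
        raise ValueError("need exactly one depth value per trajectory point")
    return np.column_stack([xy, depth])

traj_top = lift_to_3d(xy, depth_top)
traj_bottom = lift_to_3d(xy, depth_bottom)

# The image-plane paths are identical; only the depth column differs.
same_2d_path = np.array_equal(traj_top[:, :2], traj_bottom[:, :2])
same_depth = np.array_equal(traj_top[:, 2], traj_bottom[:, 2])
print(same_2d_path, same_depth)  # True False
```

A purely 2D drag interface cannot tell these two inputs apart, which is the ambiguity that the added depth dimension is meant to resolve.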
Better Control for Forward and Backward Object Movements in Relation to the Lens
| Start Image & User Input | Generation Results | Start Image & User Input | Generation Results | Start Image & User Input | Generation Results |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Implementation of Complex Motions like Orbiting
| Start Image & User Input | Generation Results | Start Image & User Input | Generation Results | Start Image & User Input | Generation Results |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Comparisons
Controlled Occlusion Generation with the Same User-Interactive Trajectory
| Start Image & User Input (Ours) | Generation Results (Ours) | Start Image & User Input (DragAnything) | Generation Results (DragAnything) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Start Image & User Input (Ours) | Generation Results (Ours) | Start Image & User Input (DragNUWA) | Generation Results (DragNUWA) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| Start Image & User Input (Ours) | Generation Results (Ours) | Start Image & User Input (DragAnything) | Generation Results (DragAnything) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Start Image & User Input (Ours) | Generation Results (Ours) | Start Image & User Input (DragNUWA) | Generation Results (DragNUWA) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
Better Control for Forward and Backward Object Movements in Relation to the Lens
| Start Image & User Input (Ours) | Generation Results (Ours) | Generation Results (DragAnything) | Generation Results (DragNUWA) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Start Image & User Input (Ours) | Generation Results (Ours) | Generation Results (DragAnything) | Generation Results (DragNUWA) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
Implementation of Complex Motions like Orbiting
| Start Image & User Input (Ours) | Generation Results (Ours) | Generation Results (DragAnything) | Generation Results (DragNUWA) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |

| Start Image & User Input (Ours) | Generation Results (Ours) | Generation Results (DragAnything) | Generation Results (DragNUWA) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
Ablations
Ablation on Depth and Instance Information
| Start Image & User Input | Generation Results (Ours) | Generation Results (w/o Instance) | Generation Results (w/o Depth) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
Ablation on the Number of Inference Control Points
| Start Image & User Input | Generation Results with Default Points | Generation Results with Dense Points | Start Image & User Input | Generation Results with Default Points | Generation Results with Dense Points |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |

| Start Image & User Input | Generation Results with Default Points | Generation Results with Dense Points |
|---|---|---|
| ![]() | ![]() | ![]() |
Comparison with Single-Point Controlled Video Synthesis
| Start Image & User Input | Generation Results with Default Points | Generation Results with Single-Point | Start Image & User Input | Generation Results with Default Points | Generation Results with Single-Point |
|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |

| Start Image & User Input | Generation Results with Default Points | Generation Results with Single-Point |
|---|---|---|
| ![]() | ![]() | ![]() |