LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Hanlin Wang¹,², Hao Ouyang², Qiuyu Wang², Wen Wang³,²,
Ka Leong Cheng⁴,², Qifeng Chen⁴, Yujun Shen², Limin Wang†,¹

¹State Key Laboratory for Novel Software Technology, Nanjing University,
²Ant Group, ³Zhejiang University, ⁴The Hong Kong University of Science and Technology
†Corresponding author

Abstract

The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images.
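The idea of abstracting an object mask into a few cluster points can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes plain K-means over the mask's pixel coordinates, a per-pixel relative depth map, and an integer instance id, and the function name `mask_to_control_points` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np


def mask_to_control_points(mask, depth, instance_id, k=4, iters=20, seed=0):
    """Summarize one object mask as k rows of (x, y, depth, instance_id).

    Hypothetical helper: uses plain K-means on pixel coordinates, which is
    an assumption and not necessarily the clustering used by LeviTor.
    """
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)
    if len(pixels) == 0:
        return np.zeros((0, 4))

    k = min(k, len(pixels))
    rng = np.random.default_rng(seed)
    # Initialize the centers from k distinct mask pixels.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]

    for _ in range(iters):
        # Assign every mask pixel to its nearest center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    # Read the relative depth at the pixel nearest to each center.
    cols = np.clip(np.rint(centers[:, 0]).astype(int), 0, mask.shape[1] - 1)
    rows = np.clip(np.rint(centers[:, 1]).astype(int), 0, mask.shape[0] - 1)
    depths = depth[rows, cols]
    ids = np.full(k, instance_id, dtype=np.float64)
    return np.column_stack([centers, depths, ids])


# Usage: a square object in a 20x20 frame with a constant relative depth.
mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 5:15] = True
depth = np.full((20, 20), 0.5)
points = mask_to_control_points(mask, depth, instance_id=3, k=4)
```

Each row of `points` carries a position, a depth, and an instance id, which mirrors the three kinds of information the abstract says accompany the control signal; how those values are encoded for the video diffusion model is not specified here.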

Showcases

Controlled Occlusion Generation with the Same User-Interactive Trajectory

Start Image & User Input	Generation Results

In the bell-swinging example, the 2D trajectory moves the bell to the right first and then to the left. Assigning different depth values along this same 2D trajectory yields two distinct swings: the top bell first leans to the back-right and then to the front-left, while the bottom bell first leans to the front-right and then to the back-left.
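The bell example can be made concrete with a small sketch of how one 2D drag becomes two different 3D trajectories. The coordinates, the depth values, the helper name `lift_trajectory`, and the convention that a larger relative depth means farther from the camera are all assumptions made for illustration, not values taken from the project.

```python
import numpy as np


def lift_trajectory(points_2d, relative_depths):
    """Attach one relative depth to each 2D trajectory point.

    Hypothetical helper; returns an array of (x, y, depth) rows.
    """
    points_2d = np.asarray(points_2d, dtype=float)
    relative_depths = np.asarray(relative_depths, dtype=float)
    if points_2d.ndim != 2 or points_2d.shape[1] != 2:
        raise ValueError("points_2d must have shape (n, 2)")
    if relative_depths.shape != (points_2d.shape[0],):
        raise ValueError("need exactly one depth per trajectory point")
    return np.column_stack([points_2d, relative_depths])


# One shared 2D drag: start, swing right, return, swing left (made-up pixels).
drag_2d = [(100, 200), (140, 200), (100, 200), (60, 200)]

# Assumed convention: larger depth = farther from the camera.
back_right_then_front_left = lift_trajectory(drag_2d, [0.5, 0.8, 0.5, 0.2])
front_right_then_back_left = lift_trajectory(drag_2d, [0.5, 0.2, 0.5, 0.8])
```

Both arrays share the same x and y columns and differ only in the depth column, which is the ambiguity a purely 2D drag cannot resolve.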

Better Control for Forward and Backward Object Movements Relative to the Lens

Start Image & User Input	Generation Results

Implementation of Complex Motions like Orbiting

Start Image & User Input	Generation Results

Comparisons

Controlled Occlusion Generation with the Same User-Interactive Trajectory

Start Image & User Input (Ours)	Generation Results (Ours)	Start Image & User Input (DragAnything)	Generation Results (DragAnything)

Start Image & User Input (Ours)	Generation Results (Ours)	Start Image & User Input (DragNUWA)	Generation Results (DragNUWA)

Start Image & User Input (Ours)	Generation Results (Ours)	Start Image & User Input (DragAnything)	Generation Results (DragAnything)

Start Image & User Input (Ours)	Generation Results (Ours)	Start Image & User Input (DragNUWA)	Generation Results (DragNUWA)

Better Control for Forward and Backward Object Movements Relative to the Lens

Start Image & User Input (Ours)	Generation Results (Ours)	Generation Results (DragAnything)	Generation Results (DragNUWA)

Implementation of Complex Motions like Orbiting

Start Image & User Input (Ours)	Generation Results (Ours)	Generation Results (DragAnything)	Generation Results (DragNUWA)

Ablations

Ablation on Depth and Instance Information

Start Image & User Input	Generation Results (Ours)	Generation Results (w/o Instance)	Generation Results (w/o Depth)

Ablation on the Number of Inference Control Points

Start Image & User Input	Generation Results with Default Points	Generation Results with Dense Points

Comparison with Single-Point Controlled Video Synthesis

Start Image & User Input	Generation Results with Default Points	Generation Results with Single-Point
