
Abstraction Levels




We demonstrate the performance of our method on sketches with varying levels of abstraction. We employed CLIPasso to generate three sketches for each subject, covering three abstraction levels. These abstractions are achieved by using different numbers of strokes, specifically 16, 8, and 4 strokes. Note that the motion remains apparent even under very abstract settings.

Sketch Representation


We illustrate the impact of changing the sketch representation. We applied our method to sketches from the TU-Berlin sketch dataset, a human-drawn, class-based sketch dataset, and showcase the results of four representative sketches. Our method was applied directly to the provided SVG files. As can be seen, our method successfully animates the sketches; however, their appearance is not fully preserved when using the default hyperparameters. This can be improved by using lower learning rates.

Trade-off


We demonstrate the trade-off between the quality of the generated motion and the capacity to retain the appearance of the initial sketch. We show the impact of scaling the local learning rate within the range of 0.0001 to 0.01, keeping all parameters constant except for the local learning rate. Observe that as we move from left (0.0001) to right (0.01), the motion in the animations increases, better aligning with the text prompt, but at the cost of preserving the original sketch's appearance. This trade-off gives the user additional control: one may prioritize stronger motion over sketch fidelity.
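The mechanism behind this trade-off can be pictured with a toy gradient step (our own plain-Python simplification, not the actual implementation): the local learning rate directly scales how far the stroke control points may drift from the input sketch in each optimization step.

```python
def sgd_step(points, grads, lr):
    """One gradient-descent step on 2D stroke control points."""
    return [(x - lr * gx, y - lr * gy)
            for (x, y), (gx, gy) in zip(points, grads)]

def max_drift(initial, updated):
    """Largest per-point L1 displacement from the initial sketch."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(initial, updated))

initial = [(0.0, 0.0), (1.0, 0.5)]
grads = [(2.0, -1.0), (-1.5, 0.5)]  # hypothetical gradients from the video prior

low = sgd_step(initial, grads, lr=0.0001)   # high fidelity, subtle motion
high = sgd_step(initial, grads, lr=0.01)    # stronger motion, weaker fidelity
# the 100x larger learning rate permits 100x more drift per step
```

With lr=0.0001 the sketch barely deviates per step, while lr=0.01 permits deviations two orders of magnitude larger: this is exactly the motion-versus-fidelity knob described above.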

Hyperparameter Effects


As described in the main paper, there is an inherent trade-off between the components of our method. Here, we demonstrate how this trade-off can be leveraged to provide further user control over the appearance of the output video by adjusting the method's parameters. Naturally, we observe different effects across sketches, which may be attributed to the video model's prior or to the quality of the initial sketch. In the third column ("+lr local"), we showcase the impact of increasing the learning rate of the local MLP. As evident, in some cases (biking and butterfly) this results in stronger motion without compromising the sketch's appearance. However, in other cases (cobra and boat) it harms the fidelity of the sketch, leading to a complete alteration of the original sketch. In the fourth column ("+translation"), we increased the translation prediction weight. As observed, this indeed causes the objects to move more across the frame compared to the baseline.
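For intuition on the "+translation" setting, the weight can be thought of as a scalar that amplifies the predicted global translation before it is applied to the stroke points. The sketch below is our own illustration of such a knob, not the released code; the parameter name `weight` is an assumption:

```python
def apply_translation(points, predicted_translation, weight=1.0):
    """Shift all stroke points by the predicted per-frame translation,
    scaled by the (assumed) translation prediction weight."""
    tx, ty = predicted_translation
    return [(x + weight * tx, y + weight * ty) for x, y in points]

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
baseline = apply_translation(square, (0.2, 0.1), weight=1.0)
boosted = apply_translation(square, (0.2, 0.1), weight=2.0)  # travels twice as far
```

Because the weight multiplies the whole translation vector, increasing it moves the entire object farther across the frame without otherwise changing its shape.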

Comparing Video Models


In the main paper we utilized the publicly available ModelScope pretrained video model. Here, we examine how our method generalizes to other backbones. In particular, we look at a set of ZeroScope models, tuned across a range of resolutions and frame rates (see https://huggingface.co/cerspense for more details). As observed, our method successfully generalizes to these models with no additional changes. Note that different models do lead to different motion patterns, and some of them may result in different trade-offs between the level of motion and the ability to preserve the sketch. For example, zeroscope-v1-320s (third column) resulted in slower motions, while zeroscope-v2-576w (sixth column) produces more "jumpy" videos.

Ablation


We evaluate the main components of our method. Disabling the local path severely restricts the model's ability to capture natural motion, leading to wobbling and sliding effects rather than abstract motion that fits the sketch. Disabling the global path, or replacing the neural network with direct optimization, leads to results that largely align with the prompts but contain a significant amount of temporal jitter and larger deviations from the input sketch.
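The two paths can be pictured with a toy composition (our own simplification, not the released code): the local path adds per-point offsets that deform the sketch non-rigidly, while the global path applies one shared rigid transform (here a rotation plus a translation) to all points. The ablations correspond to zeroing out one of the two.

```python
import math

def animate_frame(points, local_offsets, angle=0.0, translation=(0.0, 0.0)):
    """Compose the local path (per-point offsets) with the global path
    (a shared rotation + translation). Zero offsets mimic "w/o local";
    identity angle/translation mimic "w/o global"."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty = translation
    out = []
    for (x, y), (dx, dy) in zip(points, local_offsets):
        lx, ly = x + dx, y + dy            # local path: non-rigid deformation
        out.append((c * lx - s * ly + tx,  # global path: rigid motion
                    s * lx + c * ly + ty))
    return out
```

Only the global path can carry the object across the frame, while only the local path can bend or articulate it, which is why removing either one produces the distinct failure modes described above.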

Limitations


Sketch Representation

There exist many ways to represent sketches in vector format, including different types of curves or primitive shapes (such as lines or polygons), different attributes for the shapes (such as stroke width, closed shapes, and more), and different numbers of parameters. Our selection of hyperparameters and network design is based on one specific sketch representation. Below is an example of a sketch of a surfer, defined by a sequence of closed segments of cubic Bezier curves with a relatively high number of control points. As can be seen, the resulting translation is significantly larger than in our typical results. In addition, the surfer's appearance is not well preserved, as its scale changes significantly.
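For reference, each cubic Bezier segment in such a representation is parameterized by four 2D control points; a minimal evaluator in standard Bernstein form (a generic illustration, independent of any particular implementation):

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier segment at parameter t in [0, 1]
    using the Bernstein basis: (1-t)^3, 3(1-t)^2 t, 3(1-t) t^2, t^3."""
    u = 1.0 - t
    b0, b1, b2, b3 = u**3, 3 * u**2 * t, 3 * u * t**2, t**3
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))
```

A closed shape chains many such segments end to end, so a detailed sketch can easily accumulate far more control points than the sparse stroke-based representation our hyperparameters were tuned for.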

Two Objects

Our method assumes that the input sketch depicts a single subject (a common scenario in character animation techniques). When applied directly to sketches involving multiple objects, we observe a degradation in result quality due to this inherent design constraint. Here, for example, we expect the basketball to separate from the player's hand to achieve a natural dribbling motion. However, with our current settings such separation is impossible, since the translation parameters are relative to the object as a whole, of which the basketball is a part. This limitation could be addressed with further technical development.

Scene Sketches

In a similar manner, we observe a degradation in result quality when our method is applied directly to scene sketches. As can be seen in this example, the entire scene moves unnaturally because of the single-object assumption.

Shape Preservation

While the trade-off between the motion quality and the sketch's fidelity can be controlled by altering the hyperparameters, we still observe that sometimes the sketch's identity is harmed. Here, for example, the squirrel's motion is plausible, but the aspect ratio of the original squirrel has changed. It may be possible to improve on this front by leveraging a mesh-based representation of the sketch and using an approximate rigidity loss.

Video Model Prior

Our approach inherits the general nature of text-to-video priors, but it also suffers from their limitations. Such models are trained on large-scale data, but they may be unaware of specific motions, portray strong biases due to their training data, or produce severe artefacts. Here, for example, we show the video produced by our text-to-video backbone model for the text "The ballerina is dancing". As can be seen, the video is of very low quality and contains artefacts, such as in the ballerina's face and hands. However, our method is agnostic to the backbone model and hence could likely use newer models as they become available.

BibTeX

@InProceedings{Gal_2024_CVPR,
  author    = {Gal, Rinon and Vinker, Yael and Alaluf, Yuval and Bermano, Amit and Cohen-Or, Daniel and Shamir, Ariel and Chechik, Gal},
  title     = {Breathing Life Into Sketches Using Text-to-Video Priors},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {4325-4336}
}
      
 