
Abstraction Levels




We demonstrate the performance of our method on sketches with varying levels of abstraction. We employed CLIPasso to generate three sketches for each subject, covering three abstraction levels. These abstractions are achieved by using different numbers of strokes, specifically 16, 8, and 4 strokes. Note that the motion remains apparent even under very abstract settings.

Sketch Representation


We illustrate the impact of changing the sketch representation. We applied our method to sketches from the TU-Berlin sketch dataset, a human-drawn, class-based sketch dataset, and showcase the results of four representative sketches. Our method was applied directly to the provided SVG files. As can be seen, our method successfully animates the sketches; however, their appearance is not fully preserved when using the default hyperparameters. This can be improved by using lower learning rates.

Trade-off


We demonstrate the trade-off between the quality of the generated motion and the capacity to retain the appearance of the initial sketch. We show the impact of scaling the local learning rate within the range of 0.0001 to 0.01, keeping all parameters constant except for the local learning rate. Observe that as we move from left (0.0001) to right (0.01), the motion in the animations increases, better aligning with the text prompt, but at the cost of preserving the original sketch's appearance. This trade-off gives the user additional control: one may prioritize stronger motion over sketch fidelity.
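The mechanism behind this trade-off can be pictured with a toy gradient step (our own plain-Python simplification, not the actual implementation): the local learning rate directly scales how far the stroke control points may drift from the input sketch in each optimization step.

```python
def sgd_step(points, grads, lr):
    """One gradient-descent step on 2D stroke control points."""
    return [(x - lr * gx, y - lr * gy)
            for (x, y), (gx, gy) in zip(points, grads)]

def max_drift(initial, updated):
    """Largest per-point L1 displacement from the initial sketch."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(initial, updated))

initial = [(0.0, 0.0), (1.0, 0.5)]
grads = [(2.0, -1.0), (-1.5, 0.5)]  # hypothetical gradients from the video prior

low = sgd_step(initial, grads, lr=0.0001)   # high fidelity, subtle motion
high = sgd_step(initial, grads, lr=0.01)    # stronger motion, weaker fidelity
# the 100x larger learning rate permits 100x more drift per step
```

With lr=0.0001 the sketch barely deviates per step, while lr=0.01 permits deviations two orders of magnitude larger: this is exactly the motion-versus-fidelity knob described above.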

Hyperparameter Effects


As described in the main paper, there is an inherent trade-off between the components of our method. Here, we demonstrate how this trade-off can be leveraged to provide further user control over the appearance of the output video by adjusting the method's parameters. Naturally, we observe different effects across sketches, which may be attributed to the video model's prior or to the quality of the initial sketch. In the third column ("+lr local"), we showcase the impact of increasing the learning rate of the local MLP. As evident, in some cases (biking and butterfly) this results in stronger motion without compromising the sketch's appearance. However, in other cases (cobra and boat) it harms the fidelity of the sketch, leading to a complete alteration of the original sketch. In the fourth column ("+translation"), we increased the translation prediction weight. As observed, this indeed causes the objects to move more across the frame compared to the baseline.
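For intuition on the "+translation" setting, the weight can be thought of as a scalar that amplifies the predicted global translation before it is applied to the stroke points. The sketch below is our own illustration of such a knob, not the released code; the parameter name `weight` is an assumption:

```python
def apply_translation(points, predicted_translation, weight=1.0):
    """Shift all stroke points by the predicted per-frame translation,
    scaled by the (assumed) translation prediction weight."""
    tx, ty = predicted_translation
    return [(x + weight * tx, y + weight * ty) for x, y in points]

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
baseline = apply_translation(square, (0.2, 0.1), weight=1.0)
boosted = apply_translation(square, (0.2, 0.1), weight=2.0)  # travels twice as far
```

Because the weight multiplies the whole translation vector, increasing it moves the entire object farther across the frame without otherwise changing its shape.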

Comparing Video Models


In the main paper we utilized the publicly available ModelScope pretrained video model. Here, we examine how our method generalizes to other backbones. In particular, we look at a set of ZeroScope models, tuned across a range of resolutions and frame rates (see https://huggingface.co/cerspense for more details). As observed, our method successfully generalizes to these models with no additional changes. Note that different models do lead to different motion patterns, and some of them may result in different trade-offs between the level of motion and the ability to preserve the sketch. For example, zeroscope-v1-320s (third column) resulted in slower motions, while zeroscope-v2-576w (sixth column) produces more "jumpy" videos.

Ablation


We evaluate the main components of our method. Disabling the local path severely restricts the model's ability to capture natural motion, leading to wobbling and sliding effects rather than abstract motion that fits the sketch. Disabling the global path, or replacing the neural network with direct optimization, leads to results that largely align with the prompts but contain a significant amount of temporal jitter and larger deviations from the input sketch.
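The two paths can be pictured with a toy composition (our own simplification, not the released code): the local path adds per-point offsets that deform the sketch non-rigidly, while the global path applies one shared rigid transform (here a rotation plus a translation) to all points. The ablations correspond to zeroing out one of the two.

```python
import math

def animate_frame(points, local_offsets, angle=0.0, translation=(0.0, 0.0)):
    """Compose the local path (per-point offsets) with the global path
    (a shared rotation + translation). Zero offsets mimic "w/o local";
    identity angle/translation mimic "w/o global"."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty = translation
    out = []
    for (x, y), (dx, dy) in zip(points, local_offsets):
        lx, ly = x + dx, y + dy            # local path: non-rigid deformation
        out.append((c * lx - s * ly + tx,  # global path: rigid motion
                    s * lx + c * ly + ty))
    return out
```

Only the global path can carry the object across the frame, while only the local path can bend or articulate it, which is why removing either one produces the distinct failure modes described above.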

Limitations


Sketch Representation

There exist many ways to represent sketches in vector format, including different types of curves or primitive shapes (such as lines or polygons), different attributes for the shapes (such as stroke width, closed shapes, and more), and different numbers of parameters. Our selection of hyperparameters and network design is based on one specific sketch representation. Below is an example of a sketch of a surfer, defined by a sequence of closed segments of cubic Bezier curves with a relatively high number of control points. As can be seen, the resulting translation is significantly larger than in our typical results. In addition, the surfer's appearance is not well preserved, as its scale changes significantly.
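For reference, each cubic Bezier segment in such a representation is parameterized by four 2D control points; a minimal evaluator in standard Bernstein form (a generic illustration, independent of any particular implementation):

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier segment at parameter t in [0, 1]
    using the Bernstein basis: (1-t)^3, 3(1-t)^2 t, 3(1-t) t^2, t^3."""
    u = 1.0 - t
    b0, b1, b2, b3 = u**3, 3 * u**2 * t, 3 * u * t**2, t**3
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))
```

A closed shape chains many such segments end to end, so a detailed sketch can easily accumulate far more control points than the sparse stroke-based representation our hyperparameters were tuned for.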

Two Objects

Our method assumes that the input sketch depicts a single subject (a common scenario in character animation techniques). When applied directly to sketches involving multiple objects, we observe a degradation in result quality due to this inherent design constraint. Here, for example, we expect the basketball to separate from the player's hand to achieve a natural dribbling motion. However, with our current settings such separation is impossible, since the translation parameters are relative to the object as a whole, of which the basketball is a part. This limitation could be addressed with further technical development.

Scene Sketches

In a similar manner, we observe a degradation in result quality when our method is applied directly to scene sketches. As can be seen in this example, the entire scene moves unnaturally because of the single-object assumption.

Shape Preservation

While the trade-off between the motion quality and the sketch's fidelity can be controlled by altering the hyperparameters, we still observe that sometimes the sketch's identity is harmed. Here, for example, the squirrel's motion is plausible, but the aspect ratio of the original squirrel has changed. It may be possible to improve on this front by leveraging a mesh-based representation of the sketch and using an approximate rigidity loss.

Video Model Prior

Our approach inherits the general nature of text-to-video priors, but it also suffers from their limitations. Such models are trained on large-scale data, but they may be unaware of specific motions, portray strong biases due to their training data, or produce severe artefacts. Here, for example, we show the video produced by our text-to-video backbone model for the text "The ballerina is dancing". As can be seen, the video is of very low quality and contains artefacts, such as in the ballerina's face and hands. However, our method is agnostic to the backbone model and hence could likely use newer models as they become available.

BibTeX

@InProceedings{Gal_2024_CVPR,
  author    = {Gal, Rinon and Vinker, Yael and Alaluf, Yuval and Bermano, Amit and Cohen-Or, Daniel and Shamir, Ariel and Chechik, Gal},
  title     = {Breathing Life Into Sketches Using Text-to-Video Priors},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {4325-4336}
}
      
 