AnimateZero:
Video Diffusion Models are Zero-Shot Image Animators
Jiwen Yu1, Xiaodong Cun2*, Chenyang Qi3, Yong Zhang2, Xintao Wang2, Ying Shan2, Jian Zhang1*
1Peking University, 2Tencent AI Lab, 3HKUST
*Corresponding Author.
Teaser: generated images and their output videos.
Our Unique Insights and Proposed Zero-Shot Methodology
Vanilla Video Diffusion Models (VDMs) have the following issues, illustrated in (a) of the figure below:
- Black box: the generation process is still a black box
- Inefficient and uncontrollable: obtaining satisfactory results requires a large amount of trial and error
- Domain gap: results are limited by the domain of the video dataset used during training
We propose a step-by-step video generation pipeline to address these issues, illustrated in (b) of the same figure:
- Decoupling: the video generation process is decoupled into appearance (T2I) and motion (I2V)
- Efficient and controllable: T2I generation is more controllable and efficient compared to T2V, allowing us to obtain satisfactory images before performing I2V to generate videos
- Mitigating domain gap issues: the domain of the T2I model can be fine-tuned to align with the target domain, which is more efficient than tuning the entire video model
Moreover, we discovered that with just zero-shot modifications, we can transform pre-trained T2V models into I2V models, which means that Video Diffusion Models are Zero-Shot Image Animators!
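To make the decoupled pipeline concrete, below is a minimal sketch, assuming the diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint for the T2I stage; animate_image_zero_shot is a hypothetical placeholder for the zero-shot I2V stage, not the actual AnimateZero implementation.

```python
# Minimal sketch of the decoupled T2I -> I2V pipeline (illustrative only).
import torch
from diffusers import StableDiffusionPipeline

def animate_image_zero_shot(image, prompt, num_frames=16):
    """Hypothetical placeholder for the zero-shot I2V stage: animate a fixed
    first frame with a pre-trained T2V model modified without any training."""
    raise NotImplementedError("replace with the actual zero-shot I2V stage")

# Stage 1 (T2I): iterate cheaply and controllably until the appearance is right.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
prompt = "1girl, cherry blossoms, smiling, best quality"
image = pipe(prompt, num_inference_steps=30).images[0]

# Stage 2 (I2V): animate only the image we already accepted.
video_frames = animate_image_zero_shot(image, prompt, num_frames=16)
```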
Gallery
Below, we showcase the videos generated by AnimateZero on a variety of personalized T2I models.
Model: ToonYou
Model: CarDos Anime
Model: Anything V5
Model: Counterfeit V3.0
Model: Realistic Vision V5.1
Model: Photon
Model: helloObject
We also demonstrate an approach to controlling the motion of generated videos with text. In the following examples, we control the final state of the video by interpolating the text embeddings, thereby achieving text-controlled motion; a conceptual sketch follows the examples.
Generated Image
Output Video
+ "happy and smile"
+ "angry and serious"
+ "open mouth"
+ "very sad"
Application: Video Editing
AnimateZero can also be used for better video editing, both of generated videos and of real videos. A common use of AnimateDiff (AD) is to assist ControlNet (CN) in video editing, but this combination still suffers from the domain gap problem. AnimateZero (AZ) has clear advantages here: it generates videos with higher subjective quality and better alignment with the given text prompt.
Original Video
CN+AD
CN+AZ (ours)
Original Video
CN+AD
CN+AZ (ours)
"A girl swimming in lava"
"A girl is running in the forest, grassland"
"A candle is burning purple flames by the seaside"
"cute cat, colorful fur"
"A girl is dancing"
"a driving red car"
"A woman with red glasses"
"A young woman"
Application: Frame Interpolation
Extending the techniques proposed in AnimateZero, we can simultaneously insert a generated first frame and a generated last frame, realizing a gradual transition between the two, i.e., frame interpolation.
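The sketch below shows one plausible way such interpolation could be realized: anchoring the latents of the two given keyframes during sampling. Here denoise_step is a hypothetical placeholder, and the actual AnimateZero extension may differ.

```python
# Illustrative keyframe-anchoring loop, not the paper's exact procedure.
import torch

def denoise_step(latents: torch.Tensor, t: int) -> torch.Tensor:
    """Hypothetical one-step denoiser of a pre-trained video diffusion model."""
    raise NotImplementedError("replace with the real sampler step")

def interpolate_frames(first_latent: torch.Tensor,
                       last_latent: torch.Tensor,
                       num_frames: int = 16,
                       num_steps: int = 25) -> torch.Tensor:
    # Latents for all frames: (num_frames, channels, height, width).
    latents = torch.randn(num_frames, *first_latent.shape)
    for t in reversed(range(num_steps)):
        # Anchor both ends so the video starts and ends at the given keyframes
        # (a full implementation would noise them to the current timestep).
        latents[0] = first_latent
        latents[-1] = last_latent
        latents = denoise_step(latents, t)
    return latents
```

With this interface, looped video generation (below) corresponds to calling the same routine with identical first and last frames.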
First Frame
Last Frame
Output Video
First Frame
Last Frame
Output Video
Application: Looped Video Generation
Looped video generation is a special case of frame interpolation in which the inserted first and last frames are the same image.
Generated Image
Output Video
Generated Image
Output Video
Generated Image
Output Video
Application: Real Image Animation
AnimateZero also has the potential to animate real images, although the domain gap problem still exists: results are limited by the domain of the T2I model used.
Real Image
Output Video
Real Image
Output Video
Real Image
Output Video
BibTeX
@misc{yu2023animatezero,
title={AnimateZero: Video Diffusion Models are Zero-Shot Image Animators},
author={Yu, Jiwen and Cun, Xiaodong and Qi, Chenyang and Zhang, Yong and Wang, Xintao and Shan, Ying and Zhang, Jian},
journal={arXiv preprint arXiv:2312.03793},
year={2023}
}
Project page template is borrowed from AnimateDiff.