You can run the inference code with the following command. Just add the trigger word "target" in front of the noun you would like to specify, as in the example below. You will find the output videos in the ./results folder.
We have tested the inference code on RTX 3090 and A100 GPUs.
python inference.py \
  --image_path assets/image.png \
  --mask_path assets/mask_0.png \
  --prompt "In a serene, well-lit kitchen with clean, modern lines, the woman reaches forward and picks up the target mug cup with her hand. She brings the target mug to her lips, taking a slow, thoughtful sip of the coffee, her gaze unfocused as if lost in contemplation. The steam from the coffee curls gently in the air, adding warmth to the quiet ambiance of the room."
Since our base model, CogVideoX, is trained on long prompts, prompt quality directly impacts output quality. Please refer to this guide from CogVideoX for prompt enhancement. The generated videos can still suffer from limitations, including object disappearances or implausible dynamics, so you may have to try multiple times to get the best result.
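The CogVideoX guide expands short user prompts into detailed, long-form descriptions with an LLM. As a rough illustration only (not the exact script from the guide), the sketch below uses the OpenAI Python client; the model name and system instructions are assumptions you should adapt, and the only TAViD-specific detail is preserving the "target" trigger word.

```python
# Rough sketch of LLM-based prompt enhancement, loosely following the idea
# in the CogVideoX prompt-optimization guide. The model name and system
# instructions are assumptions, not the guide's exact ones.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM = (
    "Expand a short video idea into a single detailed paragraph describing "
    "the scene, motion, lighting, and atmosphere, under 100 words. "
    "Keep the trigger word 'target' attached to the specified noun."
)

def enhance(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model works here
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

print(enhance("the woman picks up the target mug and drinks coffee"))
```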
Training and Dataset
We will soon release the training code and data.
Citation
If you find TAViD useful for your work, please consider citing:
@article{kim2025target,
  title={Target-Aware Video Diffusion Models},
  author={Kim, Taeksoo and Joo, Hanbyul},
  journal={arXiv preprint arXiv:2503.18950},
  year={2025}
}
Acknowledgements
We sincerely thank the authors of the following amazing works for their open-sourced code, models, and datasets: