This repository contains the official implementation for TACO: Taming Diffusion for in-the-wild Video Amodal Completion
Project Page | Paper | Weights | Dataset
conda create -n taco python=3.10
conda activate taco
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install git+https://github.com/OpenAI/CLIP.git
pip install git+https://github.com/Stability-AI/datapipelines.git
pip install -r requirements.txt
Download the checkpoint and put it under checkpoints.
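For reference, a minimal sketch of the expected layout; the filenames below are the ones the inference scripts expect (last.ckpt for normal videos, drive_last.ckpt for autonomous driving):
mkdir -p checkpoints
# After downloading, the folder should contain roughly:
#   checkpoints/last.ckpt        (384x384, normal videos)
#   checkpoints/drive_last.ckpt  (640x384, autonomous driving)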
We provide pre-processed examples under examples. If you want to use your own example, we recommend using ffmpeg to extract frames and SAM2 to obtain the visible masks throughout the video. The script segment_ui.py provided by Yu Liu offers a user-friendly UI for this; placing it under the SAM2 repository should work.
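For example, a minimal frame-extraction sketch with ffmpeg; the frame rate, output pattern, and folder names here are only illustrative, so match them to the layout used under examples:
mkdir -p my_example/frames
ffmpeg -i my_video.mp4 -vf fps=10 my_example/frames/%05d.png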
cd TACO-model
bash infer_single.sh
If you want to use the autonomous driving checkpoint, which runs at a different resolution, use:
bash infer_single_drive.sh
The checkpoints for normal videos (384x384) and autonomous driving (640x384) should be named last.ckpt and drive_last.ckpt respectively.
We highly recommend setting the num_samples parameter in the script to a large value; the results will be saved under the /output folder. Since the results are diverse and may not be very stable, sample multiple times and choose the most reasonable one.
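One quick way to compare the samples is to turn each output subfolder's frames into a short preview video; a hedged sketch, assuming the frames are saved as PNG files directly inside each subfolder (adjust the glob to the actual file naming):
for d in output/*/; do
    ffmpeg -framerate 10 -pattern_type glob -i "${d}*.png" "${d%/}_preview.mp4"
done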
Download OvO Dataset and Kubric Dataset for benchmarking.
bash infer_kubric.sh
bash infer_OvO.sh
Before running the inference scripts, revise the dataset path (data.params.dset_root) in configs/inference_vac_kubric.yaml and configs/inference_vac_OvO.yaml.
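A hedged sketch for pointing the configs at your local copies (GNU sed); this assumes the key literally appears as dset_root: in each YAML file, otherwise simply edit the files by hand:
sed -i 's|dset_root:.*|dset_root: /path/to/Kubric|' configs/inference_vac_kubric.yaml
sed -i 's|dset_root:.*|dset_root: /path/to/OvO|' configs/inference_vac_OvO.yaml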
The results will be saved under the /output folder in subfolders named {folder_name}_{sample_id} (for example, you should see a subfolder named 0000_0 for the Kubric benchmark). By default, we save the amodal RGB images along with a concatenated image containing both the original RGB image and the amodal RGB image.
First, download the Stable Video Diffusion checkpoint SVD (14 frames) and place it under the pretrained folder. We use the 14-frame version to save computational resources.
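A hedged download sketch using the Hugging Face CLI; the repository id stabilityai/stable-video-diffusion-img2vid and the filename svd.safetensors are assumptions based on the public 14-frame SVD release, and the model is gated, so accept the license on the model page and run huggingface-cli login first:
mkdir -p pretrained
huggingface-cli download stabilityai/stable-video-diffusion-img2vid svd.safetensors --local-dir pretrained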
Download the datasets OvO_Easy, OvO_Hard, and OvO_Drive together with their corresponding path files (for example, Easy_train.json and Easy_val.json for the OvO_Easy dataset). Unzip all the files. The data structure should be:
OvO_Easy/
    MVImgNet/
        0/
        1/
        ...
    SA-V/
        sav_000/
        sav_001/
        ...
Easy_train.json
Easy_val.json
Run training script:
bash train.sh
Before running the training script, revise the parameters in it accordingly, including data.params.dset_root, data.params.train_path, and data.params.val_path.
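A quick, hedged way to locate these parameters before editing; whether they live directly in train.sh or in the YAML config it launches depends on the script, so the search covers both:
grep -rnE "dset_root|train_path|val_path" train.sh configs/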
Note that this training script is configured for an 8-GPU system with 80GB of VRAM per GPU. If you have smaller GPUs, consider using a smaller per-GPU batch size together with gradient accumulation to obtain a similar effective batch size (effective batch size = per-GPU batch size × number of GPUs × gradient accumulation steps). If you want to debug to make sure everything is fine, please consider using the following script:
bash debug.sh
This should be fine with only one GPU.
If you want to continue training from the latest checkpoint, please consider using the following script:
bash train_continue.sh
We also provide a training script for OvO_Drive, which uses a different resolution:
bash train_drive.sh
This repository is based on Generative Camera Dolly. We would like to thank the authors of this work for publicly releasing their code.
@article{lu2025taco,
title={Taco: Taming diffusion for in-the-wild video amodal completion},
author={Lu, Ruijie and Chen, Yixin and Liu, Yu and Tang, Jiaxiang and Ni, Junfeng and Wan, Diwen and Zeng, Gang and Huang, Siyuan},
journal={arXiv preprint arXiv:2503.12049},
year={2025}
}
