[ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation"
A better AR image generation paradigm and transformer model structure based on 2D autoregression, generating images of higher quality without increasing the computation budget.
A spark of vision-language intelligence: for the first time, unconditional rich-text image generation is enabled, outperforming diffusion models such as DDPM and Stable Diffusion on dedicated rich-text image datasets and highlighting the distinct advantage of autoregressive models for multimodal modeling.
First, download all VQ tokenizers and model checkpoints from 🤗 Hugging Face.
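If you prefer the command line, the snippet below is a minimal sketch using the Hugging Face CLI; <HF_REPO_ID> and the local directory are placeholders, not actual ids from this project — substitute the tokenizer/model repositories linked above.

pip install -U "huggingface_hub[cli]"
# <HF_REPO_ID> is a placeholder -- substitute the checkpoint repo linked above
huggingface-cli download <HF_REPO_ID> --local-dir ./checkpoints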
Sampling ImageNet Examples and Computing FID
cd ./src
bash ./scripts/sampling_dnd_transformer_imagenet.sh # edit the paths to the VQ model checkpoint and the DnD-Transformer checkpoint
# An .npz file is saved after generating 50k images; follow https://github.com/openai/guided-diffusion/tree/main/evaluations to compute the FID of the generated samples.
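For reference, here is a hedged sketch of the FID computation with OpenAI's evaluator; the reference-batch URL follows the files published in the guided-diffusion repository, and the path to your generated .npz is a placeholder.

git clone https://github.com/openai/guided-diffusion
cd guided-diffusion/evaluations
pip install -r requirements.txt
# ImageNet 256x256 reference batch published with guided-diffusion
wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz
# first argument: reference batch; second argument: your generated samples
python evaluator.py VIRTUAL_imagenet256_labeled.npz /path/to/generated_50k.npz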
Sampling Text-Image Examples
cd ./src
bash ./scripts/sampling_dnd_transformer_text_image.sh # edit the paths to the VQ model checkpoint and the DnD-Transformer checkpoint
Sampling arXiv-Image Examples
cd ./src
bash ./scripts/sampling_dnd_transformer_arxiv_image.sh # edit the paths to the VQ model checkpoint and the DnD-Transformer checkpoint
Training
Training VQVAEs
Please refer to the RQVAE codebase for training the multi-depth VQVAEs.
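A minimal getting-started sketch, assuming the kakaobrain RQ-VAE implementation is the intended codebase; the exact training command and configs should be taken from its README, so only the entry point is shown here.

git clone https://github.com/kakaobrain/rq-vae-transformer
cd rq-vae-transformer
# stage-1 tokenizer training lives in main_stage1.py; print its options
# rather than guessing flags here
python main_stage1.py --help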
Extract Codes for Training
cd ./src
bash ./scripts/extract_codes_tencrop_c2i.sh
Training DnD-Transformers
cd ./src
bash ./scripts/train_dnd_transformer_imagenet.sh
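Like the sampling scripts, the training script likely needs its paths edited first (an assumption; check the script itself). A simple way to inspect and then launch:

cd ./src
# print the head of the script to find the paths (extracted codes, VQ
# checkpoint, output dir) that may need editing, then launch
sed -n '1,40p' ./scripts/train_dnd_transformer_imagenet.sh
bash ./scripts/train_dnd_transformer_imagenet.sh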
Thanks to RQVAE and LlamaGen for providing their open-source codebases.
Reference
@misc{chen2024sparkvisionlanguageintelligence2dimensional,
title={A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation},
author={Liang Chen and Sinan Tan and Zefan Cai and Weichu Xie and Haozhe Zhao and Yichi Zhang and Junyang Lin and Jinze Bai and Tianyu Liu and Baobao Chang},
year={2024},
eprint={2410.01912},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.01912},
}