InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
* Equal Contribution

Personalized text-to-image generation: given a set of images of the same concept, the model generates new scenes containing that concept while following the input prompts.
Recent advances in personalized image generation allow a pre-trained text-to-image model to learn a new concept from a set of images. However, existing personalization approaches usually require test-time finetuning for each concept, which is time-consuming and difficult to scale. We propose InstantBooth, a novel approach built upon pre-trained text-to-image models that enables instant text-guided image personalization without test-time finetuning. We achieve this with two major components. First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder. Second, to keep the fine details of the identity, we learn rich visual feature representations by introducing a few adapter layers into the pre-trained model. We train these components only on text-image pairs, without using paired images of the same concept. Compared to test-time finetuning-based methods such as DreamBooth and Textual Inversion, our model generates competitive results on unseen concepts in terms of language-image alignment, image fidelity, and identity preservation, while being 100 times faster.
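For concreteness, the PyTorch-style sketch below illustrates the first component; every name in it (ConceptImageEncoder, inject_concept_token, the feature dimensions) is an assumption for illustration, not the released implementation. It shows one plausible way a learnable image encoder could pool the concept images into a single textual token embedding that replaces the $\hat{V}$ placeholder in the prompt embedding.

```python
import torch
import torch.nn as nn

class ConceptImageEncoder(nn.Module):
    # Hypothetical sketch: pools features of the concept images into one
    # compact embedding living in the text-encoder embedding space.
    def __init__(self, image_dim=1024, text_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(image_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features):        # (batch, n_images, image_dim)
        pooled = image_features.mean(dim=1)    # average over the input concept images
        return self.proj(pooled)               # (batch, text_dim): the V-hat token

def inject_concept_token(prompt_embeddings, concept_embedding, vhat_index):
    # Replace the V-hat placeholder position in the frozen text-encoder output
    # with the learned concept embedding; all other word embeddings stay as-is.
    prompt_embeddings = prompt_embeddings.clone()
    prompt_embeddings[:, vhat_index] = concept_embedding
    return prompt_embeddings
```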
Model Structure
An overview of our approach. We first inject a unique identifier $\hat{V}$ into the original input prompt to obtain "Photo of a $\hat{V}$ person", where $\hat{V}$ represents the input concept. We then use the concept image encoder to convert the input images to a compact textual embedding, and a frozen text encoder to map the other words, forming the final prompt embeddings. We extract rich patch feature tokens from the input images with a patch encoder and inject them into the adapter layers for better identity preservation. The U-Net of the pre-trained diffusion model takes the prompt embeddings and the rich visual features as conditions to generate new images of the input concept. During training, only the image encoders and the adapter layers are trainable; all other parts are frozen. The model is optimized with only the reconstruction loss of the diffusion model. (We omit the object masks of the input images for simplicity.)
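The training recipe described above (frozen text encoder and U-Net, trainable image encoders and adapter layers, diffusion reconstruction loss) can be sketched roughly as follows. The function signature, the adapter_context argument, and the scheduler interface are assumptions for illustration, not the actual code.

```python
import torch
import torch.nn.functional as F

def training_step(unet, text_encoder, concept_encoder, patch_encoder,
                  images, concept_images, prompt_ids, vhat_index, scheduler):
    # Freeze the pre-trained parts; only the adapter layers inside the U-Net
    # (and the image encoders, created outside this function) stay trainable.
    text_encoder.requires_grad_(False)
    for name, p in unet.named_parameters():
        p.requires_grad_("adapter" in name)

    # Prompt embeddings from the frozen text encoder, with the V-hat slot
    # replaced by the compact concept embedding.
    prompt_emb = text_encoder(prompt_ids).clone()
    prompt_emb[:, vhat_index] = concept_encoder(concept_images)

    # Rich patch feature tokens that condition the adapter layers.
    patch_tokens = patch_encoder(concept_images)

    # Diffusion reconstruction loss: add noise at a random timestep and
    # ask the U-Net to predict that noise.
    noise = torch.randn_like(images)
    t = torch.randint(0, scheduler.num_timesteps, (images.size(0),),
                      device=images.device)
    noisy = scheduler.add_noise(images, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=prompt_emb,
                adapter_context=patch_tokens)
    return F.mse_loss(pred, noise)
```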
Visual Comparison with Other Methods

Visual comparison of our method with Textual Inversion and DreamBooth.
More Visual Results
Paper
Jing Shi, Wei Xiong, Zhe Lin, Hyun Joon Jung
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
(arXiv)
Acknowledgements
We thank Qing Liu for dataset preparation and He Zhang for object mask computation.
The template of this webpage is borrowed from Richard Zhang.