Introducing New Modality to Frozen Large Language Models
——— ICCV 2025 ———
📜 Abstract
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM’s parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
⚡ X-Fusion Capabilities
📐 X-Fusion
Model Architecture
We propose X-Fusion to address two key challenges: (i) retaining the language abilities of the pretrained LLM while (ii) equipping it with image generation and understanding capabilities. To achieve this, we introduce a Dual-Tower architecture that freezes all language weights and adds separate vision weights in each layer to process visual information for the LLM. This approach aligns text and vision features not only at the input or output level, but also at intermediate processing levels.
To show the effectiveness of our approach, we also implement alternative transformer block variants for multimodal integration, including (a) Single Tower, (b) Gated Layer, and (c) Dual Projection. Among these, we find that the Dual-Tower architecture achieves the best performance on both image generation and understanding tasks.
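As a reference point for how the Dual-Tower routing works, below is a minimal PyTorch-style sketch of one X-Fusion layer. The class name, dimensions, and the routing logic are illustrative assumptions rather than the released code; what it captures is that the frozen language block and the trainable vision block each attend over the full joint token sequence, with text positions keeping the language output and image positions keeping the vision output.

```python
import torch
import torch.nn as nn

class DualTowerBlock(nn.Module):
    """One X-Fusion layer (illustrative sketch, not the official implementation).

    The frozen language block and the trainable vision block each process the
    entire token sequence (text + image); outputs are then routed by modality.
    """

    def __init__(self, lang_block: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.lang_block = lang_block                 # pretrained LLM layer, kept frozen
        for p in self.lang_block.parameters():
            p.requires_grad = False

        # Vision tower layer: a fresh transformer block with its own trainable weights.
        self.vision_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, hidden: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # hidden:   (batch, seq_len, dim) joint text+image token features
        # is_image: (batch, seq_len) bool mask, True where the token is visual
        lang_out = self.lang_block(hidden)           # frozen language processing
        vis_out = self.vision_block(hidden)          # trainable vision processing

        # Text positions keep the language-tower output; image positions keep
        # the vision-tower output.
        return torch.where(is_image.unsqueeze(-1), vis_out, lang_out)
```

Because the vision tower is a separate stack with its own weights, its design need not mirror the language tower, a property we exploit later when initializing it from a pretrained DiT.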
Training Recipe
Alongside the model architecture, we also study the training recipe for X-Fusion. We focus on two key questions, illustrated by the sketch after this list:
- How does the noise level affect performance in a joint training setting?
- Does multitask training provide mutual benefits between tasks?
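The sketch below shows one plausible shape of such a joint training step: generation samples follow a standard diffusion objective at sampled noise levels, while understanding samples use clean images with a next-token captioning loss. The model call signature, the scheduler methods, and the loss-mixing weight `lambda_und` are hypothetical placeholders, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, gen_batch, und_batch, noise_scheduler, lambda_und=1.0):
    """One joint step over a generation batch and an understanding batch (sketch)."""
    # --- Text-to-image (generation) branch: standard denoising objective ---
    images, text_tokens = gen_batch
    t = torch.randint(0, noise_scheduler.num_steps, (images.size(0),), device=images.device)
    noise = torch.randn_like(images)
    noisy_images = noise_scheduler.add_noise(images, noise, t)   # hypothetical scheduler API
    pred_noise = model(text_tokens, noisy_images, t, task="generation")
    loss_gen = F.mse_loss(pred_noise, noise)

    # --- Image-to-text (understanding) branch: clean (or lightly noised) images ---
    images_u, caption_tokens = und_batch
    logits = model(caption_tokens, images_u, t=None, task="understanding")
    loss_und = F.cross_entropy(
        logits[:, :-1].flatten(0, 1), caption_tokens[:, 1:].flatten()
    )

    # Weighted sum of the two objectives; the weighting is an assumption here.
    return loss_gen + lambda_und * loss_und
```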


🖼️ Experimental Results
Text-to-Image Generation

Image Captioning

Fine-tuning X-Fusion on Downstream Tasks
Can X-Fusion extend its capabilities to other downstream vision-and-language tasks? We fine-tuned our model on four tasks simultaneously (image editing, localization, in/out-painting, and visual question answering) using internal datasets for 20K training steps. In the following figure, we demonstrate that our unified X-Fusion model can handle multiple tasks without requiring task-specific weights.
Transfer from Pretrained Diffusion Model
One main advantage of our dual-tower design is that it allows non-identical block designs in the language and vision towers, since each block in both towers processes the entire feature sequence independently. With this advantage, we can transfer image generation capability from a large-scale pretrained diffusion model built on diffusion transformers. We trained a variation of X-Fusion using a Llama3 model as the language tower and an in-house pretrained text-to-image DiT model as its vision tower, denoted as X-Fusion (Pretrained DiT), for 50K steps. We show that X-Fusion (Pretrained DiT) achieves stronger image generation capability than the vanilla X-Fusion.
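To make the tower-swapping idea concrete, here is a minimal sketch of how a vision tower could be assembled from pretrained DiT blocks. The helper name, the linear adapters, and the assumption that each block maps a single tensor to a tensor (timestep and conditioning inputs omitted) are simplifications for illustration, not the released implementation.

```python
import torch.nn as nn

def build_vision_tower_from_dit(dit_blocks, llm_dim: int, dit_dim: int) -> nn.Module:
    """Build an X-Fusion vision tower from pretrained DiT blocks (illustrative sketch).

    Because each tower processes the full token sequence independently, the
    vision tower does not need to match the language tower's width or depth.
    """
    return nn.Sequential(
        nn.Linear(llm_dim, dit_dim),   # project LLM-width features into DiT width
        *dit_blocks,                   # reuse pretrained DiT transformer blocks
        nn.Linear(dit_dim, llm_dim),   # project back so outputs can be routed with text tokens
    )
```

As in the vanilla X-Fusion, the Llama3 language tower stays frozen while the vision tower (here, the reused DiT blocks and adapters) is trained.

📥 BibTeX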
@article{mo2025xfusion,
title={X-fusion: Introducing new modality to frozen large language models},
author={Mo, Sicheng and Nguyen, Thao and Huang, Xun and Iyer, Siddharth Srinivasan and Li, Yijun and Liu, Yuchen and Tandon, Abhishek and Shechtman, Eli and Singh, Krishna Kumar and Lee, Yong Jae and others},
journal={arXiv preprint arXiv:2504.20996},
year={2025}
}
💌 Acknowledgement
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.