Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
NeurIPS 2025
Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang2, Hao He1,2, Xiangyu Yue1,‡, Lu Jiang2,‡
Abstract
This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.
Method
Framework. Tar is a unified multimodal LLM for both visual understanding and generation. It consists of an autoregressive LLM, a visual tokenizer TA-Tok, and a visual de-tokenizer. Different from previous works, Tar leverages fully discrete, text-aligned visual tokens, eliminating the need for modality-specific designs such as visual projectors. Tar can therefore be trained with the standard next-token prediction objective.
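To make this concrete, below is a minimal sketch (not the released training code) of how an image-text pair reduces to ordinary next-token prediction once TA-Tok's indices are shifted into the expanded vocabulary. The names llm, text_ids, visual_ids, and the vocabulary offset value are illustrative placeholders.

import torch
import torch.nn.functional as F

# Assumed offset: visual tokens are appended after the base text vocabulary
# (Qwen2 uses a vocabulary of roughly this size; the exact value is illustrative).
VISUAL_TOKEN_OFFSET = 151_936

def next_token_loss(llm, text_ids: torch.Tensor, visual_ids: torch.Tensor):
    # Shift TA-Tok indices into the expanded vocabulary range and concatenate
    # text and visual tokens into a single sequence.
    seq = torch.cat([text_ids, visual_ids + VISUAL_TOKEN_OFFSET], dim=1)
    inputs, targets = seq[:, :-1], seq[:, 1:]
    # Standard causal LM forward pass and cross-entropy loss; no modality-specific head.
    logits = llm(input_ids=inputs).logits
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))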
Visual Tokenizer. The key design of Tar is a text-aligned visual tokenizer, TA-Tok. It adds a vector quantization module to a pretrained SigLIP2 encoder, converting input images into semantic, discrete tokens. Unlike other discrete tokenizers (e.g., VQVAE), TA-Tok directly leverages the LLM's token embeddings as its codebook, so each visual token can be represented as a transformed LLM token. Training a unified MLLM with TA-Tok is therefore similar to teaching the LLM a foreign language.
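The following is an illustrative sketch of the text-aligned quantization idea, assuming a SigLIP2-style patch encoder. The class and variable names are ours for exposition, not the released TA-Tok API.

import torch
import torch.nn as nn

class TextAlignedQuantizer(nn.Module):
    """Conceptual sketch: the codebook is a projection of the LLM's token
    embeddings into the vision feature space, and encoder patch features
    are snapped to their nearest codebook entry."""

    def __init__(self, llm_embeddings: torch.Tensor, dim: int):
        super().__init__()
        # Source embeddings come from the LLM and are kept fixed here;
        # only the projection into the vision feature space is learned.
        self.register_buffer("llm_embeddings", llm_embeddings)
        self.proj = nn.Linear(llm_embeddings.size(1), dim)

    def forward(self, vision_feats: torch.Tensor):
        # vision_feats: (B, N, dim) patch features from a SigLIP2-style encoder.
        codebook = self.proj(self.llm_embeddings)                    # (V, dim)
        b, n, d = vision_feats.shape
        dists = torch.cdist(vision_feats.reshape(b * n, d), codebook)
        indices = dists.argmin(dim=-1).reshape(b, n)                 # discrete, text-aligned tokens
        quantized = codebook[indices]                                # (B, N, dim)
        # Straight-through estimator so gradients flow to the encoder during training.
        quantized = vision_feats + (quantized - vision_feats).detach()
        return quantized, indices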
De-Tokenizer. Since the visual tokenizer TA-Tok is fully text-aligned, it cannot decode images directly like a VQVAE. Instead, we propose visual de-tokenizers that decode visual tokens back to images. We provide two variants: an autoregressive model and a diffusion-based model. The AR de-tokenizer works well with discrete visual tokens from TA-Tok, while the diffusion-based de-tokenizer can leverage pretrained models for fast adaptation.
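As a rough sketch of the autoregressive variant (the class name and the generate interface below are hypothetical, not the released de-tokenizer), the semantic tokens are embedded and used as a condition for a pixel-level autoregressive generator:

import torch
import torch.nn as nn

class ARDeTokenizer(nn.Module):
    """Conceptual sketch only: a pixel-token generator conditioned on
    TA-Tok's semantic tokens; names are illustrative."""

    def __init__(self, semantic_vocab: int, dim: int, pixel_decoder: nn.Module):
        super().__init__()
        self.cond_embed = nn.Embedding(semantic_vocab, dim)  # embed semantic token indices
        self.pixel_decoder = pixel_decoder                   # AR model over low-level VQ pixel tokens

    @torch.no_grad()
    def decode(self, semantic_indices: torch.Tensor) -> torch.Tensor:
        cond = self.cond_embed(semantic_indices)             # (B, N, dim) conditioning sequence
        # The pixel decoder samples low-level tokens given the semantic condition,
        # and its VQ decoder maps them to an image tensor (interface assumed).
        return self.pixel_decoder.generate(condition=cond)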
Implementation
The trained Tar model is a standard LLM with an expanded visual vocabulary. As shown in the code below, the finetuned Qwen2 model can understand and generate TA-Tok's discrete tokens. The architecture of Qwen2 is unchanged; we only feed TA-Tok's discrete tokens to Qwen2 and decode the generated visual tokens with the de-tokenizer.
from PIL import Image
from torchvision.transforms.functional import to_tensor
from transformers import AutoTokenizer, Qwen2ForCausalLM

from tok.ta_tok import TextAlignedTokenizer


# I2TConfig is a small config (model_path, ta_tok_path) defined elsewhere in the repo.
class ImageToTextInference:
    def __init__(self, config: I2TConfig):
        self.config = config
        # Tar is a standard Qwen2 LLM with an expanded visual vocabulary.
        self.model = Qwen2ForCausalLM.from_pretrained(config.model_path)
        self.text_tokenizer = AutoTokenizer.from_pretrained(config.model_path)
        # TA-Tok converts images into discrete, text-aligned token indices.
        self.visual_tokenizer = TextAlignedTokenizer.from_checkpoint(
            config.ta_tok_path, load_teacher=False, input_type='indices')

    def generate(self, image_path: str, prompt: str) -> str:
        image = Image.open(image_path).convert('RGB')
        image = to_tensor(image).unsqueeze(0)
        # Encode the image into discrete visual token indices.
        image_code = self.visual_tokenizer(image)['encoded']
        # Render each index as a special <I*> token in the expanded vocabulary.
        image_text = "".join(f"<I{x}>" for x in image_code[0].cpu().tolist())
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{image_text}\n{prompt}"}]
        input_text = self.text_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
        inputs = self.text_tokenizer(input_text, return_tensors="pt")
        gen_ids = self.model.generate(
            inputs.input_ids, max_new_tokens=256, do_sample=True)
        # Return only the newly generated answer, not the echoed prompt.
        return self.text_tokenizer.batch_decode(
            gen_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
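A possible usage of the class above; the config fields match the snippet, but the checkpoint paths are placeholders, not released file names.

# Hypothetical usage with placeholder paths.
config = I2TConfig(
    model_path="checkpoints/tar-7b",        # finetuned Qwen2 with visual vocabulary
    ta_tok_path="checkpoints/ta_tok.pth")   # TA-Tok tokenizer weights
vqa = ImageToTextInference(config)
print(vqa.generate("example.jpg", "Describe this image in detail."))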
import re

import torch
from PIL import Image
from transformers import AutoTokenizer, Qwen2ForCausalLM

from tok.mm_autoencoder import MMAutoEncoder


# T2IConfig is a small config (model_path, de-tokenizer settings) defined elsewhere in the repo.
class TextToImageInference:
    def __init__(self, config: T2IConfig):
        self.config = config
        self.model = Qwen2ForCausalLM.from_pretrained(self.config.model_path)
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_path)
        # MMAutoEncoder bundles TA-Tok with the generative de-tokenizer;
        # tok_config (defined elsewhere) holds their settings.
        self.visual_tokenizer = MMAutoEncoder(**tok_config).eval()

    def generate_image(self, prompt: str) -> Image.Image:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}]
        input_text = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
        inputs = self.tokenizer(input_text, return_tensors="pt")
        # Generate up to 729 visual tokens for one image.
        gen_ids = self.model.generate(
            inputs.input_ids, max_new_tokens=729, do_sample=True)
        gen_text = self.tokenizer.batch_decode(gen_ids)[0]
        # Parse the generated <I*> tokens back into integer indices.
        gen_code = [int(x) for x in re.findall(r'<I(\d+)>', gen_text)]
        gen_code = torch.tensor(gen_code).unsqueeze(0)
        # The de-tokenizer maps the discrete indices back to pixels.
        gen_tensor = self.visual_tokenizer.decode_from_encoder_indices(gen_code)
        return Image.fromarray(gen_tensor[0].numpy())
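And a corresponding usage sketch for text-to-image generation, again with placeholder paths.

# Hypothetical usage with placeholder paths.
config = T2IConfig(model_path="checkpoints/tar-7b")
pipe = TextToImageInference(config)
image = pipe.generate_image("A corgi surfing a wave at sunset.")
image.save("corgi.png")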
Experiment
Results on Visual Understanding Benchmarks
* Token: Token type, including Continuous (C), Discrete (D), Semantic (S), Pixel (P) and Hybrid (H).
| Model | LLM Size | Token | POPE↑ | MME-P↑ | MME-C↑ | MMB↑ | SEED↑ | GQA↑ | MMMU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Show-o | 1.3B | D,P | 80.0 | 1097 | 248 | - | - | 58.0 | 26.7 |
| Harmon | 1.5B | C,H | 87.6 | 1155 | 321 | 65.5 | 67.1 | 58.9 | 38.9 |
| Janus | 1.5B | C,S | 87.0 | 1338 | 222 | 69.4 | 63.7 | 59.1 | 30.5 |
| Janus-Pro | 1.5B | C,S | 86.2 | 1444 | 268 | 75.5 | 68.3 | 59.3 | 36.3 |
| D-Dit | 2.0B | C,P | 84.0 | 1125 | - | - | - | 59.2 | - |
| Tar (Ours) | 1.5B | D,S | 88.4 | 1390 | 342 | 65.6 | 70.4 | 61.1 | 36.0 |
| ILLUME | 7B | C,S | 88.5 | 1445 | - | 65.1 | 72.9 | - | 38.2 |
| Chameleon | 7B | D,P | - | - | - | - | - | - | 22.4 |
| LWM | 7B | D,P | 75.2 | - | - | - | - | 44.8 | - |
| Liquid | 7B | D,P | 81.1 | 1119 | - | - | - | 58.4 | - |
| UniTok | 7B | D,H | 83.2 | 1448 | - | - | 61.1 | - | - |
| VILA-U | 7B | D,H | 85.8 | 1402 | - | - | 59.0 | 60.8 | - |
| Janus-Pro | 7B | C,S | 87.4 | 1567 | 260 | 79.2 | 72.1 | 62.0 | 41.0 |
| MetaMorph | 8B | C,S | - | - | - | 75.2 | 71.8 | - | 41.8 |
| Tar (Ours) | 7B | D,S | 87.8 | 1571 | 355 | 74.4 | 73.0 | 61.3 | 39.0 |
Results on Visual Generation Benchmarks
| Method | GenEval: Two Obj. | GenEval: Counting | GenEval: Color Attri. | GenEval: Overall↑ | DPG: Entity | DPG: Attribute | DPG: Relation | DPG: Overall↑ |
|---|---|---|---|---|---|---|---|---|
| LWM-7B | 0.41 | 0.46 | 0.15 | 0.47 | - | - | - | - |
| SEED-X-13B | 0.58 | 0.26 | 0.14 | 0.49 | - | - | - | - |
| Show-o-1.3B | 0.52 | 0.49 | 0.28 | 0.53 | - | - | - | - |
| Transfusion-7B | - | - | - | 0.63 | - | - | - | - |
| D-DiT-2B | 0.80 | 0.54 | 0.50 | 0.65 | - | - | - | - |
| ILLUME-7B | 0.86 | 0.45 | 0.28 | 0.61 | - | - | - | - |
| Janus-1.3B | 0.68 | 0.30 | 0.42 | 0.61 | 87.38 | 87.70 | 85.46 | 79.68 |
| Janus-Pro-1B | 0.82 | 0.51 | 0.56 | 0.73 | 88.63 | 88.17 | 88.98 | 82.63 |
| Harmon-1.5B | 0.86 | 0.57 | 0.48 | 0.76 | - | - | - | - |
| Janus-Pro-7B | 0.89 | 0.59 | 0.66 | 0.80 | 88.90 | 89.40 | 89.32 | 84.19 |
| Tar-1.5B | 0.91 | 0.76 | 0.51 | 0.76 | 89.35 | 86.91 | 93.50 | 82.96 |
| Tar-1.5B + Self Reflect | 0.92 | 0.77 | 0.55 | 0.78 | 88.48 | 87.83 | 93.38 | 84.10 |
| Tar-7B | 0.92 | 0.83 | 0.65 | 0.84 | 88.62 | 88.05 | 93.98 | 84.19 |
| Tar-7B + Self Reflect | 0.93 | 0.86 | 0.70 | 0.85 | 88.60 | 88.78 | 93.59 | 84.65 |
BibTeX
If you find our work useful, please cite our paper. BibTeX code is provided below:
@article{han2025tar,
  title={Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations},
  author={Han, Jiaming and Chen, Hao and Zhao, Yang and Wang, Hanyu and Zhao, Qi and Yang, Ziyan and He, Hao and Yue, Xiangyu and Jiang, Lu},
  journal={arXiv preprint arXiv:2506.18898},
  year={2025},
}