LAFITE: Language-Free Text-to-Image Generation
Conventional text-to-image training requires vast image-caption pairs. LAFITE eliminates this dependency by training on image-only data, addressing the high cost of annotation and enabling model scalability in low-resource domains. LAFITE uses CLIP to extract image features and generates pseudo-text features through noise perturbation. These are injected into a StyleGAN2 generator, and contrastive losses ensure alignment between image outputs and pseudo-text embeddings—without using real captions. LAFITE achieves competitive performance on MS-COCO and outperforms DALL-E using only 1% of its model size and training data. This approach opens new paths for building efficient, domain-specific models from unlabeled image collections.
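A minimal sketch of the language-free trick, assuming normalized CLIP embeddings: pseudo-text features are CLIP image features perturbed with scaled Gaussian noise, and an InfoNCE-style contrastive loss aligns generated images with those pseudo captions. The perturbation scale `xi` and temperature `tau` are illustrative choices, not LAFITE's exact hyperparameters.

```python
# Hedged sketch of LAFITE-style pseudo-text features: perturb the CLIP image
# embedding with scaled Gaussian noise so it lands near where a real caption
# embedding would fall in the shared CLIP space.
import torch
import torch.nn.functional as F

def pseudo_text_features(img_feat: torch.Tensor, xi: float = 0.1) -> torch.Tensor:
    """img_feat: (B, D) CLIP image embeddings; returns noisy pseudo-text features."""
    img_feat = F.normalize(img_feat, dim=-1)
    noise = F.normalize(torch.randn_like(img_feat), dim=-1)
    # Add unit-norm noise at scale xi, then renormalize (illustrative variant).
    return F.normalize(img_feat + xi * noise, dim=-1)

def contrastive_loss(gen_img_feat: torch.Tensor, pseudo_txt: torch.Tensor, tau: float = 0.07):
    """InfoNCE-style loss aligning generated-image features with their pseudo captions."""
    logits = gen_img_feat @ pseudo_txt.t() / tau               # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```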
Customization Assistant for Text-to-Image Generation
Users often want to generate images containing new or personalized concepts (e.g., pets, faces) without retraining models. Existing solutions require fine-tuning, which is slow and resource intensive. This work proposes an interactive, efficient alternative. The proposed system, CAFE, combines a multimodal LLM and a diffusion model. It takes a user-provided image and text prompt, infers the intended concept using the LLM, and generates a conditioned image. It supports multi-turn dialogue and provides natural language explanations. CAFE enables real-time, fine-tuning-free personalization and outperforms existing non-finetuned baselines on benchmarks like DreamBench. Its conversational interface enhances usability and aligns well with user intent, moving text-to-image generation toward more intuitive applications.
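A rough sketch of the described flow, assuming hypothetical objects `mllm` and `diffusion` with `infer_concept` and `generate` methods; it illustrates the fine-tuning-free hand-off from the inferred concept embedding to the diffusion model, not the released CAFE code.

```python
# Hedged sketch of a CAFE-style turn: the multimodal LLM reads the reference
# image and instruction, infers a concept embedding plus an explanation, and
# the diffusion model is conditioned on both the prompt and that embedding.
from dataclasses import dataclass

@dataclass
class Turn:
    user_image: object   # reference image of the new concept (e.g., a pet)
    user_text: str       # instruction, e.g. "put my dog on a beach"

def personalize(turn: Turn, mllm, diffusion):
    # 1. Infer the intended concept and a natural-language explanation.
    concept_emb, explanation = mllm.infer_concept(turn.user_image, turn.user_text)
    # 2. Generate conditioned on text + concept; no per-concept fine-tuning.
    image = diffusion.generate(prompt=turn.user_text, concept=concept_emb)
    return image, explanation
```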
ARTIST: Improving the Generation of Text-Rich Images
Existing diffusion models generate realistic visuals but struggle to render legible text, limiting applications like graphic design or signage. ARTIST addresses this by enabling accurate text rendering in generated images without sacrificing overall quality. ARTIST proposes a two-stage diffusion pipeline: a textual diffusion model is trained on synthetic data to learn text structure, while a visual diffusion model integrates this text knowledge via feature injection. A large language model (LLM) further guides the system by identifying textual content in prompts. ARTIST achieves significantly improved readability of text in generated images and outperforms prior models by up to 15% on dedicated benchmarks. The architecture’s disentangled design enables focused improvements and practical deployment in text-sensitive domains.
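A hedged sketch of the feature-injection idea: features from the textual diffusion branch are added into matching blocks of the visual diffusion model through small trainable adapters. The zero-initialized 1x1 convolution is an assumption borrowed from ControlNet-style conditioning, not necessarily ARTIST's exact operator.

```python
# Hedged sketch: inject text-structure features into the visual UNet block
# through a zero-initialized projection so the merge starts as a no-op and
# visual quality is preserved early in training.
import torch
import torch.nn as nn

class FeatureInjector(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, visual_feat: torch.Tensor, textual_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat, textual_feat: (B, C, H, W) features from matching blocks
        return visual_feat + self.proj(textual_feat)
```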
LLaVA-Reward: Multimodal Reward Modeling for T2I Evaluation
Evaluating text-to-image (T2I) outputs across multiple criteria—alignment, safety, fidelity—is labor-intensive and inefficient. Existing models rely heavily on prompts or token scoring, limiting scalability. LLaVA-Reward addresses this by using hidden states of pretrained MLLMs to provide multi-perspective evaluations efficiently. LLaVA-Reward augments a lightweight MLLM (e.g., Phi-3.5-vision) with LoRA adapters and a novel Skip-connection Cross-Attention (SkipCA) module. It processes image-text pairs through a visual encoder, predicts scalar reward scores from the EOS token's hidden state, and is trained for preference learning with a pairwise Bradley-Terry ranking loss. LLaVA-Reward delivers state-of-the-art performance on MJ-Bench, TIFA160, and UnsafeBench, outperforming CLIP-based and VQA-based models while offering better inference-time efficiency. It also improves image generation quality through diffusion inference-time scaling. The model adapts to different evaluation perspectives via LoRA, making it practical for scalable reward modeling in T2I tasks.
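A minimal sketch of the reward head and pairwise loss described above: a linear head maps the EOS token's hidden state to a scalar reward, and preferred/rejected pairs are trained with the Bradley-Terry objective. The head and hidden-size details are illustrative, not the exact LLaVA-Reward implementation.

```python
# Hedged sketch: scalar reward from the EOS hidden state, plus the
# Bradley-Terry pairwise ranking loss -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, eos_hidden: torch.Tensor) -> torch.Tensor:
        # eos_hidden: (B, H) hidden state of the EOS token from the MLLM
        return self.score(eos_hidden).squeeze(-1)    # (B,) scalar rewards

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the probability that the preferred image-text pair scores higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```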
Thoughts on Multimodal Large Language Models
Post-Training Limits Their Potential
Chameleon and LLaMA-Fusion both build on top of pre-trained, text-only language models. These base LLMs are later adapted for multimodal tasks through fine-tuning. While this leverages strong language capabilities, it introduces limitations:
- Inherited Constraints: The base LLMs weren’t designed for images. Their architecture and data were optimized for text, so multimodal adaptation feels like a retrofit, limiting seamless integration of visual inputs.
- Performance Ceiling: Unified MLLMs often underperform specialized systems. Vision models like CLIP or generation models like Stable Diffusion excel at their specific tasks, while MLLMs must compromise. As a result, they typically lag behind SoTA models in both understanding and generation.
Bottlenecks in Chameleon and LLaMA-Fusion
- Chameleon’s VQ-VAE Bottleneck:
- Information Loss: The compression discards fine visual details—hurting generation quality.
- Scalability Limits: Expanding the codebook to capture more nuance requires heavy compute, capping performance.
- LLaMA-Fusion’s Pretrained Model Constraint:
- Misaligned multimodal tokenizer: Different modality encoders should share the same embedding space, but LLaMA-Fusion cannot fully bridge the gap between modalities.
- Limited Adaptability: Keeping LLM weights frozen preserves text skills but prevents deeper cross-modal alignment.
Multimodal Pretraining + Diffusion Head
- Multimodal Pretraining with Better Visual Tokenizers
- Train the LLM from scratch on both text and images. Instead of VQ-VAE, use more expressive visual tokenizers: continuous embeddings (e.g., from Vision Transformers) preserve richer visual detail. Pretraining on large multimodal datasets enables the model to learn aligned representations from the ground up. Such a unified tokenizer tackles a fundamental multimodal alignment problem, and the right architecture or training paradigm is still under exploration.
- Illume+ (https://illume-unified-mllm.github.io) provides an intermediate solution, where two visual encoders are used to balance semantic and pixel-level info.
- Add a Diffusion Head for Generation
- Once pretrained, attach a diffusion model to generate images: diffusion excels at detail and realism, outperforming VQ-based generation, and it offers better controllability, producing outputs that more accurately reflect text prompts. This setup combines deep understanding from the MLLM with high-fidelity generation from the diffusion head.
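A minimal sketch of one way to wire this up, assuming HuggingFace-style interfaces for the MLLM, UNet, and noise scheduler (`output_hidden_states`, `add_noise`, and `encoder_hidden_states` follow transformers/diffusers conventions and are assumptions, not a specific released model): the MLLM's last-layer hidden states act as conditioning for a standard noise-prediction diffusion loss.

```python
# Hedged sketch of "attach a diffusion head": condition a diffusion decoder on
# the MLLM's hidden states over the interleaved multimodal token sequence.
import torch
import torch.nn.functional as F

def diffusion_head_loss(mllm, unet, scheduler, tokens, clean_latents):
    # 1. Multimodal context from the pretrained MLLM (assumed HF-style output).
    context = mllm(tokens, output_hidden_states=True).hidden_states[-1]   # (B, T, H)
    # 2. Standard diffusion training step conditioned on that context
    #    (assumed diffusers-style scheduler and cross-attention UNet).
    noise = torch.randn_like(clean_latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (clean_latents.size(0),), device=clean_latents.device)
    noisy = scheduler.add_noise(clean_latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=context).sample
    return F.mse_loss(pred, noise)
```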
Classical Models vs. Modern Approaches and Feasible Solutions
- Classical Methods (e.g., T5 + Diffusion):
- They use text-trained encoders to drive generation, often leading to poor alignment and weak control—images may not match prompts well.
- Modern Unified Models (OpenAI GPT-4o and Google Gemini):
- Treat all inputs—text and vision—as tokens in a single sequence. This end-to-end architecture learns cross-modal dependencies natively, improving both understanding and controllability. It’s compute-heavy, but it works.
Summary
Current MLLMs are retrofitted and constrained, limited by tokenization (Chameleon) or rigidity (LLaMA-Fusion). Classical generation models struggle with alignment and world knowledge. Unified, token-level models like GPT-4o or Gemini point the way forward. A feasible path is to pretrain a multimodal LLM, then attach a diffusion head for top-tier generation quality and strong controllability.
LLaVAR: Visual Instruction Tuning for Text-Rich Images
LLaVAR is the first MLLM built for text-rich images, handling both text-rich and natural image understanding. LLaVAR extends the LLaVA architecture by targeting text-rich image understanding through data augmentation rather than architectural changes. Starting with 422K likely-textual images from LAION, we extracted OCR text and used GPT-4 to generate 16K multi-turn Q&A conversations, which were added to the instruction tuning set. This dataset significantly improved performance on text-based VQA benchmarks, achieving up to 20% accuracy gains.
LLaVAR’s experiments emphasized resolution’s importance for reading: small text is often lost with 224×224 encoders. To overcome this, we stacked CLIP encoders to simulate higher resolution and used external OCR/captioning tools to pre-process images. Text summaries were fed to the LLM alongside visual tokens, leading to a hybrid setup that improved reading without expanding the token budget. The study influenced later designs by showing how OCR tools and data-driven tuning alone can substantially boost an MLLM’s ability to “read” without changing its backbone.
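A minimal sketch of that hybrid input, with hypothetical helpers `run_ocr` and `encode_image`: the OCR text is folded into the textual prompt while the visual tokens are passed alongside, so the model can "read" without expanding the visual token budget.

```python
# Hedged sketch: combine external OCR results with the question as the text
# prompt, and pass CLIP-style visual tokens alongside. Helper names are
# placeholders, not LLaVAR's actual API.
def build_llavar_input(image, question, run_ocr, encode_image):
    visual_tokens = encode_image(image)             # CLIP-style image tokens
    ocr_text = " ".join(run_ocr(image))             # external OCR results
    prompt = (
        "Reference OCR tokens: " + ocr_text + "\n"  # textual summary of the image
        "Question: " + question
    )
    return visual_tokens, prompt
```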
TRINS and LaRA: Instruction Dataset + Tool-aware Model
LLaVAR uses LLMs to generate text-rich instruction data, but the quality is not always satisfactory since the LLMs are only provided with OCR words and the original short captions. To tackle this problem, TRINS constructed a large-scale dataset focused on text-rich images, combining human-written captions and LLM-generated QA pairs across 50K images. We spent significant effort on data sourcing from the LAION-Highres subset, using multiple models and heuristic rules.
LaRA, built on LLaVA, uses PaddleOCR to extract image text and injects it into the text prompt. This “OCR-as-input” strategy gives models direct access to text that would otherwise be missed due to vision resolution limits. The OCR results are merged with the instruction, and training freezes the vision encoder while tuning the LLM and projection layer. LaRA achieved SOTA on benchmarks like TextVQA and DocVQA, even performing well without OCR inputs. The success showed that combining high-quality data and lightweight text integration strategies can deliver strong reading capabilities without rearchitecting the entire model.
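A minimal sketch of the training setup described above, assuming a LLaVA-style model with `vision_encoder`, `projector`, and `llm` attributes (hypothetical names): the vision encoder stays frozen while the projection layer and LLM are tuned on OCR-augmented prompts.

```python
# Hedged sketch of the LaRA parameter-freezing scheme: freeze visual features,
# tune the projection layer and the LLM.
def set_lara_trainable(model):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False          # keep visual features fixed
    for p in model.projector.parameters():
        p.requires_grad = True           # align visual tokens with the LLM space
    for p in model.llm.parameters():
        p.requires_grad = True           # adapt the LLM to OCR-augmented prompts
    return model
```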
TRINS is useful for text-rich image generation and evaluation as well. We used this dataset to build MLLMs that reached SoTA on the OCRBench leaderboard twice, in 2023 and 2024.
LLaVA-Read: Dual Visual Encoders and OCR + Layout Awareness
LLaVA-Read addresses key MLLM limitations, namely low text resolution and lack of layout awareness, by using three encoders: a low-resolution ViT-based CLIP encoder, a high-resolution Conv-based CLIP encoder, and a high-resolution OCR pipeline serving as a visual text encoder. Features from the two CLIP encoders are merged via intermediate-layer fusion, folding high-resolution details into the visual tokens while keeping the token count constant. For the visual text encoder, OCR outputs are tokenized with special spatial markers and appended to the LLM input.
This dual-path setup enables the model to attend both to rich visual context and structured text information. A layout-aware training phase ensures better alignment across modalities. The model outperforms prior methods on complex benchmarks requiring both text comprehension and spatial reasoning. The hybrid design proved effective at selectively leveraging OCR for long text while still using visual cues for short labels and layout-sensitive tasks.
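A minimal sketch of the intermediate-layer fusion described above: high-resolution convolutional features are pooled down to the low-resolution ViT token grid and merged additively, so the visual token count stays constant. The additive merge and the 24x24 grid are illustrative assumptions, not LLaVA-Read's exact operator.

```python
# Hedged sketch: fold high-res conv features into the fixed-size grid of
# low-res ViT tokens without increasing the token count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateFusion(nn.Module):
    def __init__(self, conv_channels: int, vit_dim: int, grid: int = 24):
        super().__init__()
        self.grid = grid                              # e.g., 24x24 = 576 visual tokens
        self.proj = nn.Linear(conv_channels, vit_dim)

    def forward(self, vit_tokens: torch.Tensor, conv_feat: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (B, grid*grid, vit_dim) from the low-res ViT-based CLIP encoder
        # conv_feat:  (B, C, H, W) from the high-res Conv-based CLIP encoder
        pooled = F.adaptive_avg_pool2d(conv_feat, self.grid)   # (B, C, grid, grid)
        pooled = pooled.flatten(2).transpose(1, 2)             # (B, grid*grid, C)
        return vit_tokens + self.proj(pooled)                  # same token count
```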
MMR: Benchmarking MLLM Reading Comprehension in Images
MMR was introduced to expose gaps in reading capabilities of MLLMs, especially on tasks requiring more than simple OCR. It covers 11 types of tasks, from spatial reasoning to font identification and text grounding, with 550 human-written Q&A pairs.
Evaluations showed that many models performed poorly on visual text grounding, layout reasoning, or comparing multiple text blocks. Even top-performing MLLMs struggled. MMR highlights how existing benchmarks underestimated the difficulty of text-rich reasoning and provides a granular framework for evaluating improvements in models. Internally, it has become a key diagnostic tool for gauging “reading IQ” of MLLMs before deployment in document-oriented applications.
SV-RAG: Efficient Long-Doc QA with Visual Retrieval
SV-RAG addresses the challenge of answering questions over multi-page documents. Instead of using a separate retriever, it trains LoRA adapters within an MLLM to handle both retrieval and answering. A shared MLLM backbone is “switched” between retrieval and QA using adapter weights, enabling end-to-end visual RAG.
The retriever uses ColBERT-style late interaction across visual tokens to rank relevant pages. Critically, pages are treated as images, letting the model leverage both layout and visual cues. SV-RAG achieved large performance gains and 8× speedup on long-doc tasks like SlideVQA by avoiding exhaustive page-level inference. The approach scales well, adds minimal parameters, and enables vision-aware retrieval using the same model backbone—paving the way for efficient multimodal document understanding pipelines.
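A minimal sketch of ColBERT-style late interaction over visual tokens: each query token is matched to its most similar page token (MaxSim), and the per-token maxima are summed into a page relevance score. Shapes and normalization are illustrative, not SV-RAG's exact code.

```python
# Hedged sketch of late-interaction page scoring for visual retrieval.
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """
    query_emb: (Q, D) embeddings of the question tokens
    page_emb:  (P, D) embeddings of one page's visual tokens
    Returns a scalar relevance score for ranking pages.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(page_emb, dim=-1)
    sim = q @ p.t()                      # (Q, P) cosine similarities
    return sim.max(dim=-1).values.sum()  # MaxSim per query token, then sum
```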