"Principal Components" Enable A New Language of Images
(ICCV 2025) A New Paradigm for Compact and Interpretable Image Representations
[Read the Paper] | [GitHub] | [Huggingface Tokenizer Demo] | [Huggingface Generation Demo]
Xin Wen1*,
Bingchen Zhao2*,
Ismail Elezi3,
Jiankang Deng4,
Xiaojuan Qi1
* Equal Contribution
1 University of Hong Kong
2 University of Edinburgh
3 Huawei London Research Centre
4 Imperial College London
Introduction & Motivation
Deep generative models have revolutionized image synthesis, but how we tokenize visual data remains an open question. While classical methods like Principal Component Analysis (PCA) introduced compact, structured representations, modern visual tokenizers, from VQ-VAE to SD-VAE, often prioritize reconstruction fidelity at the cost of interpretability and efficiency.
The Problem
- Lack of Structure: Tokens are arbitrarily learned, without an ordering that prioritizes important visual features first.
- Semantic-Spectrum Coupling: Tokens entangle high-level semantics with low-level spectral details, leading to inefficiencies in downstream applications.
Can we design a compact, structured tokenizer that retains the benefits of PCA while leveraging modern generative techniques?
Demo
Semanticist Tokenizer Demo
Semanticist AR Generation Demo
Key Contributions (What's New?)
- PCA-Guided Tokenization: Introduces a causal ordering where earlier tokens capture the most important visual details, reducing redundancy.
- Semantic-Spectrum Decoupling: Resolves the issue of semantic-spectrum coupling to ensure tokens focus on high-level semantic information.
- Diffusion-Based Decoding: Uses a diffusion decoder whose spectral autoregressive property naturally separates semantic and spectral content.
- Compact & Interpretable: Enables flexible token selection, where fewer tokens can still yield high-quality reconstructions.
Visualizing the Problem: Semantic-Spectrum Coupling
Existing methods fail to separate semantics from spectral details, leading to inefficiencies in token usage.
- Current Tokenizers: More tokens simultaneously increase both semantic content and low-level spectral details, making compression inefficient.
- Our Approach: Tokens capture semantics first, ensuring a coarse-to-fine hierarchical structure.
Power Spectrum Analysis (Visual)
This figure illustrates the semantic-spectrum coupling effect by comparing reconstructions from TiTok (top) and our method (bottom) using an increasing number of tokens.
- Top Row (TiTok):
- As more tokens are added, both semantic details and spectral power increase simultaneously.
- The power spectrum (red: GT, blue: reconstructed) shifts upward, showing spectral entanglement.
- Earlier reconstructions fail to preserve semantically meaningful details.
- Bottom Row (Ours - Semanticist):
- Our method maintains semantic clarity even with fewer tokens.
- The power spectrum remains consistent with the original image across different token counts, confirming spectral decoupling.
- Reconstructions follow a coarse-to-fine hierarchy, mirroring the global precedence effect in human vision.
This analysis demonstrates that Semanticist produces a structured latent space where tokens capture high-level semantic meaning first, avoiding spectral artifacts. The figure below gives more comparisons.
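The spectral comparison above can also be probed numerically. Here is a toy sketch (pure Python, not the paper's analysis code) that computes a 1D power spectrum via the DFT; the paper's figures apply the same idea to 2D images with a radial average. The two signals are hypothetical stand-ins for a ground-truth image row and a reconstruction with attenuated detail:

```python
import cmath

def power_spectrum(signal):
    """Power at each DFT frequency of a 1D signal.

    A toy stand-in for the radially averaged 2D spectra in the figure:
    if reconstructions gain spectral power as tokens are added, these
    values drift upward relative to the ground-truth spectrum.
    """
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

# Hypothetical "ground truth" row and an attenuated reconstruction.
gt = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
recon = [0.0, 0.5, 0.0, -0.5, 0.0, 0.5, 0.0, -0.5]

ps_gt = power_spectrum(gt)
ps_recon = power_spectrum(recon)
# The reconstruction carries less power at the dominant frequency (k = 2).
```

Plotting such spectra for reconstructions at increasing token counts (red for ground truth, blue for reconstruction, as in the figure) makes the coupling, or its absence, directly visible.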
Model Architecture: Structured Visual Tokenization with a PCA-like Hierarchy
Our model introduces a structured 1D causal tokenization framework designed to efficiently encode images into a compact and semantically meaningful latent space. Unlike conventional tokenizers that encode images into a 2D grid of latent vectors, our approach enforces a hierarchical PCA-like structure, where each token progressively refines the image representation in a coarse-to-fine manner.
Key Components of Our Architecture
1. Causal ViT Encoder
The encoding process begins with a Causal Vision Transformer (ViT) Encoder, which receives an input image and generates concept tokens in a 1D sequence. Unlike conventional 2D latent spaces, these tokens are ordered causally, ensuring that earlier tokens capture the most salient semantic features, while later tokens refine details.
See the figure below, where the encoder transforms the input image into a structured token sequence.
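As a rough illustration (not the paper's actual implementation), the causal ordering over concept tokens can be imposed with a standard lower-triangular attention mask, so token i may only attend to tokens 0..i:

```python
def causal_mask(num_tokens):
    """Lower-triangular attention mask over the concept-token sequence:
    entry [i][j] is 1 iff token i may attend to token j (j <= i), which
    is what gives a 1D token sequence its causal order."""
    return [[1 if j <= i else 0 for j in range(num_tokens)]
            for i in range(num_tokens)]

mask = causal_mask(4)
# Row 0 sees only token 0; row 3 sees all four tokens.
```

Because each token can only depend on its predecessors, later tokens are structurally forced into a refining role.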
2. Nested Classifier-Free Guidance (CFG) for PCA-like Structure
To enforce a PCA-like hierarchy, we apply a nested classifier-free guidance (CFG) strategy, where later tokens are progressively replaced with a null condition token during training. This forces earlier tokens to prioritize capturing the most critical information, leading to an interpretable, structured representation.
The image above illustrates how nested CFG selectively refines token importance.
This PCA-like structure is formally proven in our paper.
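A minimal sketch of the nested-CFG training trick described above, with hypothetical names (`nested_cfg_drop`, `"NULL"`); in the real model the dropped positions are conditioning embeddings fed to the decoder, not strings. Per training step, a cutoff k is sampled and every token from position k onward is replaced with a shared null condition token:

```python
import random

def nested_cfg_drop(tokens, null_token, rng=random):
    """One training-step view of nested CFG: keep a random-length
    prefix of the token sequence and replace the suffix with the null
    condition token, forcing earlier positions to carry the most
    critical information."""
    k = rng.randint(1, len(tokens))  # always keep at least one token
    return tokens[:k] + [null_token] * (len(tokens) - k)

tokens = ["z1", "z2", "z3", "z4"]
dropped = nested_cfg_drop(tokens, "NULL")
# e.g. ["z1", "z2", "NULL", "NULL"] when k == 2
```

Since any prefix must suffice to condition the decoder, the learned representation acquires the PCA-like property that early tokens explain as much of the image as possible.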
3. Diffusion-Based DiT Decoder
A Diffusion Transformer (DiT) Decoder reconstructs the image from these structured latent tokens. Unlike traditional deterministic decoders, our diffusion-based approach naturally follows a spectral autoregressive process, reconstructing images from low to high frequencies. This prevents semantic-spectrum coupling, ensuring that tokens encode high-level meaning instead of low-level artifacts.
The figure below demonstrates how image reconstructions progressively improve as more tokens are used.
Coarse-to-Fine Token Representation
Our hierarchical tokenization closely resembles the global precedence effect in human vision, where broader structures are perceived before finer details. This property allows our tokenizer to adaptively reconstruct images with varying numbers of tokens, making it highly flexible for compression, image generation, and recognition tasks.
As shown in the image above, increasing the number of tokens leads to progressively better reconstructions while maintaining structured information.
Why Our Model is Different
- 1D Causal Tokenization: Unlike 2D token grids, our model enforces an ordered structure where token importance follows a hierarchical pattern.
- PCA-Like Variance Decay: Earlier tokens contain the most significant information, while later tokens refine details, mimicking PCA decomposition.
- Diffusion-Based Decoding: Prevents semantic-spectrum entanglement, ensuring that tokens capture high-level meaning rather than low-level frequency artifacts.
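For intuition on the PCA analogy, here is a toy numerical aside (not from the paper): in classical PCA the variance explained by successive components decays, and Semanticist's tokens are trained toward the same ordered importance. A pure-Python sketch with a hand-built 2D dataset, using the closed-form eigenvalues of a 2x2 covariance matrix:

```python
import math

def pca_variances_2d(points):
    """Eigenvalues of the 2x2 covariance matrix (closed form), sorted
    descending: the variance captured by each principal component.
    Earlier components dominate, like earlier tokens."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    mean = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return [mean + spread, mean - spread]

# Points stretched along y = x: almost all variance on component 1.
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.05), (4, 4.0)]
v1, v2 = pca_variances_2d(pts)
# v1 >> v2: the first component explains most of the data,
# just as the first tokens reconstruct most of the image.
```

The paper's variance-decay plots measure the analogous quantity per token position in the learned latent space.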
Experimental Results
We validate Semanticist through extensive experiments, demonstrating:
- State-of-the-art Reconstruction: Achieves the lowest FID scores among visual tokenizers.
- Better Generative Performance: Autoregressive models trained on Semanticist tokens match leading baselines with fewer tokens.
- Improved Interpretability: The PCA-like hierarchy aligns with human perception and enhances linear-probing classification accuracy.
Quantitative Results Table
Broader Impact & Limitations
Potential Applications
- Image Compression: More efficient representations with reduced redundancy.
- Generative Models: Enhanced image synthesis with structured latents.
- Data Analysis: Improved interpretability and feature extraction.
Limitations
- Inference Speed: Diffusion decoding is slower than direct pixel regression.
- Alternative Architectures: Flow-matching or consistency models could improve efficiency.
- Adaptive Tokenization: Dynamic token lengths could further optimize representation.
Ethical Considerations
Like all generative models, our approach could be misused for deepfake creation or content manipulation. We encourage responsible use and propose safeguards to mitigate misuse.
Acknowledgements
We sincerely appreciate the dedicated support we received from the participants of the human study. We are also grateful to Anlin Zheng and Haochen Wang for helpful suggestions on the design of technical details.
Author Contribution Statement
X.W. and B.Z. conceived the study and guided its overall direction and planning. X.W. proposed the original idea of semantically meaningful decomposition for image tokenization. B.Z. developed the theoretical framework for nested CFG and the semantic spectrum coupling effect and conducted the initial feasibility experiments. X.W. further refined the model architecture and scaled the study to ImageNet. B.Z. led the initial draft writing, while X.W. designed the figures and plots. I.E., J.D., and X.Q. provided valuable feedback on the manuscript. All authors contributed critical feedback, shaping the research, analysis, and final manuscript.
Citation: If you find our work useful, please cite us!
@inproceedings{semanticist,
title={``{P}rincipal Components'' Enable A New Language of Images},
author={Wen, Xin and Zhao, Bingchen and Elezi, Ismail and Deng, Jiankang and Qi, Xiaojuan},
booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}