X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field is to use a vision encoder derived from vision-language contrastive learning (CL), which excels at capturing overall representations but struggles to capture detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency, detailed visual representations, obtained through masked image modeling (MIM), with semantically enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former, a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative interaction mechanism. Specifically, X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further bootstraps vision-to-language generative learning from a frozen LLM to ensure that visual features from X-Former can be interpreted by the LLM. To demonstrate the effectiveness of our approach, we assess its performance on tasks demanding detailed visual understanding. Extensive evaluations indicate that X-Former excels in visual reasoning tasks involving both structural and semantic categories in the GQA dataset. Assessment on a fine-grained visual perception benchmark further confirms its superior capabilities in visual understanding.
X-Former addresses a key limitation of MLLMs: they often fail to capture subtle visual cues, such as object counts or spatial relations, that are essential for real-world visual reasoning.
🚀 Main Contributions
- Dual-Encoder Fusion for Global–Local Understanding: integrates CLIP-ViT (contrastive learning) and MAE-ViT (masked image modeling) as frozen backbones; CLIP provides broad semantic context, while MAE contributes detailed, high-frequency visual cues.
- Dual Cross-Attention Transformer Module: a lightweight dual cross-attention design enables bidirectional interaction so local MAE features enrich global CLIP features and vice versa, yielding semantic, detail-aware embeddings (see the sketch after this list).
- Unified and Data-Efficient Pre-training: jointly optimizes ITC, ITM, ITG, and image reconstruction objectives using only 14M image–text pairs, over 10× fewer than BLIP-2, while maintaining strong performance.
- Plug-and-Play Alignment with Frozen LLMs: aligns X-Former outputs to frozen LLMs (e.g., OPT, FlanT5) via a single adapter layer; requires no curated multimodal instruction-tuning datasets and is easily portable to other MLLMs.
- Fine-Grained Visual Reasoning: consistently outperforms BLIP-2 on VQAv2, GQA, and OK-VQA, with significant gains in object counting and multi-class identification, demonstrating enhanced fine-grained visual perception.
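To make the dual cross-attention idea concrete, here is a minimal, hypothetical PyTorch sketch: learnable queries first attend to frozen CLIP-ViT tokens for global semantics, then to frozen MAE-ViT tokens for local detail. The module name, dimensions, head count, and the sequential (rather than fully bidirectional) flow are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttentionBlock(nn.Module):
    """Hypothetical sketch: learnable queries mix global (CLIP) and local (MAE) features."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn_clip = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_mae = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q1 = nn.LayerNorm(dim)
        self.norm_q2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, queries, clip_feats, mae_feats):
        # Queries first gather global semantics from frozen CLIP-ViT tokens ...
        q = self.norm_q1(queries)
        global_ctx, _ = self.attn_clip(q, clip_feats, clip_feats)
        queries = queries + global_ctx
        # ... then refine with high-frequency local detail from frozen MAE-ViT tokens.
        q = self.norm_q2(queries)
        local_ctx, _ = self.attn_mae(q, mae_feats, mae_feats)
        queries = queries + local_ctx
        return self.norm_out(queries + self.ffn(queries))

# Usage: a small set of learnable queries attends to patch tokens from both frozen encoders.
queries = torch.randn(2, 32, 768)      # (batch, num_queries, dim)
clip_feats = torch.randn(2, 257, 768)  # frozen CLIP-ViT token features (illustrative length)
mae_feats = torch.randn(2, 197, 768)   # frozen MAE-ViT token features (illustrative length)
out = DualCrossAttentionBlock()(queries, clip_feats, mae_feats)  # (2, 32, 768)
```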
Video
X-Former Stage 1: Pre-Training
Figure: Overview of X-Former, which extends Q-Former by introducing a dual cross-attention module to capture both local and global visual features.
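Stage 1 jointly optimizes the ITC, ITM, ITG, and image-reconstruction objectives. As one concrete illustration, below is a minimal, hypothetical sketch of a BLIP-2-style image-text contrastive (ITC) term computed over X-Former query outputs; the function name, max-over-queries pooling, temperature, and the assumption of pre-normalized projections are illustrative, and the remaining three losses would be added to this term.

```python
import torch
import torch.nn.functional as F

def itc_loss(query_embeds, text_embed, temperature=0.07):
    """Hypothetical ITC sketch: for each image, keep the query token most similar to
    the paired text, then contrast matched vs. unmatched image-text pairs in the batch."""
    # query_embeds: (B, num_queries, D) projected and L2-normalized query outputs
    # text_embed:   (B, D) projected and L2-normalized text embeddings
    sim = torch.einsum("bqd,cd->bcq", query_embeds, text_embed)  # (B, B, Q) similarities
    sim, _ = sim.max(dim=-1)                                     # best query per image-text pair
    logits = sim / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric image-to-text and text-to-image cross-entropy.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```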
X-Former Stage 2: LLM Alignment
Figure: X-Former LLM alignment. X-Former queries are aligned with a frozen LLM; an FC layer adapts the query output (Z′) to the LLM embedding space.
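A minimal sketch of this alignment step, assuming a single fully connected projection and a soft visual prefix prepended to the text embeddings of the frozen LLM; the class name and the LLM hidden size are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LLMAlignmentHead(nn.Module):
    """Hypothetical Stage-2 sketch: one FC layer maps X-Former query outputs Z'
    into the frozen LLM's embedding space as a visual prefix."""
    def __init__(self, query_dim=768, llm_dim=2560):  # llm_dim is an assumed decoder width
        super().__init__()
        self.fc = nn.Linear(query_dim, llm_dim)

    def forward(self, z_prime, text_embeds):
        # z_prime: (B, num_queries, query_dim); text_embeds: (B, T, llm_dim)
        visual_prefix = self.fc(z_prime)                        # (B, num_queries, llm_dim)
        return torch.cat([visual_prefix, text_embeds], dim=1)   # sequence fed to the frozen LLM
```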
Experimental Results
Zero-shot Visual Question Answering Performance on VQAv2, GQA, OK-VQA
Zero-Shot Results on the VQAv2 dataset; * indicates the result is obtained using the official checkpoint.
Zero-Shot Results on the GQA and OK-VQA datasets; * indicates the result is obtained using the official checkpoint.
Zero-shot Fine-Grained Visual Perception Evaluation.
Fine-Grained Visual Perception evaluation on Object Counting (OC) & Multi-class Identification (MCI) tasks.
BibTeX
@inproceedings{Swetha_Xformer_ECCV2024,
title={X-Former: Unifying Contrastive and Reconstruction Learning for {MLLMs}},
author={Sirnam, Swetha and Yang, Jinyu and Neiman, Tal and Rizve, Mamshad Nayeem and Tran, Son and Yao, Benjamin and Chilimbi, Trishul and Shah, Mubarak},
booktitle={European Conference on Computer Vision},
pages={146--162},
year={2024},
organization={Springer}
}