VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Abstract
Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains.
To tackle this issue, we propose Variational Multi-Modal Prompt Learning (VaMP), a novel framework that enables sample-specific, uncertainty-aware prompt tuning for multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to adapt its behavior to the content of each input. To better integrate local and global semantics, we introduce a class-aware prior derived from the instance representation and the class prototype. Building on these components, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end via reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure.
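To make the formulation concrete, the sketch below illustrates the two ingredients named above: reparameterized sampling of instance-conditioned latent prompts from an image-conditioned posterior, and a KL regularizer between that posterior and a Gaussian prior. This is a minimal illustration under assumed module names and dimensions (VariationalPromptSampler, gaussian_kl, 512-d features, 4 prompt tokens), not the released implementation; a standard-normal prior stands in for the class-aware prior purely to keep the example self-contained.

# Minimal sketch (not the authors' code): sample-specific latent prompts via the
# reparameterization trick, regularized by a KL term between an image-conditioned
# posterior and a prior. Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VariationalPromptSampler(nn.Module):
    """Maps an image feature to a Gaussian posterior over latent prompt tokens."""

    def __init__(self, feat_dim: int = 512, prompt_len: int = 4, prompt_dim: int = 512):
        super().__init__()
        self.prompt_len = prompt_len
        self.prompt_dim = prompt_dim
        # Amortized posterior q(z | x): predict mean and log-variance from the image feature.
        self.to_mu = nn.Linear(feat_dim, prompt_len * prompt_dim)
        self.to_logvar = nn.Linear(feat_dim, prompt_len * prompt_dim)

    def forward(self, img_feat: torch.Tensor):
        mu = self.to_mu(img_feat)
        logvar = self.to_logvar(img_feat)
        # Reparameterized sample z = mu + sigma * eps keeps sampling differentiable.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        prompts = z.view(-1, self.prompt_len, self.prompt_dim)
        return prompts, mu, logvar


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1)


if __name__ == "__main__":
    sampler = VariationalPromptSampler()
    img_feat = torch.randn(8, 512)  # stand-in for CLIP image features
    prompts, mu_q, logvar_q = sampler(img_feat)
    # A class-aware prior would supply mu_p / logvar_p; a standard normal is used here
    # only to keep the sketch self-contained.
    mu_p, logvar_p = torch.zeros_like(mu_q), torch.zeros_like(logvar_q)
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).mean()
    print(prompts.shape, kl.item())

The sampled prompt tokens would be injected into the prompted CLIP branches, and the KL term added to the task loss; the exact conditioning and weighting follow the paper rather than this sketch.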
Framework
Overview of the VaMP framework. (a) Class-Aware Prior Construction: CLIP's frozen image encoder processes the training samples to generate offline class prototypes for subsequent adaptation. (b) Variational Multi-Modal Prompt Adaptation (VMPA): a variational modeling mechanism in which the image-conditioned posterior and the class-prototype-based prior over latent prompt distributions are aligned through KL-divergence regularization. (c) Training Pipeline: the full architecture of the proposed VaMP framework.
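As a rough illustration of step (a), the snippet below averages frozen CLIP image features per class to form offline prototypes, which would then inform the class-aware prior in (b). It assumes the openai/CLIP package and a ViT-B/16 backbone, and the helper name build_class_prototypes is ours; treat it as a sketch of the idea, not the authors' released code.

# Minimal sketch (assumptions, not the released code): offline class prototypes from a
# frozen CLIP image encoder, as in part (a) of the framework figure.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


@torch.no_grad()
def build_class_prototypes(dataloader, device=None):
    """Average L2-normalized image features per class over the few-shot training set."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model, _ = clip.load("ViT-B/16", device=device)
    model.eval()
    feats_per_class = {}
    for images, labels in dataloader:  # images already preprocessed with CLIP's transform
        feats = model.encode_image(images.to(device)).float()
        feats = F.normalize(feats, dim=-1)
        for f, y in zip(feats, labels.tolist()):
            feats_per_class.setdefault(y, []).append(f)
    prototypes = torch.stack(
        [F.normalize(torch.stack(feats_per_class[c]).mean(0), dim=-1)
         for c in sorted(feats_per_class)]
    )
    return prototypes  # [num_classes, feat_dim], computed once and kept fixed during tuning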
Performance
We evaluate VaMP on three challenging adaptation settings: base-to-novel generalization, cross-dataset generalization, and domain generalization. VaMP consistently outperforms strong multi-modal prompt baselines while maintaining high parameter efficiency.
Base-to-Novel Generalization
Table 1: Comparison with state-of-the-art methods on base-to-novel generalization across 11 datasets.
Cross-Dataset Generalization
Table 2: Comparison with state-of-the-art methods on cross-dataset evaluation across 10 datasets.
Domain Generalization
Table 3: Comparison with state-of-the-art methods on domain generalization across 4 datasets.
BibTeX
@inproceedings{Cheng2025VaMP,
  author    = {Silin Cheng and Kai Han},
  title     = {VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models},
  booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}