The attacker’s goal is to manipulate the model into responding to original concept images with texts consistent with a destination concept, using
stealthy poison samples that can evade human visual inspection.
A poison sample consists of a poison image that looks like a clean image from the destination concept, and a congruent text description. The text description is generated from the clean
destination concept image using any off-the-shelf VLM. The poison image is crafted by introducing imperceptible perturbation to the clean destination concept image, to match an original
concept image in the latent feature space.
When training on these poison samples, the VLM learns to associate the the original concept feature (in the poison image) with the destination concept texts, achieving the attacker's goal.
| CARVIEW |
Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
Ning Yu6, Tom Goldstein1, Furong Huang1
3Salesforce Research 4Apple
5University of Waterloo 6Netflix Eyeline Studios
Neurips, 2024
Responses of the clean and poisoned LLaVA-1.5 models. The poisoned samples are crafted using a different VLM, MiniGPT-v2.
Abstract
Vision-Language Models (VLMs) excel in generating textual responses from visual inputs,
yet their versatility raises significant security concerns. This study takes the first step
in exposing VLMs' susceptibility to data poisoning attacks that can manipulate responses to
innocuous, everyday prompts. We introduce Shadowcast, a stealthy data poisoning attack method
where poison samples are visually indistinguishable from benign images with matching texts.
Shadowcast demonstrates effectiveness in two attack types. The first is Label Attack, tricking
VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden.
The second is Persuasion Attack, which leverages VLMs' text generation capabilities to
craft narratives, such as portraying junk food as health food, through persuasive and
seemingly rational descriptions. We show that Shadowcast are highly effective in achieving
attacker's intentions using as few as 50 poison samples. Moreover, these poison samples
remain effective across various prompts and are transferable across different VLM
architectures in the black-box setting. This work reveals how poisoned VLMs can generate
convincing yet deceptive misinformation and underscores the importance of data quality for
responsible deployments of VLMs.
TL;DR: Shadowcast is the first stealthy data poisoning attack against Vision-Language Models (VLMs).
The poisoned VLMs can disseminate misinformation coherently, subtly shifting users’ perceptions.
Method
Illustration of how Shadowcast crafts a poison sample with visually matching image and text descriptions.
Below is another example of the poison sample where the original concept is "Junk Food" and the destination concept is "Healthy and Nutritious Food". The poison image looks like the clean
destination concept image, and the text visually matches the image.
Experiment
We consider the following four tasks for poisoning attacks exemplifying the practical risks of VLMs, ranging from misidentifying political figures to disseminating healthcare misinformation.
The red ones are Label Attacks, tricking VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden.
The green ones Persuasion Attacks, which leverage VLMs’ text generation capabilities to craft narratives, such as portraying junk food as health food,
through persuasive and seemingly rational descriptions.
Attack tasks and their associated concepts.
We study both grey-box and black-box scenarios. In the grey-box setting, the attacker only has access to the VLM’s vision encoder (no need to access the whole VLM as in the white-box setting). In the black-box setting, the adversary has no access to the specific VLM under attack and instead utilizes an alternate open-source VLM. We evaluate the attack success rates under different poison ratios.
Grey-box results
Attack success rate of Label Attack for LLaVA1.5.
Attack success rate of Persuasion Attack for LLaVA-1.5.
Shadowcast begins to demonstrate a significant impact (over 60% attack success rate) with a poison rate of under 1% (or 30 poison samples)!
Black-box results
(Architecture transferability) Attack success rate for LLaVA-1.5 when InstructBLIP (left) and MiniGPTv2 (right) are used to craft poison images.
Shadowcast is still effective even across different VLM architectures!
Attack Robustness
What if the VLM uses image data augmentation (as a defense method) during training? Will the poisoned VLM exhibit targeted behaviour when different text prompts are used? Our evaluation shows positive results.
(Data augmentation) Attack success rate for LLaVA-1.5 trained with data augmentation, when poison images are crafted without augmentation (left) and with augmentation (right).
(Generalization to diverse prompts) Attack success rates when diverse prompts are used during test time.
Ethics and Disclosure
This study uncovers a pivotal vulnerability in the visual instruction tuning of large vision language models (VLMs), demonstrating how adversaries might exploit data poisoning to disseminate misinformation undetected. While the attack methodologies and objectives detailed in this research introduce new risks to VLMs, the concept of data poisoning is not new, having been a topic of focus in the security domain for over a decade. By bringing these findings to light, our intent is not to facilitate attacks but rather to sound an alarm in the VLM community. Our disclosure aims to elevate vigilance among VLM developers and users, advocate for stringent data examination practices, and catalyze the advancement of robust data cleaning and defensive strategies. In doing so, we believe that exposing these vulnerabilities is a crucial step towards fostering comprehensive studies in defense mechanisms and ensuring the secure deployment of VLMs in various applications.
BibTeX
@article{
xu2024shadowcast,
title={Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models},
author={Xu, Yuancheng and Yao, Jiarui and Shu, Manli and Sun, Yanchao and Wu, Zichu and Yu, Ning and Goldstein, Tom and Huang, Furong},
journal={arXiv preprint arXiv:2402.06659},
year={2024}
}