This project addresses a critical gap in the evaluation of vision-language models (VLMs): their ability to understand and reason about causal relationships. Although VLMs have demonstrated impressive performance on many downstream tasks, it remains unclear whether they truly grasp causal relations or merely rely on object-recognition and activity-identification shortcuts.
To bridge this gap, we introduce two new benchmarks, VQA-Causal and VCR-Causal, designed to isolate and rigorously evaluate VLMs' causal reasoning abilities.
Our key findings:
- VLMs excel at object and activity recognition but perform poorly on causal reasoning, often only marginally better than random guessing.
- This shortcoming is primarily due to a severe lack of explicit causal expressions in widely used training datasets.
- Fine-tuning with hard negative cases can significantly improve a model’s causal reasoning ability while preserving downstream performance and generalization.
Our study highlights a major limitation of current VLMs and lays the groundwork for future research on causal understanding.
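To make the hard-negative finding concrete, here is a minimal sketch of what such fine-tuning can look like. It is an illustration under stated assumptions rather than this project's exact recipe: hard negatives are assumed to be built by swapping the clauses around "because", and they join a standard CLIP-style contrastive loss as extra text candidates; the checkpoint name and helper functions are ours.

```python
# Minimal hard-negative fine-tuning sketch (assumptions: "because"-swap
# negatives, CLIP-style contrastive loss; not this project's exact recipe).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def make_hard_negative(caption: str) -> str:
    """Swap cause and effect: 'A because B' -> 'B because A'."""
    effect, sep, cause = caption.partition(" because ")
    return f"{cause} because {effect}" if sep else caption

def contrastive_loss_with_hard_negatives(images, captions):
    # Text candidates: B positive captions followed by B swapped negatives.
    texts = captions + [make_hard_negative(c) for c in captions]
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"]), dim=-1)
    logits = model.logit_scale.exp() * img @ txt.T   # shape (B, 2B)
    targets = torch.arange(len(captions))            # i-th image <-> i-th caption
    return F.cross_entropy(logits, targets)
```

Because each hard negative differs from its positive only in causal order, the loss cannot be driven down by object or activity recognition alone, which is what pushes the model toward genuinely causal features.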
The code for the causal order reasoning experiments can be found in the causaltest/ directory.
For example, to run the VQA-Causal tests:
python causaltest_clipfamily.py # CLIP-family models (e.g. ViT-L/14, ViT-B/32, NegCLIP, RobustCLIP)
python causaltest_flava.py # FLAVA
python llava_test.py # LLaVA
python vicuna_vqa.py # Vicuna

Running the scripts above yields results on CLIP-family models (CLIP ViT-L/14, CLIP ViT-B/32, NegCLIP, RobustCLIP), FLAVA, LLaVA, and Vicuna.
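For intuition, here is a minimal sketch of the kind of check these scripts perform, shown with a Hugging Face CLIP checkpoint. The image path and caption pair are hypothetical; the actual evaluation protocol is in causaltest_clipfamily.py.

```python
# Sketch of a single causal-order test item (illustrative; the real
# protocol and data are in causaltest_clipfamily.py and datasets/).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # hypothetical image
correct = "The ground is wet because it rained."  # correct causal order
swapped = "It rained because the ground is wet."  # cause/effect reversed

inputs = processor(text=[correct, swapped], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image     # shape (1, 2)

# The model gets the item right if it scores the correct order higher.
print("correct causal order preferred:", bool(logits[0, 0] > logits[0, 1]))
```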
The code for the object-and-activity (O&A) tests can be found in the multichoice/ directory. For example:
python multichoice_clipfamily.py # CLIP-family models (ViT-L/14, ViT-B/32, NegCLIP, RobustCLIP)
python multichoice_flava.py # FLAVA

Running the scripts above yields results on CLIP-family models (CLIP ViT-L/14, CLIP ViT-B/32, NegCLIP, RobustCLIP) and FLAVA.
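The O&A tests follow the same pattern but ask a much easier question: pick the option with the right objects and activities. A compact sketch, again with a CLIP checkpoint and hypothetical answer options:

```python
# Sketch of an object-and-activity multiple-choice item (illustrative;
# real answer options come from the benchmark files in datasets/).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")            # hypothetical image
options = ["a dog running on the beach",     # hypothetical answer options
           "a cat sleeping on a sofa",
           "a horse jumping over a fence"]

inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # shape (1, len(options))
print("predicted option:", options[logits.argmax().item()])
```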
The code for data analysis can be found in the data_analysis/ directory. For example:
python analysis_coco.py # COCO dataset analysis
python analysis_laion400m.py # LAION-400M dataset analysis
python analysis_vcr.py # VCR dataset analysis
python analysis_vqaval.py # VQA dataset analysis

Running the scripts above yields the data analysis results on the MSCOCO, LAION-400M, VCR, and VQA datasets.
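These analyses motivate the training-data finding above. As a rough illustration of the kind of statistic involved, the sketch below counts captions containing explicit causal connectives; the marker list, file path, and schema are assumptions, not the exact analysis in data_analysis/.

```python
# Rough sketch: frequency of explicit causal expressions in captions
# (marker list and COCO-style path/schema are illustrative assumptions).
import json
import re

CAUSAL_MARKERS = ["because", "since", "therefore", "due to",
                  "as a result", "leads to", "causes"]

def causal_caption_rate(captions):
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CAUSAL_MARKERS)) + r")\b",
                         re.IGNORECASE)
    return sum(bool(pattern.search(c)) for c in captions) / max(len(captions), 1)

with open("annotations/captions_val2017.json") as f:   # hypothetical path
    captions = [a["caption"] for a in json.load(f)["annotations"]]
print(f"captions with explicit causal expressions: {causal_caption_rate(captions):.2%}")
```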
The datasets can be found in the datasets/ directory. For example, the VQA-Causal and VCR-Causal benchmarks are located in the datasets/benchmarks/ directory.