This project addresses a critical gap in the evaluation of vision-language models (VLMs): their ability to understand and reason about causal relationships. Although VLMs have demonstrated impressive performance on many downstream tasks, it remains unclear whether they truly grasp causal relations or merely rely on object-recognition and activity-identification shortcuts.
To bridge this gap, we introduce two new benchmarks, VQA-Causal and VCR-Causal, designed to isolate and rigorously evaluate VLMs' causal reasoning abilities.
Our key findings:
- VLMs excel at object and activity recognition but perform poorly on causal reasoning, often only marginally better than random guessing.
- This shortcoming is primarily due to a severe lack of explicit causal expressions in widely used training datasets.
- Fine-tuning with hard negative cases can significantly improve a model’s causal reasoning ability while preserving downstream performance and generalization.
Our study highlights a major limitation of current VLMs and lays the groundwork for future research on causal understanding.
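To make the hard-negative finding concrete, here is a minimal sketch of what such fine-tuning can look like. It is an illustration under stated assumptions rather than this project's exact recipe: hard negatives are assumed to be built by swapping the clauses around "because", and they join a standard CLIP-style contrastive loss as extra text candidates; the checkpoint name and helper functions are ours.

```python
# Minimal hard-negative fine-tuning sketch (assumptions: "because"-swap
# negatives, CLIP-style contrastive loss; not this project's exact recipe).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def make_hard_negative(caption: str) -> str:
    """Swap cause and effect: 'A because B' -> 'B because A'."""
    effect, sep, cause = caption.partition(" because ")
    return f"{cause} because {effect}" if sep else caption

def contrastive_loss_with_hard_negatives(images, captions):
    # Text candidates: B positive captions followed by B swapped negatives.
    texts = captions + [make_hard_negative(c) for c in captions]
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"]), dim=-1)
    logits = model.logit_scale.exp() * img @ txt.T   # shape (B, 2B)
    targets = torch.arange(len(captions))            # i-th image <-> i-th caption
    return F.cross_entropy(logits, targets)
```

Because each hard negative differs from its positive only in causal order, the loss cannot be driven down by object or activity recognition alone, which is what pushes the model toward genuinely causal features.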
The code for the causal order reasoning experiments can be found in the causaltest/ directory.
For example, to run the VQA-Causal tests:
python causaltest_clipfamily.py # CLIP-family models (e.g. ViT-L/14, ViT-B/32, NegCLIP, RobustCLIP)
python causaltest_flava.py # FLAVA
python llava_test.py # LLaVA
python vicuna_vqa.py # Vicuna

Running the scripts above yields results on CLIP-family models (CLIP ViT-L/14, CLIP ViT-B/32, NegCLIP, RobustCLIP), FLAVA, LLaVA, and Vicuna.
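For intuition, here is a minimal sketch of the kind of check these scripts perform, shown with a Hugging Face CLIP checkpoint. The image path and caption pair are hypothetical; the actual evaluation protocol is in causaltest_clipfamily.py.

```python
# Sketch of a single causal-order test item (illustrative; the real
# protocol and data are in causaltest_clipfamily.py and datasets/).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # hypothetical image
correct = "The ground is wet because it rained."  # correct causal order
swapped = "It rained because the ground is wet."  # cause/effect reversed

inputs = processor(text=[correct, swapped], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image     # shape (1, 2)

# The model gets the item right if it scores the correct order higher.
print("correct causal order preferred:", bool(logits[0, 0] > logits[0, 1]))
```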
The code for the object-and-activity (O&A) tests can be found in the multichoice/ directory. For example:
python multichoice_clipfamily.py # CLIP-family models (ViT-L/14, ViT-B/32, NegCLIP, RobustCLIP)
python multichoice_flava.py # FLAVA

Running the scripts above yields results on CLIP-family models (CLIP ViT-L/14, CLIP ViT-B/32, NegCLIP, RobustCLIP) and FLAVA.
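The O&A tests follow the same pattern but ask a much easier question: pick the option with the right objects and activities. A compact sketch, again with a CLIP checkpoint and hypothetical answer options:

```python
# Sketch of an object-and-activity multiple-choice item (illustrative;
# real answer options come from the benchmark files in datasets/).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")            # hypothetical image
options = ["a dog running on the beach",     # hypothetical answer options
           "a cat sleeping on a sofa",
           "a horse jumping over a fence"]

inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # shape (1, len(options))
print("predicted option:", options[logits.argmax().item()])
```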
The code for data analysis can be found in the data_analysis/ directory. For example:
python analysis_coco.py # COCO dataset analysis
python analysis_laion400m.py # LAION-400M dataset analysis
python analysis_vcr.py # VCR dataset analysis
python analysis_vqaval.py # VQA dataset analysis

Running the scripts above yields the data analysis results on the MSCOCO, LAION-400M, VCR, and VQA datasets.
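These analyses motivate the training-data finding above. As a rough illustration of the kind of statistic involved, the sketch below counts captions containing explicit causal connectives; the marker list, file path, and schema are assumptions, not the exact analysis in data_analysis/.

```python
# Rough sketch: frequency of explicit causal expressions in captions
# (marker list and COCO-style path/schema are illustrative assumptions).
import json
import re

CAUSAL_MARKERS = ["because", "since", "therefore", "due to",
                  "as a result", "leads to", "causes"]

def causal_caption_rate(captions):
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CAUSAL_MARKERS)) + r")\b",
                         re.IGNORECASE)
    return sum(bool(pattern.search(c)) for c in captions) / max(len(captions), 1)

with open("annotations/captions_val2017.json") as f:   # hypothetical path
    captions = [a["caption"] for a in json.load(f)["annotations"]]
print(f"captions with explicit causal expressions: {causal_caption_rate(captions):.2%}")
```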
The datasets can be found in the datasets/ directory. For example, the VQA-Causal and VCR-Causal benchmarks are located in the datasets/benchmarks/ directory.