MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Zhejiang University of Technology, Lehigh University
Takeaway
- A Benchmark. We are the first to develop a comprehensive benchmark, MLLM-AS-A-JUDGE, for the multimodal domain, with human annotations to assess the judging capability of MLLMs in the tasks of Scoring Evaluation, Pair Comparison, and Batch Ranking.
- Two Datasets. We curate two human preference datasets: MLLM-AS-A-JUDGE-HQ with high-quality questions and MLLM-AS-A-JUDGE-HARD with hallucination instances. They serve as a rigorous testing ground to facilitate the development of MLLMs.
- Findings and Implications. Our evaluation of mainstream MLLMs reveals that while MLLMs align with human judgments in Pair Comparison tasks, notable discrepancies remain in Scoring Evaluation and Batch Ranking. Furthermore, MLLMs exhibit a range of biases and hallucinations, along with inconsistent judgments during the evaluation process, representing significant hurdles to establishing MLLMs as reliable judges.
Experiment Setups
- Models. We evaluate the judging performance of leading MLLMs – GPT-4V, Gemini-Pro-Vision-1.0, LLaVA-1.5-13b, LLaVA-1.6-7b/13b/34b, Qwen-VL-Plus/Max, and CogVLM – across three distinct evaluation settings. Adopting the “Analyze-then-Judge” paradigm, a one-step CoT approach, we first ask the MLLMs to analyze the responses and then provide a judgment based on their analysis.
- Metrics. After collecting the judgments produced by the MLLMs, we quantify their alignment with human annotations across the three settings, employing distinct metrics as follows:
- Scoring Evaluation: Following LLM-as-a-Judge, we compute the Pearson similarity between the MLLMs’ judgments and human ratings across the different sub-datasets.
- Pair Comparison: We measure the agreement between MLLM judgments and human decisions using accuracy, F1-score, and recall.
- Batch Ranking: We consolidate the ranking results into a single sequence and employ the normalized Levenshtein distance to evaluate the similarity between MLLM judgments and human annotations. (A minimal computation sketch of these three metrics follows this list.)
- Apart from these traditional metrics for assessing the similarity between MLLM and human judgments, we further evaluate the judgments provided by MLLMs to uncover latent bias and hallucination across 10 datasets. We also invite human annotators for further validation, focusing on the following aspects:
- Human Agreement: This involves a simple ‘yes’ or ‘no’ response to assess agreement with the MLLM judgments. While some judgments might appear reasonable, they may still be considered incorrect due to unique human perspectives. Hence, we conduct experiments on human agreement to address situations that traditional metrics may not adequately capture.
- Analysis Grading: Each MLLM analysis is assigned a score from 1 to 5, considering relevance, accuracy, creativity, and response granularity, detailed in Appendix F.
- Hallucination Detection: Given the propensity for hallucination issues in the complex reasoning chains and long-term vision-language contexts of MLLMs, we task human annotators with identifying any hallucinations in the analyses of MLLM judgments, adhering to established definitions of vision and language hallucination.
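The following is a minimal sketch of how the three alignment metrics above can be computed, assuming the human annotations and MLLM judgments for one sub-dataset are stored as parallel Python lists; the function names and data layout are illustrative, not the benchmark’s actual evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, recall_score

def scoring_alignment(human_scores, mllm_scores):
    """Scoring Evaluation: Pearson similarity between human ratings and MLLM scores."""
    r, _ = pearsonr(human_scores, mllm_scores)
    return r

def pair_alignment(human_choices, mllm_choices):
    """Pair Comparison: accuracy / F1 / recall of MLLM decisions against human decisions.
    Choices are labels such as 'A', 'B' (and 'Tie' in the with-tie setting)."""
    return {
        "accuracy": accuracy_score(human_choices, mllm_choices),
        "f1": f1_score(human_choices, mllm_choices, average="macro"),
        "recall": recall_score(human_choices, mllm_choices, average="macro"),
    }

def normalized_levenshtein(a, b):
    """Batch Ranking: normalized edit distance between two ranking strings,
    e.g. 'ABCD' (human) vs. 'ACBD' (MLLM); lower is better."""
    m, n = len(a), len(b)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)
    dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i, j] = min(dp[i - 1, j] + 1,                           # deletion
                           dp[i, j - 1] + 1,                           # insertion
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dp[m, n] / max(m, n)
```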
| Settings | MLLM | COCO | C.C. | Diff. | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | M2W | SciQA | Aes | MM-Vet | Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | LLaVA-1.5-13b | 0.247 | 0.227 | 0.060 | 0.242 | 0.093 | 0.245 | 0.109 | 0.237 | 0.177 | 0.071 | 0.424 | 0.279 | 0.414 | 0.322 | 0.225 |
| LLaVA-1.6-34b | 0.285 | 0.251 | -0.012 | 0.262 | 0.238 | 0.258 | 0.151 | 0.318 | 0.198 | 0.109 | 0.022 | 0.206 | 0.025 | 0.265 | 0.184 | |
| Gemini | 0.262 | 0.408 | - | 0.400 | 0.228 | 0.222 | 0.418 | 0.343 | 0.336 | 0.374 | 0.324 | 0.073 | 0.360 | 0.207 | 0.304 | |
| GPT-4V | 0.454 | 0.507 | 0.458 | 0.645 | 0.606 | 0.624 | 0.579 | 0.645 | 0.620 | 0.431 | 0.185 | 0.383 | 0.401 | 0.326 | 0.490 | |
| Qwen-vl-max | 0.311 | 0.117 | 0.072 | 0.218 | 0.175 | 0.196 | 0.028 | 0.312 | 0.151 | 0.045 | 0.244 | 0.115 | 0.177 | 0.216 | 0.170 | |
| Pair w. Tie (↑) | LLaVA-1.5-13b | 0.273 | 0.478 | 0.286 | 0.273 | 0.657 | 0.510 | 0.369 | 0.383 | 0.456 | 0.484 | 0.347 | 0.223 | 0.389 | 0.254 | 0.384 |
| LLaVA-1.6-34b | 0.493 | 0.600 | 0.570 | 0.300 | 0.374 | 0.551 | 0.543 | 0.254 | 0.398 | 0.392 | 0.513 | 0.434 | 0.524 | 0.499 | 0.460 | |
| Gemini | 0.616 | 0.787 | - | 0.650 | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.370 | 0.262 | 0.190 | 0.312 | 0.509 | |
| GPT-4V | 0.696 | 0.824 | 0.847 | 0.639 | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.521 | 0.415 | 0.606 | 0.529 | 0.636 | |
| Qwen-vl-max | 0.403 | 0.464 | 0.372 | 0.494 | 0.438 | 0.500 | 0.533 | 0.479 | 0.421 | 0.421 | 0.411 | 0.392 | 0.325 | 0.474 | 0.438 | |
| Pair w.o. Tie (↑) | LLaVA-1.5-13b | 0.327 | 0.537 | 0.302 | 0.300 | 0.726 | 0.684 | 0.600 | 0.610 | 0.648 | 0.583 | 0.449 | 0.443 | 0.498 | 0.344 | 0.504 |
| LLaVA-1.6-34b | 0.607 | 0.824 | 0.855 | 0.402 | 0.587 | 0.750 | 0.758 | 0.381 | 0.503 | 0.564 | 0.712 | 0.679 | 0.694 | 0.762 | 0.648 | |
| Gemini | 0.717 | 0.840 | - | 0.770 | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.471 | 0.358 | 0.265 | 0.400 | 0.615 | |
| GPT-4V | 0.804 | 0.870 | 0.922 | 0.807 | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.699 | 0.647 | 0.755 | 0.659 | 0.773 | |
| Qwen-vl-max | 0.657 | 0.674 | 0.556 | 0.667 | 0.635 | 0.732 | 0.647 | 0.638 | 0.560 | 0.586 | 0.608 | 0.646 | 0.741 | 0.662 | 0.644 | |
| Batch (↓) | LLaVA-1.5-13b | 0.577 | 0.492 | 0.562 | 0.535 | 0.598 | 0.650 | 0.616 | 0.644 | 0.620 | 0.563 | 0.639 | 0.563 | 0.650 | 0.652 | 0.597 |
| LLaVA-1.6-34b | 0.449 | 0.411 | 0.500 | 0.561 | 0.575 | 0.544 | 0.483 | 0.552 | 0.542 | 0.479 | 0.529 | 0.437 | 0.500 | 0.450 | 0.501 | |
| Gemini | 0.287 | 0.299 | - | 0.473 | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.613 | 0.412 | 0.467 | 0.529 | 0.432 | |
| GPT-4V | 0.318 | 0.353 | 0.070 | 0.385 | 0.348 | 0.319 | 0.290 | 0.347 | 0.300 | 0.402 | 0.597 | 0.462 | 0.453 | 0.411 | 0.361 | |
| Qwen-vl-max | 0.477 | 0.407 | 0.500 | 0.480 | 0.507 | 0.515 | 0.493 | 0.539 | 0.468 | 0.407 | 0.563 | 0.503 | 0.444 | 0.500 | 0.486 | |
| Settings | MLLM | COCO | C.C. | Diffusion | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | Gemini | 0.783 | 0.739 | - | 0.618 | 0.536 | 0.621 | 0.749 | 0.630 | 0.712 | 0.702 | 0.677 |
| GPT-4V | 0.799 | 0.725 | 0.506 | 0.688 | 0.638 | 0.706 | 0.714 | 0.676 | 0.779 | 0.754 | 0.699 | |
| Pair (↑) | Gemini | 0.705 | 0.833 | - | 0.733 | 0.520 | 0.717 | 0.827 | 0.620 | 0.853 | 0.703 | 0.724 |
| GPT-4V | 0.821 | 0.926 | 0.873 | 0.794 | 0.618 | 0.752 | 0.790 | 0.796 | 0.797 | 0.766 | 0.793 | |
| Batch (↓) | Gemini | 0.642 | 0.639 | - | 0.333 | 0.330 | 0.473 | 0.511 | 0.315 | 0.422 | 0.554 | 0.469 |
| GPT-4V | 0.663 | 0.639 | 0.912 | 0.536 | 0.475 | 0.615 | 0.641 | 0.640 | 0.622 | 0.467 | 0.621 | |
| Settings | MLLM | COCO | C.C. | Diffusion | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | GPT-4V | 0.454 | 0.507 | 0.458 | 0.645 | 0.606 | 0.624 | 0.579 | 0.645 | 0.620 | 0.431 | 0.557 |
| GPT-4V (+CoT) | 0.246 | 0.165 | 0.192 | 0.385 | 0.397 | 0.400 | 0.298 | 0.443 | 0.423 | 0.038 | 0.299 | |
| Gemini | 0.262 | 0.408 | - | 0.400 | 0.228 | 0.222 | 0.418 | 0.343 | 0.336 | 0.374 | 0.299 | |
| Gemini (+CoT) | 0.127 | 0.068 | 0.117 | 0.220 | 0.132 | 0.182 | 0.105 | 0.140 | 0.222 | 0.128 | 0.144 | |
| Pair w. Tie (↑) | GPT-4V | 0.696 | 0.824 | 0.847 | 0.639 | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.683 |
| GPT-4V (+CoT) | 0.507 | 0.657 | 0.561 | 0.601 | 0.515 | 0.580 | 0.489 | 0.521 | 0.646 | 0.553 | 0.563 | |
| Gemini | 0.616 | 0.787 | - | 0.650 | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.609 | |
| Gemini (+CoT) | 0.233 | 0.239 | 0.420 | 0.207 | 0.284 | 0.329 | 0.352 | 0.357 | 0.247 | 0.239 | 0.291 | |
| Pair w.o. Tie (↑) | GPT-4V | 0.804 | 0.870 | 0.922 | 0.807 | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.806 |
| GPT-4V (+CoT) | 0.673 | 0.821 | 0.845 | 0.707 | 0.738 | 0.787 | 0.548 | 0.756 | 0.753 | 0.654 | 0.728 | |
| Gemini | 0.717 | 0.840 | - | 0.770 | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.723 | |
| Gemini (+CoT) | 0.267 | 0.275 | 0.573 | 0.264 | 0.414 | 0.424 | 0.427 | 0.511 | 0.299 | 0.319 | 0.377 | |
| Batch (↓) | GPT-4V | 0.323 | 0.344 | 0.092 | 0.401 | 0.367 | 0.341 | 0.302 | 0.364 | 0.313 | 0.407 | 0.325 |
| GPT-4V (+CoT) | 0.428 | 0.416 | - | 0.427 | 0.434 | 0.401 | 0.366 | 0.406 | 0.422 | 0.472 | 0.419 | |
| Gemini | 0.287 | 0.299 | - | 0.473 | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.400 | |
| Gemini (+CoT) | 0.441 | 0.481 | 0.542 | 0.595 | 0.494 | 0.533 | 0.483 | 0.569 | 0.486 | 0.463 | 0.509 | |
| MLLM | Settings | Score (↑) Pearson | Pair w. Tie (↑) | Pair w.o. Tie (↑) | Batch (↓) Edit Dis. |
|---|---|---|---|---|---|
| LLaMA2-70b | Vision Exp | 0.060 | 0.404 | 0.550 | 0.643 |
| | No Vision | 0.126 | 0.374 | 0.537 | 0.583 |
| Mixtral-8x7b | Vision Exp | 0.054 | 0.374 | 0.543 | 0.603 |
| | No Vision | 0.151 | 0.478 | 0.731 | 0.546 |
| GPT-3.5 | Vision Exp | 0.154 | 0.453 | 0.591 | 0.473 |
| | No Vision | 0.223 | 0.459 | 0.644 | 0.504 |
| GPT-4V | Vision Exp | 0.435 | 0.544 | 0.878 | 0.400 |
| | No Vision | 0.299 | 0.491 | 0.868 | 0.394 |
| Gemini | Vision Exp | 0.120 | 0.438 | 0.785 | 0.472 |
| | No Vision | 0.108 | 0.433 | 0.758 | 0.470 |
| Settings | MLLM | COCO | C.C. | Diff. | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | M2W | SciQA | Aes | MM-Vet | Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | CogVLM | 0.107 | -0.048 | 0.049 | -0.158 | 0.065 | 0.097 | -0.131 | -0.135 | 0.278 | 0.157 | - | - | - | - | 0.028 |
| GPT-4V | 0.454 | 0.507 | 0.458 | 0.645 | 0.606 | 0.624 | 0.579 | 0.645 | 0.620 | 0.431 | 0.185 | 0.383 | 0.401 | 0.326 | 0.490 | |
| LLaVA-1.5-13b | 0.247 | 0.227 | 0.060 | 0.242 | 0.093 | 0.245 | 0.109 | 0.237 | 0.177 | 0.071 | 0.424 | 0.279 | 0.414 | 0.322 | 0.225 | |
| LLaVA-1.6-7b | 0.300 | 0.243 | 0.058 | 0.200 | 0.090 | 0.193 | 0.044 | 0.085 | 0.228 | 0.026 | 0.299 | 0.156 | 0.148 | 0.171 | 0.160 | |
| LLaVA-1.6-13b | 0.289 | 0.226 | -0.110 | 0.078 | 0.056 | 0.086 | 0.062 | 0.120 | 0.163 | 0.200 | 0.140 | 0.136 | 0.163 | 0.183 | 0.128 | |
| LLaVA-1.6-34b | 0.285 | 0.251 | -0.012 | 0.262 | 0.238 | 0.258 | 0.151 | 0.318 | 0.198 | 0.109 | 0.022 | 0.206 | 0.025 | 0.265 | 0.184 | |
| Gemini-Pro | 0.262 | 0.408 | - | 0.400 | 0.228 | 0.222 | 0.418 | 0.343 | 0.336 | 0.374 | 0.324 | 0.073 | 0.360 | 0.207 | 0.304 | |
| Gemini-Pro* | 0.211 | 0.230 | 0.114 | 0.146 | 0.060 | 0.095 | 0.041 | 0.160 | 0.174 | 0.177 | 0.282 | 0.030 | 0.329 | 0.144 | 0.157 | |
| Qwen-vl-max | 0.311 | 0.117 | 0.072 | 0.218 | 0.175 | 0.196 | 0.028 | 0.312 | 0.151 | 0.045 | 0.244 | 0.115 | 0.177 | 0.216 | 0.170 | |
| Qwen-vl-plus | -0.050 | 0.195 | 0.019 | 0.126 | 0.106 | 0.161 | 0.151 | 0.089 | 0.128 | 0.106 | 0.268 | 0.092 | 0.347 | -0.019 | 0.123 | |
| Qwen-vl-chat | -0.012 | -0.012 | 0.033 | -0.422 | 0.011 | -0.028 | 0.021 | 0.036 | -0.060 | 0.083 | 0.092 | -0.017 | -0.040 | 0.115 | -0.014 | |
| Pair w. Tie (↑) | CogVLM | 0.548 | 0.409 | 0.562 | 0.613 | 0.412 | 0.250 | 0.273 | 0.262 | 0.324 | 0.433 | - | - | - | - | 0.409 |
| GPT-4V | 0.696 | 0.824 | 0.847 | 0.639 | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.521 | 0.415 | 0.606 | 0.529 | 0.636 | |
| LLaVA-1.5-13b | 0.273 | 0.478 | 0.286 | 0.273 | 0.657 | 0.510 | 0.369 | 0.383 | 0.456 | 0.484 | 0.347 | 0.223 | 0.389 | 0.254 | 0.384 | |
| LLaVA-1.6-7b | 0.493 | 0.571 | 0.550 | 0.383 | 0.314 | 0.507 | 0.500 | 0.352 | 0.401 | 0.402 | 0.563 | 0.310 | 0.544 | 0.463 | 0.454 | |
| LLaVA-1.6-13b | 0.493 | 0.586 | 0.590 | 0.333 | 0.339 | 0.507 | 0.587 | 0.296 | 0.454 | 0.459 | 0.506 | 0.322 | 0.545 | 0.448 | 0.462 | |
| LLaVA-1.6-34b | 0.493 | 0.600 | 0.570 | 0.300 | 0.374 | 0.551 | 0.543 | 0.254 | 0.398 | 0.392 | 0.513 | 0.434 | 0.524 | 0.499 | 0.460 | |
| Gemini-Pro | 0.616 | 0.787 | - | 0.650 | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.370 | 0.262 | 0.190 | 0.312 | 0.509 | |
| Gemini-Pro* | 0.273 | 0.273 | 0.240 | 0.324 | 0.237 | 0.275 | 0.136 | 0.377 | 0.232 | 0.294 | 0.368 | 0.260 | 0.209 | 0.303 | 0.272 | |
| Qwen-vl-max | 0.403 | 0.464 | 0.372 | 0.494 | 0.438 | 0.500 | 0.533 | 0.479 | 0.421 | 0.421 | 0.411 | 0.392 | 0.325 | 0.474 | 0.438 | |
| Qwen-vl-plus | 0.479 | 0.507 | 0.650 | 0.450 | 0.328 | 0.522 | 0.500 | 0.380 | 0.453 | 0.383 | 0.577 | 0.321 | 0.601 | 0.457 | 0.472 | |
| Qwen-vl-chat | 0.493 | 0.486 | 0.480 | 0.311 | 0.248 | 0.406 | 0.543 | 0.310 | 0.332 | 0.292 | 0.547 | 0.298 | 0.507 | 0.478 | 0.409 | |
| Pair w.o. Tie (↑) | CogVLM | 0.654 | 0.450 | 0.643 | 0.704 | 0.481 | 0.292 | 0.500 | 0.423 | 0.500 | 0.591 | - | - | - | - | 0.524 |
| GPT-4V | 0.804 | 0.870 | 0.922 | 0.807 | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.699 | 0.647 | 0.755 | 0.659 | 0.773 | |
| LLaVA-1.5-13b | 0.327 | 0.537 | 0.302 | 0.300 | 0.726 | 0.684 | 0.600 | 0.610 | 0.648 | 0.583 | 0.449 | 0.443 | 0.498 | 0.344 | 0.504 | |
| LLaVA-1.6-7b | 0.593 | 0.597 | 0.618 | 0.434 | 0.468 | 0.636 | 0.561 | 0.471 | 0.436 | 0.466 | 0.633 | 0.621 | 0.568 | 0.705 | 0.558 | |
| LLaVA-1.6-13b | 0.614 | 0.612 | 0.663 | 0.382 | 0.487 | 0.618 | 0.659 | 0.420 | 0.503 | 0.549 | 0.576 | 0.598 | 0.565 | 0.620 | 0.562 | |
| LLaVA-1.6-34b | 0.607 | 0.824 | 0.855 | 0.402 | 0.587 | 0.750 | 0.758 | 0.381 | 0.503 | 0.564 | 0.712 | 0.679 | 0.694 | 0.762 | 0.648 | |
| Gemini-Pro | 0.717 | 0.840 | - | 0.770 | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.471 | 0.358 | 0.265 | 0.400 | 0.615 | |
| Gemini-Pro* | 0.311 | 0.340 | 0.308 | 0.419 | 0.336 | 0.366 | 0.200 | 0.439 | 0.290 | 0.358 | 0.469 | 0.336 | 0.266 | 0.398 | 0.345 | |
| Qwen-vl-max | 0.657 | 0.674 | 0.556 | 0.667 | 0.635 | 0.732 | 0.647 | 0.638 | 0.560 | 0.586 | 0.608 | 0.646 | 0.741 | 0.662 | 0.644 | |
| Qwen-vl-plus | 0.596 | 0.556 | 0.771 | 0.554 | 0.463 | 0.735 | 0.575 | 0.535 | 0.521 | 0.510 | 0.659 | 0.612 | 0.627 | 0.659 | 0.598 | |
| Qwen-vl-chat | 0.603 | 0.523 | 0.625 | 0.333 | 0.386 | 0.574 | 0.625 | 0.431 | 0.370 | 0.396 | 0.618 | 0.594 | 0.539 | 0.755 | 0.527 | |
| Batch (↓) | GPT-4V | 0.318 | 0.353 | 0.070 | 0.385 | 0.348 | 0.319 | 0.290 | 0.347 | 0.300 | 0.402 | 0.597 | 0.462 | 0.453 | 0.411 | 0.361 |
| LLaVA-1.5-13b | 0.577 | 0.492 | 0.562 | 0.535 | 0.598 | 0.650 | 0.616 | 0.644 | 0.620 | 0.563 | 0.639 | 0.563 | 0.650 | 0.652 | 0.597 | |
| LLaVA-1.6-7b | 0.575 | 0.538 | 0.618 | 0.462 | 0.601 | 0.598 | 0.564 | 0.679 | 0.586 | 0.503 | 0.507 | 0.403 | 0.525 | 0.565 | 0.552 | |
| LLaVA-1.6-13b | 0.614 | 0.612 | 0.663 | 0.382 | 0.487 | 0.618 | 0.659 | 0.420 | 0.503 | 0.549 | 0.531 | 0.415 | 0.500 | 0.557 | 0.536 | |
| LLaVA-1.6-34b | 0.449 | 0.411 | 0.500 | 0.561 | 0.575 | 0.544 | 0.483 | 0.552 | 0.542 | 0.479 | 0.529 | 0.437 | 0.500 | 0.450 | 0.501 | |
| Gemini-Pro | 0.287 | 0.299 | - | 0.473 | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.613 | 0.412 | 0.467 | 0.529 | 0.432 | |
| Gemini-Pro* | 0.378 | 0.370 | - | 0.572 | 0.508 | 0.452 | 0.417 | 0.572 | 0.492 | 0.434 | 0.636 | 0.412 | 0.489 | 0.506 | 0.480 | |
| Qwen-vl-max | 0.477 | 0.407 | 0.500 | 0.480 | 0.507 | 0.515 | 0.493 | 0.539 | 0.468 | 0.407 | 0.563 | 0.503 | 0.444 | 0.500 | 0.486 | |
| Qwen-vl-plus | 0.640 | 0.616 | 0.500 | 0.666 | 0.644 | 0.634 | 0.592 | 0.747 | 0.671 | 0.540 | 0.488 | 0.409 | 0.523 | 0.470 | 0.581 | |
| Qwen-vl-chat | 0.733 | 0.701 | 0.500 | 0.669 | 0.638 | 0.554 | 0.638 | 0.723 | 0.687 | 0.668 | 0.500 | 0.389 | 0.531 | 0.572 | 0.607 | |
Empirical Results
MLLM Judgment vs Human Annotation
- Scoring Evaluation: GPT-4V demonstrates the highest similarity to human scoring, with a similarity score of 0.557. In contrast, Gemini achieves only 0.332, with LLaVA and CogVLM scoring even lower. This discrepancy is primarily due to Gemini’s tendency to assign scores around 4 points, seldom giving 1 or 2. LLaVA and CogVLM show a similar pattern, predominantly assigning scores around 4 points. We attribute this to a ‘High-Score’ bias, akin to the ‘Yes/No’ bias, which may result from an imbalance of positive and negative judging instructions in their training data and severely limits their ability to provide fair and varied scores in the scoring setting. In comparison, GPT-4V’s scores are more evenly distributed and align closely with human preferences.
- Pair Comparison: GPT-4V outshines the other MLLMs in pair comparison, achieving 0.683 in the tie setting and 0.806 in the non-tie setting, and surpassing 0.8 on many datasets, indicating strong alignment with human preferences. Gemini, LLaVA, and CogVLM show a marked preference for declaring a clear winner, possibly due to a lack of tie situations in their training, leading to biased judgments. Interestingly, the frequency of ties given by GPT-4V closely mirrors that of human judges, suggesting similar thresholds for tie decisions.
- Batch Ranking: GPT-4V aligns most closely with human ranking results, holding a significant lead with a mean Levenshtein distance of 0.313; still, there is substantial room for improvement on this task for all MLLMs. Notably, CogVLM is unable to provide a full ranking in this setting, offering only its top choice, so it was excluded from this comparison. LLaVA also exhibits a position bias influenced by prompt structure, often replicating the judgments seen in example prompts, which undermines its ability to produce fair judgments.
MLLM Judging Consistency
To be a reliable judge, consistent decision-making across repeated evaluations of the same query is crucial. For this purpose, we conducted six repeated tests of MLLM judgments and calculated the weighted average consistency scores and Majority Consistency Criterion ratios for GPT-4V and Gemini. Despite a higher temperature setting, GPT-4V substantially outperforms Gemini across all tasks. In Pair Comparison in particular, GPT-4V achieves its highest consistency score of 0.675, but it struggles to maintain similar levels of consistency in Scoring and Batch Ranking, with scores dropping to 0.611 and 0.418 respectively, indicating the challenge of producing reliable and convincing judgments.
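The exact definitions used in the paper may differ; the sketch below is one plausible operationalization over six repeated judgments per query (the unweighted averaging and the “at least 4 of 6 agree” threshold are assumptions for illustration, as are the function names).

```python
from collections import Counter

def consistency_metrics(repeated_judgments, majority_threshold=4):
    """`repeated_judgments`: one list per query holding its six repeated judgments,
    e.g. ['A', 'A', 'Tie', 'A', 'A', 'B']. Returns (mean per-query consistency,
    fraction of queries whose modal judgment reaches the majority threshold)."""
    per_query, majority_hits = [], 0
    for judgments in repeated_judgments:
        modal_count = Counter(judgments).most_common(1)[0][1]
        per_query.append(modal_count / len(judgments))      # share of runs agreeing with the mode
        majority_hits += modal_count >= majority_threshold  # assumed Majority Consistency Criterion
    return sum(per_query) / len(per_query), majority_hits / len(repeated_judgments)
```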
Vision Perception benefits Judging
We explore the feasibility of using LLMs to judge text-based responses without directly analyzing the original images. This involves two approaches: omitting the vision information entirely, and providing a detailed description of the picture instead. Surprisingly, we find that LLMs’ performance in multimodal judging tasks improves significantly when given picture descriptions, achieving a Pearson similarity of 0.435 in the Scoring Evaluation task and markedly outperforming judgments made without any vision perception. Notably, in non-tie Pair Comparison, MLLMs provided with detailed vision descriptions even exceed the standard judging performance of MLLMs. This suggests that MLLMs may lack certain human-like judging capabilities, while LLMs can effectively judge multimodal tasks when provided with comprehensive task-related descriptions.
Human Agreement
Our manual evaluation of MLLM judging, focusing on agreement and scoring, reveals notable findings. GPT-4V achieved around 70% human agreement across all settings, excelling in the Pair Comparison task with 79.3% agreement. Specifically, GPT-4V reached 78% human agreement in Pair Comparison, with Gemini close behind at 72%, indicating strong performance on most sample pairs and supporting the idea that large models excel at pairwise distinctions (Zheng et al., 2023b), though improvements are needed in the other judging settings. In the Scoring Evaluation task, GPT-4V achieved a 70% human agreement rate, peaking at 79.9% on MS-COCO, while Gemini maintained an average rate of 67.7%. To assess the consistency of MLLM judging quality across multiple responses to a single image-instruction pair, we employed the Mean Absolute Deviation (MAD) metric, which measures the average absolute deviation of individual scores from the mean, thereby gauging quality variability. Figure 16 shows that GPT-4V exhibits lower variation in its quality assessments, indicating more consistent and reliable judgment than Gemini, which is further evidenced by its superior performance. In Batch Ranking, however, both models showed lower human agreement: GPT-4V managed 69%, and Gemini only 47%. Additionally, their analyses received lower scores, especially on complex tasks such as Math and Graphics. This suggests that the models’ inherent capabilities may not yet fully support understanding and completing intricate user instructions to provide accurate judgments.
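As a quick illustration, MAD here is simply the mean absolute deviation of the scores a judge assigns to the multiple responses for one image-instruction pair; the snippet below is a minimal sketch, not the paper’s evaluation code.

```python
def mean_absolute_deviation(scores):
    """MAD of the scores one judge assigns to the multiple responses
    for a single image-instruction pair."""
    mean = sum(scores) / len(scores)
    return sum(abs(s - mean) for s in scores) / len(scores)

# e.g. mean_absolute_deviation([4, 4, 5, 3]) == 0.5
```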
Bias and Hallucination
Egocentric Bias
It means models assign higher scores to their own responses while scoring others lower. GPT-4V exhibits a slight degree of Egocentricity. This bias contrasts with Gemini, which tends to judge each response more equitably, displaying a similar scoring distribution across different sources. Further investigation into the rationale behind GPT-4V’s self-favoring behavior indicated that its judgments align closely with its own ethical guidelines. For instance, when faced with questions involving user privacy, GPT-4V’s responses typically emphasize privacy preservation and refuse to engage, leading to higher self-scoring in these scenarios. Despite efforts in prompt engineering to encourage impartiality, these models inherently rely on their built-in judgment criteria retained from post-alignment, which can lead to a divergence from human preferences. Such a discrepancy highlights the complexity of aligning MLLM judgments with human standards.
Position Bias
It means a model consistently favors answers in specific positions, often influenced by training data that typically places correct responses at the beginning or end of prompts. Figure 4 illustrates this bias in LLaVA and CogVLM, showing a distinct preference for one particular option in Pair Comparison tasks, habitually selecting the answer in their favored position. Such bias might arise from their restricted instruction-following capabilities, making their judgments disproportionately influenced by the structure of prompts. For example, when a Batch Ranking prompt includes a sample answer sequence like ‘ABCD’, LLaVA tends to replicate this sequence in its responses with a high frequency of 88.2%, significantly more than other sequences. However, introducing multiple examples in the prompt appears to lessen this bias, as evidenced by a reduced Position Bias score of 53.3% when two examples are provided. This suggests that augmenting prompts with more examples might help guide these models to adhere more closely to the given instructions.
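One simple way to quantify this effect, sketched below under the assumption that batch-ranking outputs are collected as strings (the function and variable names are illustrative), is to measure how often a model’s ranking exactly replicates the sample sequence shown in its prompt:

```python
def position_bias_rate(ranking_outputs, example_sequence="ABCD"):
    """Fraction of batch-ranking judgments that exactly copy the sample
    sequence from the prompt -- a simple proxy for position bias."""
    copies = sum(1 for r in ranking_outputs if r.strip().upper() == example_sequence)
    return copies / len(ranking_outputs)
```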
Length Bias
Length bias means models prefer longer answers over concise but correct ones, also known as verbosity bias (Zheng et al., 2023b). As illustrated in Figure 6, both GPT-4V and Gemini are inclined to award higher scores and preference to longer content. To delve deeper into this bias, we conducted an expanded scoring experiment using GPT-4, which lacks vision perception, to semantically increase the length of answers without altering their original meaning. As shown in Figure 7, the results showed a noticeable increase in the scores assigned by GPT-4V and Gemini, averaging gains of 0.6 and 0.75 points, respectively. This finding conclusively demonstrates the presence of Verbosity Bias, suggesting that MLLMs might exploit extended text as a backdoor method to achieve higher scores.
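The reported gains correspond to a simple average score difference between the original and the semantically lengthened answers; a minimal sketch of that computation (names are illustrative) is:

```python
def average_score_gain(original_scores, lengthened_scores):
    """Mean change in judge score after answers are lengthened without
    changing their meaning; positive values indicate verbosity bias."""
    assert len(original_scores) == len(lengthened_scores)
    return sum(l - o for o, l in zip(original_scores, lengthened_scores)) / len(original_scores)
```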
Hallucination Detection and Mitigation
We observe a higher incidence of hallucinations in Batch Ranking tasks than in Pair Comparison and Scoring Evaluation, which may stem from misunderstandings of the long-term context. Delving deeper, we encounter more severe language hallucinations, including miscomprehension of textual meaning and errors in text retrieval, which significantly impact the accuracy and reliability of the final judgments. To mitigate hallucination, we apply multi-step CoT on MLLM-AS-A-JUDGE-HARD by telling the MLLMs to judge step by step, performing extra reasoning steps before the normal “Analyze-then-Judge” setting on: 1) the image-instruction pair, 2) the image, and 3) the instruction. As shown in Table 6 of the paper, hallucinations are mitigated across all settings, with extra reasoning on the image information showing the most notable improvement in both the Score and Pair tasks. Notably, in the Batch Ranking task, which involves analyzing longer texts, more reasoning steps significantly reduce hallucinations.
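A rough sketch of this multi-step judging flow is shown below, assuming a hypothetical call_mllm(prompt, image) client; the prompt wording and function signature are illustrative, not the exact prompts used in the paper.

```python
def judge_with_extra_reasoning(call_mllm, image, instruction, responses, focus="image"):
    """Multi-step 'Analyze-then-Judge' with one extra reasoning step before judging.
    `focus` selects what the extra step re-examines: 'image-instruction', 'image',
    or 'instruction'. `call_mllm(prompt, image)` is a hypothetical client function."""
    extra_prompts = {
        "image-instruction": "Describe the image and restate what the instruction asks for.",
        "image": "Describe the key visual content of the image in detail.",
        "instruction": "Restate the instruction and what a good answer must contain.",
    }
    context = call_mllm(extra_prompts[focus], image)                 # extra reasoning step
    analysis = call_mllm(
        f"Context from the previous step:\n{context}\n\n"
        f"Instruction: {instruction}\nResponses: {responses}\n"
        "Analyze each response step by step.", image)                # analyze
    return call_mllm(
        f"Based on this analysis:\n{analysis}\nGive your final judgment.", image)  # judge
```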
Details of Selected Datasets
| Dataset | Image type | Task | Ability Required | Image-Inst. Pair | Batch | Score | Pair |
|---|---|---|---|---|---|---|---|
| Conceptual Captions | Web Image | Captioning | Rec.&Comp. | 300 | 100 | 398 | 597 |
| ChartQA | Chart | Chart reasoning | Rec.&Comp. | 300 | 100 | 400 | 600 |
| InfographicVQA | Infographics | Graph reasoning | Rec.&Comp. | 300 | 100 | 398 | 573 |
| MathVista | Mathematics | Math reasoning | Rec.&Comp.&Inf. | 300 | 200 | 793 | 1185 |
| TextVQA | Text | Text reading | Rec.&Comp. | 300 | 100 | 399 | 582 |
| WIT | Multilingual text | Transcription | Rec.&Mul. | 300 | 100 | 399 | 582 |
| MS COCO | Real-life scene | Image Segmentation | Rec.&Comp. | 300 | 100 | 398 | 617 |
| DiffusionDB | Diffusion | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 299 | 300 |
| CC-3M Concept-balanced | Comprehensive | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 396 | 597 |
| VisIT-Bench | Comprehensive | instruction following | Rec.&Comp.&Inf. | 300 | 100 | 398 | 594 |
| Mind2Web | WebUI screenshot | instruction following | Rec.&Comp. | 300 | 100 | 399 | 600 |
| ScienceQA | Comprehensive | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 398 | 588 |
| AesBench | Diffusion | Image Assessment | Rec.&Comp.&Inf. | 300 | 100 | 397 | 553 |
| MM-Vet | Comprehensive | Instruction Following | Rec.&Comp.&Inf. | 214 | 70 | 259 | 336 |
BibTeX
@article{chen2024mllm,
title={MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark},
author={Chen, Dongping and Chen, Ruoxi and Zhang, Shilin and Liu, Yinuo and Wang, Yaochen and Zhou, Huichi and Zhang, Qihui and Zhou, Pan and Wan, Yao and Sun, Lichao},
journal={arXiv preprint arXiv:2402.04788},
year={2024}
}
MLLM-as-a-Judge Team