MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Zhejiang University of Technology, Lehigh University
Takeaway
- A Benchmark. We are the first to develop a comprehensive benchmark, MLLM-AS-A-JUDGE, for the multimodal domain, with human annotations to assess the judging capability of MLLMs in the tasks of Scoring Evaluation, Pair Comparison, and Batch Ranking.
- Two Datasets. We curate two human preference datasets: MLLM-AS-A-JUDGE-HQ with high-quality questions and MLLM-AS-A-JUDGE-HARD with hallucination instances. They serve as a rigorous testing ground to facilitate the development of MLLMs.
- Findings and Implications. Our evaluation of mainstream MLLMs reveals that while MLLMs align with human judgments in Pair Comparison tasks, notable discrepancies remain in Scoring Evaluation and Batch Ranking. Furthermore, MLLMs exhibit a range of biases and hallucinations, along with inconsistent judgments during the evaluation process, representing significant hurdles to establishing MLLMs as reliable judges.
Experiment Setups
- Models. We evaluate the judging performance of leading MLLMs – GPT-4V, Gemini-Pro-Vision-1.0, LLaVA-1.5-13b, LLaVA-1.6-7b/13b/34b, Qwen-VL-Plus/Max, and CogVLM – across three distinct evaluation settings. Adopting the “Analyze-then-Judge” paradigm, a one-step CoT approach, we first ask the MLLMs to analyze the responses and then provide a judgment based on their analysis.
- Metrics. After collecting the judgments produced by the MLLMs, we quantify their alignment with human annotations across the three settings, employing distinct metrics as follows:
- Scoring Evaluation: Following LLM-as-a-Judge, we compute the Pearson similarity between the MLLMs’ judgments and human ratings across the different sub-datasets.
- Pair Comparison: We measure the agreement between MLLM judgments and human decisions using accuracy, F1-score, and recall.
- Batch Ranking: We consolidate the ranking results into a single sequence and employ the normalized Levenshtein distance to evaluate the similarity between MLLM judgments and human annotations. (A minimal computation sketch of these three metrics follows this list.)
- Apart from these traditional metrics for assessing the similarity between MLLM and human judgments, we further evaluate the judgments provided by MLLMs to uncover latent bias and hallucination across 10 datasets. We also invite human annotators for further validation, focusing on the following aspects:
- Human Agreement: This involves a simple ‘yes’ or ‘no’ response to assess agreement with the MLLM judgments. While some judgments might appear reasonable, they may still be considered incorrect due to unique human perspectives. Hence, we conduct experiments on human agreement to address situations that traditional metrics may not adequately capture.
- Analysis Grading: Each MLLM analysis is assigned a score from 1 to 5, considering relevance, accuracy, creativity, and response granularity, detailed in Appendix F.
- Hallucination Detection: Given the propensity for hallucination issues in the complex reasoning chains and long-term vision-language contexts of MLLMs, we task human annotators with identifying any hallucinations in the analyses of MLLM judgments, adhering to established definitions of vision and language hallucination.
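The following is a minimal sketch of how the three alignment metrics above can be computed, assuming the human annotations and MLLM judgments for one sub-dataset are stored as parallel Python lists; the function names and data layout are illustrative, not the benchmark’s actual evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, recall_score

def scoring_alignment(human_scores, mllm_scores):
    """Scoring Evaluation: Pearson similarity between human ratings and MLLM scores."""
    r, _ = pearsonr(human_scores, mllm_scores)
    return r

def pair_alignment(human_choices, mllm_choices):
    """Pair Comparison: accuracy / F1 / recall of MLLM decisions against human decisions.
    Choices are labels such as 'A', 'B' (and 'Tie' in the with-tie setting)."""
    return {
        "accuracy": accuracy_score(human_choices, mllm_choices),
        "f1": f1_score(human_choices, mllm_choices, average="macro"),
        "recall": recall_score(human_choices, mllm_choices, average="macro"),
    }

def normalized_levenshtein(a, b):
    """Batch Ranking: normalized edit distance between two ranking strings,
    e.g. 'ABCD' (human) vs. 'ACBD' (MLLM); lower is better."""
    m, n = len(a), len(b)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)
    dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i, j] = min(dp[i - 1, j] + 1,                           # deletion
                           dp[i, j - 1] + 1,                           # insertion
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dp[m, n] / max(m, n)
```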
| Settings | MLLM | COCO | C.C. | Diff. | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | M2W | SciQA | Aes | MM-Vet | Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | LLaVA-1.5-13b | 0.247 | 0.227 | 0.060 | 0.242 | 0.093 | 0.245 | 0.109 | 0.237 | 0.177 | 0.071 | 0.424 | 0.279 | 0.414 | 0.322 | 0.225 |
| LLaVA-1.6-34b | 0.285 | 0.251 | -0.012 | 0.262 | 0.238 | 0.258 | 0.151 | 0.318 | 0.198 | 0.109 | 0.022 | 0.206 | 0.025 | 0.265 | 0.184 | |
| Gemini | 0.262 | 0.408 | - | 0.400 | 0.228 | 0.222 | 0.418 | 0.343 | 0.336 | 0.374 | 0.324 | 0.073 | 0.360 | 0.207 | 0.304 | |
| GPT-4V | 0.454 | 0.507 | 0.458 | 0.645 | 0.606 | 0.624 | 0.579 | 0.645 | 0.620 | 0.431 | 0.185 | 0.383 | 0.401 | 0.326 | 0.490 | |
| Qwen-vl-max | 0.311 | 0.117 | 0.072 | 0.218 | 0.175 | 0.196 | 0.028 | 0.312 | 0.151 | 0.045 | 0.244 | 0.115 | 0.177 | 0.216 | 0.170 | |
| Pair w. Tie (↑) | LLaVA-1.5-13b | 0.273 | 0.478 | 0.286 | 0.273 | 0.657 | 0.510 | 0.369 | 0.383 | 0.456 | 0.484 | 0.347 | 0.223 | 0.389 | 0.254 | 0.384 |
| LLaVA-1.6-34b | 0.493 | 0.600 | 0.570 | 0.300 | 0.374 | 0.551 | 0.543 | 0.254 | 0.398 | 0.392 | 0.513 | 0.434 | 0.524 | 0.499 | 0.460 | |
| Gemini | 0.616 | 0.787 | - | 0.650 | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.370 | 0.262 | 0.190 | 0.312 | 0.509 | |
| GPT-4V | 0.696 | 0.824 | 0.847 | 0.639 | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.521 | 0.415 | 0.606 | 0.529 | 0.636 | |
| Qwen-vl-max | 0.403 | 0.464 | 0.372 | 0.494 | 0.438 | 0.500 | 0.533 | 0.479 | 0.421 | 0.421 | 0.411 | 0.392 | 0.325 | 0.474 | 0.438 | |
| Pair w.o. Tie (↑) | LLaVA-1.5-13b | 0.327 | 0.537 | 0.302 | 0.300 | 0.726 | 0.684 | 0.600 | 0.610 | 0.648 | 0.583 | 0.449 | 0.443 | 0.498 | 0.344 | 0.504 |
| LLaVA-1.6-34b | 0.607 | 0.824 | 0.855 | 0.402 | 0.587 | 0.750 | 0.758 | 0.381 | 0.503 | 0.564 | 0.712 | 0.679 | 0.694 | 0.762 | 0.648 | |
| Gemini | 0.717 | 0.840 | - | 0.770 | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.471 | 0.358 | 0.265 | 0.400 | 0.615 | |
| GPT-4V | 0.804 | 0.870 | 0.922 | 0.807 | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.699 | 0.647 | 0.755 | 0.659 | 0.773 | |
| Qwen-vl-max | 0.657 | 0.674 | 0.556 | 0.667 | 0.635 | 0.732 | 0.647 | 0.638 | 0.560 | 0.586 | 0.608 | 0.646 | 0.741 | 0.662 | 0.644 | |
| Batch (↓) | LLaVA-1.5-13b | 0.577 | 0.492 | 0.562 | 0.535 | 0.598 | 0.650 | 0.616 | 0.644 | 0.620 | 0.563 | 0.639 | 0.563 | 0.650 | 0.652 | 0.597 |
| LLaVA-1.6-34b | 0.449 | 0.411 | 0.500 | 0.561 | 0.575 | 0.544 | 0.483 | 0.552 | 0.542 | 0.479 | 0.529 | 0.437 | 0.500 | 0.450 | 0.501 | |
| Gemini | 0.287 | 0.299 | - | 0.473 | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.613 | 0.412 | 0.467 | 0.529 | 0.432 | |
| GPT-4V | 0.318 | 0.353 | 0.070 | 0.385 | 0.348 | 0.319 | 0.290 | 0.347 | 0.300 | 0.402 | 0.597 | 0.462 | 0.453 | 0.411 | 0.361 | |
| Qwen-vl-max | 0.477 | 0.407 | 0.500 | 0.480 | 0.507 | 0.515 | 0.493 | 0.539 | 0.468 | 0.407 | 0.563 | 0.503 | 0.444 | 0.500 | 0.486 | |
| Settings | MLLM | COCO | C.C. | Diffusion | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | Gemini | 0.783 | 0.739 | - | 0.618 | 0.536 | 0.621 | 0.749 | 0.630 | 0.712 | 0.702 | 0.677 |
| GPT-4V | 0.799 | 0.725 | 0.506 | 0.688 | 0.638 | 0.706 | 0.714 | 0.676 | 0.779 | 0.754 | 0.699 | |
| Pair (↑) | Gemini | 0.705 | 0.833 | - | 0.733 | 0.520 | 0.717 | 0.827 | 0.620 | 0.853 | 0.703 | 0.724 |
| GPT-4V | 0.821 | 0.926 | 0.873 | 0.794 | 0.618 | 0.752 | 0.790 | 0.796 | 0.797 | 0.766 | 0.793 | |
| Batch (↓) | Gemini | 0.642 | 0.639 | - | 0.333 | 0.330 | 0.473 | 0.511 | 0.315 | 0.422 | 0.554 | 0.469 |
| GPT-4V | 0.663 | 0.639 | 0.912 | 0.536 | 0.475 | 0.615 | 0.641 | 0.640 | 0.622 | 0.467 | 0.621 | |
| Settings | MLLM | COCO | C.C. | Diffusion | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | GPT-4V | 0.454 | 0.507 | 0.458 | 0.645 | 0.606 | 0.624 | 0.579 | 0.645 | 0.620 | 0.431 | 0.557 |
| GPT-4V (+CoT) | 0.246 | 0.165 | 0.192 | 0.385 | 0.397 | 0.400 | 0.298 | 0.443 | 0.423 | 0.038 | 0.299 | |
| Gemini | 0.262 | 0.408 | - | 0.400 | 0.228 | 0.222 | 0.418 | 0.343 | 0.336 | 0.374 | 0.299 | |
| Gemini (+CoT) | 0.127 | 0.068 | 0.117 | 0.220 | 0.132 | 0.182 | 0.105 | 0.140 | 0.222 | 0.128 | 0.144 | |
| Pair w. Tie (↑) | GPT-4V | 0.696 | 0.824 | 0.847 | 0.639 | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.683 |
| GPT-4V (+CoT) | 0.507 | 0.657 | 0.561 | 0.601 | 0.515 | 0.580 | 0.489 | 0.521 | 0.646 | 0.553 | 0.563 | |
| Gemini | 0.616 | 0.787 | - | 0.650 | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.609 | |
| Gemini (+CoT) | 0.233 | 0.239 | 0.420 | 0.207 | 0.284 | 0.329 | 0.352 | 0.357 | 0.247 | 0.239 | 0.291 | |
| Pair w.o. Tie (↑) | GPT-4V | 0.804 | 0.870 | 0.922 | 0.807 | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.806 |
| GPT-4V (+CoT) | 0.673 | 0.821 | 0.845 | 0.707 | 0.738 | 0.787 | 0.548 | 0.756 | 0.753 | 0.654 | 0.728 | |
| Gemini | 0.717 | 0.840 | - | 0.770 | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.723 | |
| Gemini (+CoT) | 0.267 | 0.275 | 0.573 | 0.264 | 0.414 | 0.424 | 0.427 | 0.511 | 0.299 | 0.319 | 0.377 | |
| Batch (↓) | GPT-4V | 0.323 | 0.344 | 0.092 | 0.401 | 0.367 | 0.341 | 0.302 | 0.364 | 0.313 | 0.407 | 0.325 |
| GPT-4V (+CoT) | 0.428 | 0.416 | - | 0.427 | 0.434 | 0.401 | 0.366 | 0.406 | 0.422 | 0.472 | 0.419 | |
| Gemini | 0.287 | 0.299 | - | 0.473 | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.400 | |
| Gemini (+CoT) | 0.441 | 0.481 | 0.542 | 0.595 | 0.494 | 0.533 | 0.483 | 0.569 | 0.486 | 0.463 | 0.509 | |
| MLLM | Settings | Score (↑) Pearson | Pair w. Tie (↑) | Pair w.o. Tie (↑) | Batch (↓) Edit Dis. |
|---|---|---|---|---|---|
| LLaMA2-70b | Vision Exp | 0.060 | 0.404 | 0.550 | 0.643 |
| | No Vision | 0.126 | 0.374 | 0.537 | 0.583 |
| Mixtral-8x7b | Vision Exp | 0.054 | 0.374 | 0.543 | 0.603 |
| | No Vision | 0.151 | 0.478 | 0.731 | 0.546 |
| GPT-3.5 | Vision Exp | 0.154 | 0.453 | 0.591 | 0.473 |
| | No Vision | 0.223 | 0.459 | 0.644 | 0.504 |
| GPT-4V | Vision Exp | 0.435 | 0.544 | 0.878 | 0.400 |
| | No Vision | 0.299 | 0.491 | 0.868 | 0.394 |
| Gemini | Vision Exp | 0.120 | 0.438 | 0.785 | 0.472 |
| | No Vision | 0.108 | 0.433 | 0.758 | 0.470 |
| Settings | MLLM | COCO | C.C. | Diff. | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | M2W | SciQA | Aes | MM-Vet | Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score (↑) | CogVLM | 0.107 | -0.048 | 0.049 | -0.158 | 0.065 | 0.097 | -0.131 | -0.135 | 0.278 | 0.157 | - | - | - | - | 0.028 |
| GPT-4V | 0.454 | 0.507 | 0.458 | 0.645 | 0.606 | 0.624 | 0.579 | 0.645 | 0.620 | 0.431 | 0.185 | 0.383 | 0.401 | 0.326 | 0.490 | |
| LLaVA-1.5-13b | 0.247 | 0.227 | 0.060 | 0.242 | 0.093 | 0.245 | 0.109 | 0.237 | 0.177 | 0.071 | 0.424 | 0.279 | 0.414 | 0.322 | 0.225 | |
| LLaVA-1.6-7b | 0.300 | 0.243 | 0.058 | 0.200 | 0.090 | 0.193 | 0.044 | 0.085 | 0.228 | 0.026 | 0.299 | 0.156 | 0.148 | 0.171 | 0.160 | |
| LLaVA-1.6-13b | 0.289 | 0.226 | -0.110 | 0.078 | 0.056 | 0.086 | 0.062 | 0.120 | 0.163 | 0.200 | 0.140 | 0.136 | 0.163 | 0.183 | 0.128 | |
| LLaVA-1.6-34b | 0.285 | 0.251 | -0.012 | 0.262 | 0.238 | 0.258 | 0.151 | 0.318 | 0.198 | 0.109 | 0.022 | 0.206 | 0.025 | 0.265 | 0.184 | |
| Gemini-Pro | 0.262 | 0.408 | - | 0.400 | 0.228 | 0.222 | 0.418 | 0.343 | 0.336 | 0.374 | 0.324 | 0.073 | 0.360 | 0.207 | 0.304 | |
| Gemini-Pro* | 0.211 | 0.230 | 0.114 | 0.146 | 0.060 | 0.095 | 0.041 | 0.160 | 0.174 | 0.177 | 0.282 | 0.030 | 0.329 | 0.144 | 0.157 | |
| Qwen-vl-max | 0.311 | 0.117 | 0.072 | 0.218 | 0.175 | 0.196 | 0.028 | 0.312 | 0.151 | 0.045 | 0.244 | 0.115 | 0.177 | 0.216 | 0.170 | |
| Qwen-vl-plus | -0.050 | 0.195 | 0.019 | 0.126 | 0.106 | 0.161 | 0.151 | 0.089 | 0.128 | 0.106 | 0.268 | 0.092 | 0.347 | -0.019 | 0.123 | |
| Qwen-vl-chat | -0.012 | -0.012 | 0.033 | -0.422 | 0.011 | -0.028 | 0.021 | 0.036 | -0.060 | 0.083 | 0.092 | -0.017 | -0.040 | 0.115 | -0.014 | |
| Pair w. Tie (↑) | CogVLM | 0.548 | 0.409 | 0.562 | 0.613 | 0.412 | 0.250 | 0.273 | 0.262 | 0.324 | 0.433 | - | - | - | - | 0.409 |
| GPT-4V | 0.696 | 0.824 | 0.847 | 0.639 | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.521 | 0.415 | 0.606 | 0.529 | 0.636 | |
| LLaVA-1.5-13b | 0.273 | 0.478 | 0.286 | 0.273 | 0.657 | 0.510 | 0.369 | 0.383 | 0.456 | 0.484 | 0.347 | 0.223 | 0.389 | 0.254 | 0.384 | |
| LLaVA-1.6-7b | 0.493 | 0.571 | 0.550 | 0.383 | 0.314 | 0.507 | 0.500 | 0.352 | 0.401 | 0.402 | 0.563 | 0.310 | 0.544 | 0.463 | 0.454 | |
| LLaVA-1.6-13b | 0.493 | 0.586 | 0.590 | 0.333 | 0.339 | 0.507 | 0.587 | 0.296 | 0.454 | 0.459 | 0.506 | 0.322 | 0.545 | 0.448 | 0.462 | |
| LLaVA-1.6-34b | 0.493 | 0.600 | 0.570 | 0.300 | 0.374 | 0.551 | 0.543 | 0.254 | 0.398 | 0.392 | 0.513 | 0.434 | 0.524 | 0.499 | 0.460 | |
| Gemini-Pro | 0.616 | 0.787 | - | 0.650 | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.370 | 0.262 | 0.190 | 0.312 | 0.509 | |
| Gemini-Pro* | 0.273 | 0.273 | 0.240 | 0.324 | 0.237 | 0.275 | 0.136 | 0.377 | 0.232 | 0.294 | 0.368 | 0.260 | 0.209 | 0.303 | 0.272 | |
| Qwen-vl-max | 0.403 | 0.464 | 0.372 | 0.494 | 0.438 | 0.500 | 0.533 | 0.479 | 0.421 | 0.421 | 0.411 | 0.392 | 0.325 | 0.474 | 0.438 | |
| Qwen-vl-plus | 0.479 | 0.507 | 0.650 | 0.450 | 0.328 | 0.522 | 0.500 | 0.380 | 0.453 | 0.383 | 0.577 | 0.321 | 0.601 | 0.457 | 0.472 | |
| Qwen-vl-chat | 0.493 | 0.486 | 0.480 | 0.311 | 0.248 | 0.406 | 0.543 | 0.310 | 0.332 | 0.292 | 0.547 | 0.298 | 0.507 | 0.478 | 0.409 | |
| Pair w.o. Tie (↑) | CogVLM | 0.654 | 0.450 | 0.643 | 0.704 | 0.481 | 0.292 | 0.500 | 0.423 | 0.500 | 0.591 | - | - | - | - | 0.524 |
| GPT-4V | 0.804 | 0.870 | 0.922 | 0.807 | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.699 | 0.647 | 0.755 | 0.659 | 0.773 | |
| LLaVA-1.5-13b | 0.327 | 0.537 | 0.302 | 0.300 | 0.726 | 0.684 | 0.600 | 0.610 | 0.648 | 0.583 | 0.449 | 0.443 | 0.498 | 0.344 | 0.504 | |
| LLaVA-1.6-7b | 0.593 | 0.597 | 0.618 | 0.434 | 0.468 | 0.636 | 0.561 | 0.471 | 0.436 | 0.466 | 0.633 | 0.621 | 0.568 | 0.705 | 0.558 | |
| LLaVA-1.6-13b | 0.614 | 0.612 | 0.663 | 0.382 | 0.487 | 0.618 | 0.659 | 0.420 | 0.503 | 0.549 | 0.576 | 0.598 | 0.565 | 0.620 | 0.562 | |
| LLaVA-1.6-34b | 0.607 | 0.824 | 0.855 | 0.402 | 0.587 | 0.750 | 0.758 | 0.381 | 0.503 | 0.564 | 0.712 | 0.679 | 0.694 | 0.762 | 0.648 | |
| Gemini-Pro | 0.717 | 0.840 | - | 0.770 | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.471 | 0.358 | 0.265 | 0.400 | 0.615 | |
| Gemini-Pro* | 0.311 | 0.340 | 0.308 | 0.419 | 0.336 | 0.366 | 0.200 | 0.439 | 0.290 | 0.358 | 0.469 | 0.336 | 0.266 | 0.398 | 0.345 | |
| Qwen-vl-max | 0.657 | 0.674 | 0.556 | 0.667 | 0.635 | 0.732 | 0.647 | 0.638 | 0.560 | 0.586 | 0.608 | 0.646 | 0.741 | 0.662 | 0.644 | |
| Qwen-vl-plus | 0.596 | 0.556 | 0.771 | 0.554 | 0.463 | 0.735 | 0.575 | 0.535 | 0.521 | 0.510 | 0.659 | 0.612 | 0.627 | 0.659 | 0.598 | |
| Qwen-vl-chat | 0.603 | 0.523 | 0.625 | 0.333 | 0.386 | 0.574 | 0.625 | 0.431 | 0.370 | 0.396 | 0.618 | 0.594 | 0.539 | 0.755 | 0.527 | |
| Batch (↓) | GPT-4V | 0.318 | 0.353 | 0.070 | 0.385 | 0.348 | 0.319 | 0.290 | 0.347 | 0.300 | 0.402 | 0.597 | 0.462 | 0.453 | 0.411 | 0.361 |
| LLaVA-1.5-13b | 0.577 | 0.492 | 0.562 | 0.535 | 0.598 | 0.650 | 0.616 | 0.644 | 0.620 | 0.563 | 0.639 | 0.563 | 0.650 | 0.652 | 0.597 | |
| LLaVA-1.6-7b | 0.575 | 0.538 | 0.618 | 0.462 | 0.601 | 0.598 | 0.564 | 0.679 | 0.586 | 0.503 | 0.507 | 0.403 | 0.525 | 0.565 | 0.552 | |
| LLaVA-1.6-13b | 0.614 | 0.612 | 0.663 | 0.382 | 0.487 | 0.618 | 0.659 | 0.420 | 0.503 | 0.549 | 0.531 | 0.415 | 0.500 | 0.557 | 0.536 | |
| LLaVA-1.6-34b | 0.449 | 0.411 | 0.500 | 0.561 | 0.575 | 0.544 | 0.483 | 0.552 | 0.542 | 0.479 | 0.529 | 0.437 | 0.500 | 0.450 | 0.501 | |
| Gemini-Pro | 0.287 | 0.299 | - | 0.473 | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.613 | 0.412 | 0.467 | 0.529 | 0.432 | |
| Gemini-Pro* | 0.378 | 0.370 | - | 0.572 | 0.508 | 0.452 | 0.417 | 0.572 | 0.492 | 0.434 | 0.636 | 0.412 | 0.489 | 0.506 | 0.480 | |
| Qwen-vl-max | 0.477 | 0.407 | 0.500 | 0.480 | 0.507 | 0.515 | 0.493 | 0.539 | 0.468 | 0.407 | 0.563 | 0.503 | 0.444 | 0.500 | 0.486 | |
| Qwen-vl-plus | 0.640 | 0.616 | 0.500 | 0.666 | 0.644 | 0.634 | 0.592 | 0.747 | 0.671 | 0.540 | 0.488 | 0.409 | 0.523 | 0.470 | 0.581 | |
| Qwen-vl-chat | 0.733 | 0.701 | 0.500 | 0.669 | 0.638 | 0.554 | 0.638 | 0.723 | 0.687 | 0.668 | 0.500 | 0.389 | 0.531 | 0.572 | 0.607 | |
Empirical Results
MLLM Judgment vs Human Annotation
- Scoring Evaluation: GPT-4V demonstrates the highest similarity to human scoring, with a similarity score of 0.557. In contrast, Gemini achieves only 0.332, with LLaVA and CogVLM scoring even lower. This discrepancy is primarily due to Gemini’s tendency to assign scores around 4 points, seldom giving 1 or 2. LLaVA and CogVLM show a similar pattern, predominantly assigning scores around 4 points. We attribute this to a ‘High-Score’ bias, akin to the ‘Yes/No’ bias, which may result from an imbalance of positive and negative judging instructions in their training data and severely limits their ability to provide fair and varied scores in the scoring setting. In comparison, GPT-4V’s scores are more evenly distributed and align closely with human preferences.
- Pair Comparison: GPT-4V outshines the other MLLMs in pair comparison, achieving 0.683 in the tie setting and 0.806 in the non-tie setting, and surpassing 0.8 on many datasets, indicating strong alignment with human preferences. Gemini, LLaVA, and CogVLM show a marked preference for declaring a clear winner, possibly due to a lack of tie situations in their training, leading to biased judgments. Interestingly, the frequency of ties given by GPT-4V closely mirrors that of human judges, suggesting similar thresholds for tie decisions.
- Batch Ranking: GPT-4V aligns most closely with human ranking results, holding a significant lead with a mean Levenshtein distance of 0.313; still, there is substantial room for improvement on this task for all MLLMs. Notably, CogVLM is unable to provide a full ranking in this setting, offering only its top choice, so it was excluded from this comparison. LLaVA also exhibits a position bias influenced by prompt structure, often replicating the judgments seen in example prompts, which undermines its ability to produce fair judgments.
MLLM Judging Consistency
To be a reliable judge, consistent decision-making across repeated evaluations of the same query is crucial. For this purpose, we conducted six repeated tests of MLLM judgments and calculated the weighted average consistency scores and Majority Consistency Criterion ratios for GPT-4V and Gemini. Despite a higher temperature setting, GPT-4V substantially outperforms Gemini across all tasks. In Pair Comparison in particular, GPT-4V achieves its highest consistency score of 0.675, but it struggles to maintain similar levels of consistency in Scoring and Batch Ranking, with scores dropping to 0.611 and 0.418 respectively, indicating the challenge of producing reliable and convincing judgments.
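The exact definitions used in the paper may differ; the sketch below is one plausible operationalization over six repeated judgments per query (the unweighted averaging and the “at least 4 of 6 agree” threshold are assumptions for illustration, as are the function names).

```python
from collections import Counter

def consistency_metrics(repeated_judgments, majority_threshold=4):
    """`repeated_judgments`: one list per query holding its six repeated judgments,
    e.g. ['A', 'A', 'Tie', 'A', 'A', 'B']. Returns (mean per-query consistency,
    fraction of queries whose modal judgment reaches the majority threshold)."""
    per_query, majority_hits = [], 0
    for judgments in repeated_judgments:
        modal_count = Counter(judgments).most_common(1)[0][1]
        per_query.append(modal_count / len(judgments))      # share of runs agreeing with the mode
        majority_hits += modal_count >= majority_threshold  # assumed Majority Consistency Criterion
    return sum(per_query) / len(per_query), majority_hits / len(repeated_judgments)
```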
Vision Perception benefits Judging
We explore the feasibility of using LLMs to judge text-based responses without directly analyzing the original images. This involves two approaches: omitting the vision information entirely, and providing a detailed description of the picture instead. Surprisingly, we find that LLMs’ performance in multimodal judging tasks improves significantly when given picture descriptions, achieving a Pearson similarity of 0.435 in the Scoring Evaluation task and markedly outperforming judgments made without any vision perception. Notably, in non-tie Pair Comparison, MLLMs provided with detailed vision descriptions even exceed the standard judging performance of MLLMs. This suggests that MLLMs may lack certain human-like judging capabilities, while LLMs can effectively judge multimodal tasks when provided with comprehensive task-related descriptions.
Human Agreement
Our manual evaluation of MLLM judging, focusing on agreement and scoring, reveals notable findings. GPT-4V achieved around 70% human agreement across all settings, excelling in the Pair Comparison task with 79.3% agreement. Specifically, GPT-4V reached 78% human agreement in Pair Comparison, with Gemini close behind at 72%, indicating strong performance on most sample pairs and supporting the idea that large models excel at pairwise distinctions (Zheng et al., 2023b), though improvements are needed in the other judging settings. In the Scoring Evaluation task, GPT-4V achieved a 70% human agreement rate, peaking at 79.9% on MS-COCO, while Gemini maintained an average rate of 67.7%. To assess the consistency of MLLM judging quality across multiple responses to a single image-instruction pair, we employed the Mean Absolute Deviation (MAD) metric, which measures the average absolute deviation of individual scores from the mean, thereby gauging quality variability. Figure 16 shows that GPT-4V exhibits lower variation in its quality assessments, indicating more consistent and reliable judgment than Gemini, which is further evidenced by its superior performance. In Batch Ranking, however, both models showed lower human agreement: GPT-4V managed 69%, and Gemini only 47%. Additionally, their analyses received lower scores, especially on complex tasks such as Math and Graphics. This suggests that the models’ inherent capabilities may not yet fully support understanding and completing intricate user instructions to provide accurate judgments.
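As a quick illustration, MAD here is simply the mean absolute deviation of the scores a judge assigns to the multiple responses for one image-instruction pair; the snippet below is a minimal sketch, not the paper’s evaluation code.

```python
def mean_absolute_deviation(scores):
    """MAD of the scores one judge assigns to the multiple responses
    for a single image-instruction pair."""
    mean = sum(scores) / len(scores)
    return sum(abs(s - mean) for s in scores) / len(scores)

# e.g. mean_absolute_deviation([4, 4, 5, 3]) == 0.5
```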
Bias and Hallucination
Egocentric Bias
It means models assign higher scores to their own responses while scoring others lower. GPT-4V exhibits a slight degree of Egocentricity. This bias contrasts with Gemini, which tends to judge each response more equitably, displaying a similar scoring distribution across different sources. Further investigation into the rationale behind GPT-4V’s self-favoring behavior indicated that its judgments align closely with its own ethical guidelines. For instance, when faced with questions involving user privacy, GPT-4V’s responses typically emphasize privacy preservation and refuse to engage, leading to higher self-scoring in these scenarios. Despite efforts in prompt engineering to encourage impartiality, these models inherently rely on their built-in judgment criteria retained from post-alignment, which can lead to a divergence from human preferences. Such a discrepancy highlights the complexity of aligning MLLM judgments with human standards.
Position Bias
It means a model consistently favors answers in specific positions, often influenced by training data that typically places correct responses at the beginning or end of prompts. Figure 4 illustrates this bias in LLaVA and CogVLM, showing a distinct preference for one particular option in Pair Comparison tasks, habitually selecting the answer in their favored position. Such bias might arise from their restricted instruction-following capabilities, making their judgments disproportionately influenced by the structure of prompts. For example, when a Batch Ranking prompt includes a sample answer sequence like ‘ABCD’, LLaVA tends to replicate this sequence in its responses with a high frequency of 88.2%, significantly more than other sequences. However, introducing multiple examples in the prompt appears to lessen this bias, as evidenced by a reduced Position Bias score of 53.3% when two examples are provided. This suggests that augmenting prompts with more examples might help guide these models to adhere more closely to the given instructions.
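One simple way to quantify this effect, sketched below under the assumption that batch-ranking outputs are collected as strings (the function and variable names are illustrative), is to measure how often a model’s ranking exactly replicates the sample sequence shown in its prompt:

```python
def position_bias_rate(ranking_outputs, example_sequence="ABCD"):
    """Fraction of batch-ranking judgments that exactly copy the sample
    sequence from the prompt -- a simple proxy for position bias."""
    copies = sum(1 for r in ranking_outputs if r.strip().upper() == example_sequence)
    return copies / len(ranking_outputs)
```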
Length Bias
Length bias means models prefer longer answers over concise but correct ones, also known as verbosity bias (Zheng et al., 2023b). As illustrated in Figure 6, both GPT-4V and Gemini are inclined to award higher scores and preference to longer content. To delve deeper into this bias, we conducted an expanded scoring experiment using GPT-4, which lacks vision perception, to semantically increase the length of answers without altering their original meaning. As shown in Figure 7, the results showed a noticeable increase in the scores assigned by GPT-4V and Gemini, averaging gains of 0.6 and 0.75 points, respectively. This finding conclusively demonstrates the presence of Verbosity Bias, suggesting that MLLMs might exploit extended text as a backdoor method to achieve higher scores.
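The reported gains correspond to a simple average score difference between the original and the semantically lengthened answers; a minimal sketch of that computation (names are illustrative) is:

```python
def average_score_gain(original_scores, lengthened_scores):
    """Mean change in judge score after answers are lengthened without
    changing their meaning; positive values indicate verbosity bias."""
    assert len(original_scores) == len(lengthened_scores)
    return sum(l - o for o, l in zip(original_scores, lengthened_scores)) / len(original_scores)
```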
Hallucination Detection and Mitigation
We observe a higher incidence of hallucinations in Batch Ranking tasks than in Pair Comparison and Scoring Evaluation, which may stem from misunderstandings of the long-term context. Delving deeper, we encounter more severe language hallucinations, including miscomprehension of textual meaning and errors in text retrieval, which significantly impact the accuracy and reliability of the final judgments. To mitigate hallucination, we apply multi-step CoT on MLLM-AS-A-JUDGE-HARD by telling the MLLMs to judge step by step, performing extra reasoning steps before the normal “Analyze-then-Judge” setting on: 1) the image-instruction pair, 2) the image, and 3) the instruction. As shown in Table 6 of the paper, hallucinations are mitigated across all settings, with extra reasoning on the image information showing the most notable improvement in both the Score and Pair tasks. Notably, in the Batch Ranking task, which involves analyzing longer texts, more reasoning steps significantly reduce hallucinations.
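A rough sketch of this multi-step judging flow is shown below, assuming a hypothetical call_mllm(prompt, image) client; the prompt wording and function signature are illustrative, not the exact prompts used in the paper.

```python
def judge_with_extra_reasoning(call_mllm, image, instruction, responses, focus="image"):
    """Multi-step 'Analyze-then-Judge' with one extra reasoning step before judging.
    `focus` selects what the extra step re-examines: 'image-instruction', 'image',
    or 'instruction'. `call_mllm(prompt, image)` is a hypothetical client function."""
    extra_prompts = {
        "image-instruction": "Describe the image and restate what the instruction asks for.",
        "image": "Describe the key visual content of the image in detail.",
        "instruction": "Restate the instruction and what a good answer must contain.",
    }
    context = call_mllm(extra_prompts[focus], image)                 # extra reasoning step
    analysis = call_mllm(
        f"Context from the previous step:\n{context}\n\n"
        f"Instruction: {instruction}\nResponses: {responses}\n"
        "Analyze each response step by step.", image)                # analyze
    return call_mllm(
        f"Based on this analysis:\n{analysis}\nGive your final judgment.", image)  # judge
```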
Details of Selected Datasets
| Dataset | Image type | Task | Ability Required | Image-Inst. Pair | Batch | Score | Pair |
|---|---|---|---|---|---|---|---|
| Conceptual Captions | Web Image | Captioning | Rec.&Comp. | 300 | 100 | 398 | 597 |
| ChartQA | Chart | Chart reasoning | Rec.&Comp. | 300 | 100 | 400 | 600 |
| InfographicVQA | Infographics | Graph reasoning | Rec.&Comp. | 300 | 100 | 398 | 573 |
| MathVista | Mathematics | Math reasoning | Rec.&Comp.&Inf. | 300 | 200 | 793 | 1185 |
| TextVQA | Text | Text reading | Rec.&Comp. | 300 | 100 | 399 | 582 |
| WIT | Multilingual text | Transcription | Rec.&Mul. | 300 | 100 | 399 | 582 |
| MS COCO | Real-life scene | Image Segmentation | Rec.&Comp. | 300 | 100 | 398 | 617 |
| DiffusionDB | Diffusion | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 299 | 300 |
| CC-3M Concept-balanced | Comprehensive | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 396 | 597 |
| VisIT-Bench | Comprehensive | instruction following | Rec.&Comp.&Inf. | 300 | 100 | 398 | 594 |
| Mind2Web | WebUI screenshot | instruction following | Rec.&Comp. | 300 | 100 | 399 | 600 |
| ScienceQA | Comprehensive | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 398 | 588 |
| AesBench | Diffusion | Image Assessment | Rec.&Comp.&Inf. | 300 | 100 | 397 | 553 |
| MM-Vet | Comprehensive | Instruction Following | Rec.&Comp.&Inf. | 214 | 70 | 259 | 336 |
BibTeX
@article{chen2024mllm,
title={MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark},
author={Chen, Dongping and Chen, Ruoxi and Zhang, Shilin and Liu, Yinuo and Wang, Yaochen and Zhou, Huichi and Zhang, Qihui and Zhou, Pan and Wan, Yao and Sun, Lichao},
journal={arXiv preprint arXiv:2402.04788},
year={2024}
}
MLLM-as-a-Judge Team