Experiment Results

To evaluate the mathematical reasoning robustness of existing VLMs on DynaMath, we generate 10 variants of each seed question, for a total of 5,010 questions.

Average-case Accuracy

The table below shows the average-case accuracy of 14 models (three closed-source Large Multimodal Models (LMMs) and 11 Vision Language Models (VLMs)) on DynaMath with 5,010 generated questions. Question topics (PG, SG, etc.) and difficulty levels (EL, HI, UN) are defined in the previous table.
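As a rough illustration of how an average-case score can be computed, the sketch below assumes that average-case accuracy is the fraction of correct answers over all generated variants of all seed questions. The function name `average_case_accuracy` and the nested-list layout of `results` are hypothetical choices for this example, not part of any released DynaMath tooling, and the toy data is made up.

```python
from typing import Sequence


def average_case_accuracy(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of correct answers over every variant of every seed question.

    correct[i][j] is True when variant j of seed question i was answered
    correctly (hypothetical data layout, assumed for this sketch).
    """
    total = sum(len(row) for row in correct)
    if total == 0:
        raise ValueError("no results to score")
    hits = sum(1 for row in correct for is_right in row if is_right)
    return hits / total


# Toy data: 3 seed questions with 3 variants each (not real benchmark results).
results = [
    [True, True, True],     # every variant answered correctly
    [True, False, True],    # one variant missed
    [False, False, False],  # every variant missed
]
avg_acc = average_case_accuracy(results)  # 5 correct answers out of 9
```

Under this assumption, each of the 5,010 generated questions contributes equally to the score, so a model gets partial credit for a seed question even when it fails some of its variants.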

Worst-case Accuracy

The table below shows the worst-case accuracy of 14 models (three closed-source Large Multimodal Models (LMMs) and 11 Vision Language Models (VLMs)) on DynaMath with 5,010 generated questions. Question topics (PG, SG, etc.) and difficulty levels (EL, HI, UN) are defined in the previous table.
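A worst-case score can be sketched in a similar spirit. The example below assumes that a seed question counts as solved only if the model answers all of its variants correctly, and that worst-case accuracy is the fraction of seed questions solved in that sense. The name `worst_case_accuracy` and the layout of `variant_results` are hypothetical, and the toy data is made up.

```python
from typing import Sequence


def worst_case_accuracy(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of seed questions whose variants were all answered correctly.

    correct[i][j] is True when variant j of seed question i was answered
    correctly (hypothetical data layout, assumed for this sketch).
    """
    if not correct:
        raise ValueError("no results to score")
    if any(len(row) == 0 for row in correct):
        raise ValueError("every seed question needs at least one variant")
    solved = sum(1 for row in correct if all(row))
    return solved / len(correct)


# Toy data: 3 seed questions with 3 variants each (not real benchmark results).
variant_results = [
    [True, True, True],     # counts as solved
    [True, False, True],    # a single miss makes the whole question count as failed
    [False, False, False],  # failed
]
worst_acc = worst_case_accuracy(variant_results)  # 1 solved question out of 3
```

Because a single failed variant zeroes out its seed question, this score can never exceed the fraction of correct answers over all variants, and the gap between the two is one way to read how robust a model is to variation.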

Results Analysis

BibTeX

@misc{zou2024dynamic,
      title={DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models}, 
      author={Chengke Zou and Xingang Guo and Rui Yang and Junyu Zhang and Bin Hu and Huan Zhang},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}