DynaMath

A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

Chengke Zou^*,¹,², Xingang Guo^*,¹, Rui Yang^*,¹, Junyu Zhang¹, Bin Hu¹, Huan Zhang¹

¹University of Illinois at Urbana-Champaign
²University of California, Berkeley

^*Equal contribution

ICLR 2025

This example demonstrates that Claude 3.5 Sonnet does not exhibit consistent performance for different variants of a math question accompanied by visual input.

Introduction

The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Humans can reliably apply the same solution steps to similar problems with minor modifications. In contrast, we found that state-of-the-art VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness of VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. We introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. These programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs by assessing their performance under varying input conditions of a seed question.

DynaMath Dataset

Overview

We present DynaMath, a curated evaluation dataset aimed at assessing the robustness of visual language models (VLMs) in multimodal mathematical reasoning across a wide variety of mathematical tasks with dynamic visual and textual contexts. Our benchmark consists of 501 seed questions, each represented as a Python program. Of these, 227 (45.3%) are sourced from established visual math datasets, while 274 (54.7%) are newly collected or developed from public resources. For each seed question in the dataset, we generate M = 10 variants, resulting in a total of 501 × 10 = 5,010 concrete questions.

Key statistics of DynaMath.

Variant number distribution.

Source composition of DynaMath.

In DynaMath, we integrate various types of variants to enrich the diversity of question generation:

Several variation types considered in our DynaMath benchmark.

Dataset Collection

Our benchmark collection comprises two phases: seed question collection and program-based question generation. In the initial phase, we selectively curate a set of high-quality mathematics problems that necessitate reasoning based on visual information. The subsequent phase involves transforming each seed question into code-based prototypes, allowing for the generation of diverse concrete questions under randomly sampled conditions.
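The second phase can be illustrated with a minimal sketch of what a seed question written as a program might look like. Everything here is hypothetical and is not taken from the DynaMath code base: the question itself, the names `ConcreteQuestion`, `seed_shifted_abs`, and `generate_variants`, and the choice to describe the figure as a dictionary instead of rendering an image.

```python
import random
from dataclasses import dataclass


@dataclass
class ConcreteQuestion:
    text: str          # question shown to the model
    figure_spec: dict  # parameters a plotting routine would render (hypothetical)
    answer: float      # ground truth computed by the program itself


def seed_shifted_abs(rng: random.Random) -> ConcreteQuestion:
    """Hypothetical seed question: the minimum of f(x) = |x - a| + b."""
    # Randomly sampled conditions: each draw yields a different graph.
    a = rng.randint(-5, 5)
    b = rng.randint(-5, 5)
    return ConcreteQuestion(
        text="The graph shows f(x) = |x - a| + b. What is the minimum value of f(x)?",
        figure_spec={"kind": "function_graph", "a": a, "b": b, "x_range": (-10, 10)},
        answer=float(b),  # |x - a| is smallest (zero) at x = a
    )


def generate_variants(seed_fn, m: int = 10, base_seed: int = 0):
    """Call the seed program m times, each with its own reproducible RNG."""
    return [seed_fn(random.Random(base_seed + j)) for j in range(m)]


variants = generate_variants(seed_shifted_abs)
```

Because the ground truth is computed from the same sampled parameters that define the figure, every generated variant carries a correct answer without manual annotation.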

Generation Procedure in DynaMath.

Question Variant Examples

Seed Question 169:

Response from GPT-4o:

Seed Question 75:

Response from Gemini:

Seed Question 346:

Response from Qwen2-VL-72B:

Experiment Results

To evaluate the mathematical reasoning robustness of existing VLMs on DynaMath, we generate 10 variants of each seed question, resulting in a total of 5,010 questions to assess their performance.

Average-case Accuracy

The table below shows the average-case accuracy of 14 models (three closed-source Large Multimodal Models (LMMs) and 11 Vision Language Models (VLMs)) on DynaMath with 5,010 generated questions. Question topics (PG, SG, AG, etc.) and difficulty levels (EL, HI, UN) are defined in the previous table.
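As a minimal sketch, average-case accuracy can be read as the fraction of all generated questions answered correctly. The function name `average_case_accuracy` and the layout of `correct` (one row per seed question, one entry per variant) are assumptions made for illustration, and the toy data is not a benchmark result.

```python
def average_case_accuracy(correct):
    """correct[i][j] is True if variant j of seed question i was answered correctly."""
    total = sum(len(row) for row in correct)
    return sum(sum(row) for row in correct) / total


# Toy data: 3 seed questions with 3 variants each (illustrative only).
correct = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
]
avg = average_case_accuracy(correct)  # 5 correct answers out of 9
```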

Worst-case Accuracy

The table below shows the worst-case accuracy of 14 models (three closed-source Large Multimodal Models (LMMs) and 11 Vision Language Models (VLMs)) on DynaMath with 5,010 generated questions. Question topics (PG, SG, AG, etc.) and difficulty levels (EL, HI, UN) are defined in the previous table.
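A minimal sketch of worst-case accuracy, assuming that a seed question counts as solved only when every one of its variants is answered correctly. The function name `worst_case_accuracy` and the row-per-seed layout of the input are illustrative assumptions, and the toy data is not a benchmark result.

```python
def worst_case_accuracy(correct):
    """correct[i][j] is True if variant j of seed question i was answered correctly.

    A seed question contributes 1 only if all of its variants are correct.
    """
    return sum(all(row) for row in correct) / len(correct)


# Toy data: 3 seed questions with 3 variants each (illustrative only).
correct = [
    [True, True, True],     # robustly solved
    [True, False, True],    # fails on one variant, so it does not count
    [False, False, False],  # never solved
]
worst = worst_case_accuracy(correct)  # 1 of 3 seed questions
```

Under this reading, a single failed variant is enough to mark the whole seed question as unsolved, so the score can never exceed the fraction of individual questions answered correctly.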

Results Analysis

Comparing Reasoning Robustness across different models. Here we define Reasoning Robustness (RR) as the ratio between the average-case performance and the worst-case performance.

Comparing Reasoning Robustness across different topics. Here we define Reasoning Robustness (RR) as the ratio between the average-case performance and the worst-case performance.
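A minimal sketch of the RR computation. The captions say only that RR is a ratio between the two accuracies; this sketch assumes the direction worst-case over average-case, so that RR lies in [0, 1] and a value of 1 means no drop across variants. The function name and the example numbers are illustrative, not reported results.

```python
def reasoning_robustness(avg_acc: float, worst_acc: float) -> float:
    """RR under the assumed direction: worst-case accuracy / average-case accuracy."""
    if avg_acc <= 0:
        raise ValueError("RR is undefined when average-case accuracy is zero")
    return worst_acc / avg_acc


# Illustrative numbers only: a model that keeps half of its average accuracy.
rr = reasoning_robustness(avg_acc=0.6, worst_acc=0.3)
```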

The Repetition Consistency for different models over 5 repetitions. The repetition consistency represents the model's confidence in the answer.
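A minimal sketch of one way to compute repetition consistency for a single question, assuming it is the fraction of repeated responses that agree with a reference response from the same model. The function name, the use of string answers, and the sample answers are assumptions for illustration.

```python
def repetition_consistency(reference_answer, repeated_answers):
    """Fraction of repeated answers that match the model's reference answer."""
    matches = sum(ans == reference_answer for ans in repeated_answers)
    return matches / len(repeated_answers)


# Illustrative only: 5 repetitions of the same question, 4 agree with the reference.
rc = repetition_consistency("12", ["12", "12", "15", "12", "12"])
```

A high value indicates that the model gives the same answer when the identical question is repeated, so differences across variants are less likely to come from sampling randomness alone.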

Error Analysis of Claude-3.5 Sonnet.

BibTeX

@misc{zou2024dynamic,
      title={DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models}, 
      author={Chengke Zou and Xingang Guo and Rui Yang and Junyu Zhang and Bin Hu and Huan Zhang},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
