Point Arena

Probing Multimodal Grounding Through Language-Guided Pointing

Long Cheng¹∗ Jiafei Duan¹,²∗ Yi Ru Wang¹† Haoquan Fang¹,²† Boyang Li¹†
Yushan Huang¹ Elvis Wang³ Ainaz Eftekhar¹,² Jason Lee¹,² Wentao Yuan¹
Rose Hendrix² Noah A. Smith¹,² Fei Xia¹ Dieter Fox¹ Ranjay Krishna¹,²

¹University of Washington ²Allen Institute for Artificial Intelligence
³Anderson Collegiate Vocational Institute

∗Co-first authors. †Co-second authors.

Point-Bench

Standardized evaluation of precise spatial alignment between language and vision

Rank	Model	Affordance	Spatial	Reasoning	Steerability	Counting	Average

Point-Bench Gallery

Each gallery example pairs an image and a mask with a query:

Query: "Point to the object to the right of the television."

Query: "Point to the object used to stir or mix things."

Query: "Point to the structure that cars use to travel over the river."

Query: "Point to the window to the right of the red door."

Query: "Point to the screen indicating the speed."

Query: "Point to where items could be placed on the bike."

Query: "Point to the object that most likely contains policies."

Query: "Point to all the cups."

Query: "Point to what could drive people from one place to another in the airport."

Query: "Point to the animal that lays eggs."
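
The gallery pairs each query with an image and a mask, which suggests that a predicted point can be scored by checking whether it lands inside the mask. The following is a minimal sketch under that assumption, not the benchmark's actual scoring code. The function names `point_in_mask` and `pointing_accuracy`, the (x, y) pixel convention, and the nested-list mask format are all hypothetical choices made for illustration.

```python
import math


def point_in_mask(point_xy, mask):
    """Return True if an (x, y) pixel coordinate falls on a nonzero mask cell.

    Assumption: `mask` is a list of rows (mask[row][col]) holding 0/1 values,
    and x indexes columns while y indexes rows.
    """
    x, y = point_xy
    height = len(mask)
    width = len(mask[0]) if height else 0
    # Floor fractional coordinates to the pixel that contains them.
    col, row = math.floor(x), math.floor(y)
    if not (0 <= col < width and 0 <= row < height):
        return False  # Points outside the image count as misses.
    return bool(mask[row][col])


def pointing_accuracy(predictions, masks):
    """Fraction of predicted points that land inside their paired mask."""
    hits = [point_in_mask(p, m) for p, m in zip(predictions, masks)]
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical 4x6 mask with a target region covering rows 1-2, columns 2-4.
example_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
inside = point_in_mask((3.4, 1.9), example_mask)
outside = point_in_mask((0, 0), example_mask)
```

A query such as "Point to all the cups." expects several points, so a full scorer would also need a rule for matching multiple predictions to multiple target regions; this sketch covers only the single-point case.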

Dataset Analysis

Category Distribution

Comprehensive Dataset Analysis

Point-Battle

Performance disparities across model types and prompt strategies

Rank	Model	Elo Rating	Wins	Losses	Games	Win Rate	Lower CI	Upper CI

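
The Point-Battle table reports an Elo rating alongside wins, losses, games, and win rate. As a rough illustration of how such columns relate, here is the textbook Elo update applied to one pairwise comparison. It is a sketch, not the platform's implementation: the K-factor of 32, the 400-point scale, and the function names are assumptions, and the confidence-interval columns are not covered.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner_rating, loser_rating, k=32.0):
    """Return updated (winner, loser) ratings after a single battle.

    Assumption: k=32 is a common default, not a value taken from Point Arena.
    """
    delta = k * (1.0 - expected_score(winner_rating, loser_rating))
    return winner_rating + delta, loser_rating - delta


def win_rate(wins, games):
    """Wins divided by games played, or 0.0 when no games were played."""
    return wins / games if games else 0.0


# Two hypothetical models start level; the winner of one vote gains 16 points.
new_winner, new_loser = update_elo(1000.0, 1000.0)
```

Because the update depends on the rating gap, beating a much higher-rated model moves both ratings more than beating an equal one.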

Point-Act

Diverse datasets for standardized scenarios and rigorous evaluation protocols

Challenge	Success Rate	SUS Score

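
The Point-Act table lists a success rate and a SUS score per challenge. Assuming "SUS" refers to the System Usability Scale (a ten-item questionnaire answered on a 1 to 5 scale), the sketch below shows the conventional way both numbers are computed. The function names and the sample responses are hypothetical and do not come from the Point-Act data.

```python
def success_rate(successes, trials):
    """Fraction of trials that succeeded, or 0.0 when there were no trials."""
    return successes / trials if trials else 0.0


def sus_score(responses):
    """Score one System Usability Scale questionnaire on a 0-100 scale.

    Conventional scoring: odd-numbered items contribute (answer - 1),
    even-numbered items contribute (5 - answer), and the sum is scaled by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS expects exactly 10 responses")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each response must be between 1 and 5")
        total += (answer - 1) if item_number % 2 == 1 else (5 - answer)
    return total * 2.5


# Hypothetical questionnaire where every item is answered with the midpoint.
neutral_score = sus_score([3] * 10)
```

A per-challenge SUS figure would typically be the mean of `sus_score` over all participants who completed that challenge.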

About Point Arena

Our Mission

Point Arena is the first open, unified evaluation platform designed specifically to assess language-guided pointing in multimodal large language models (MLLMs).

Despite recent advances in visual reasoning, existing benchmarks lack fine-grained grounding tasks that require precise spatial alignment between language and vision. Point Arena addresses this gap by offering standardized scenarios, diverse datasets, and rigorous evaluation protocols.

Research Findings

Performance Disparities: Significant differences across model types and prompt strategies
Current Limitations: Identified challenges in spatial reasoning and grounding fidelity
Future Directions: New paths for multimodal alignment research

Point Arena is publicly available and aims to facilitate reproducible and transparent progress in multimodal understanding.

Citation

@misc{cheng2025pointarenaprobingmultimodalgrounding,
      title={PointArena: Probing Multimodal Grounding Through Language-Guided Pointing}, 
      author={Long Cheng and Jiafei Duan and Yi Ru Wang and Haoquan Fang and Boyang Li and Yushan Huang and Elvis Wang and Ainaz Eftekhar and Jason Lee and Wentao Yuan and Rose Hendrix and Noah A. Smith and Fei Xia and Dieter Fox and Ranjay Krishna},
      year={2025},
      eprint={2505.09990},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.09990}, 
}
