Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
TL;DR: We move embodied reasoning evaluation from multiple-choice answers (A/B/C/D) to visual grounding, testing whether VLMs truly know where to point in realistic embodied scenarios.
Haotian Xue1,2*,
Yunhao Ge2,
Yu Zeng2,
Zhaoshuo Li2,
Ming-Yu Liu2,
Yongxin Chen1,2,
Jiaojiao Fan2
1Georgia Tech · 2NVIDIA
Most embodied reasoning benchmarks evaluate models via multiple-choice QA (MCQ): the model picks from options like A/B/C/D.
However, real-world agents need to do more than choose an answer — they must ground their reasoning in the scene:
“Where exactly should I act?”
“Which object or location is the right one?”
Point-It-Out (PIO) reframes embodied reasoning as visual grounding in realistic scenarios.
Instead of asking “Which option is correct?”, PIO asks:
- “Point to the correct object under certain constraints.”
- “Point to the location that best satisfies a task goal.”
- “Draw the trajectory that completes the task safely.”
We argue this is a crucial step toward connecting multi-modal language models to the physical world.
PIO consists of three stages, each targeting a different aspect of embodied reasoning:
- (S1) Object Reference. Refer to a specific object under different constraints:
  - Location (left/right/behind, etc.)
  - Object part (handle, lid, door, etc.)
  - Appearance attributes
- (S2) Task-Centric Grounding. Refer to where to act based on a downstream goal:
  - Recommendation (e.g., "I am thirsty, point to something that can help")
  - Affordance (where to grasp / push / open)
  - Next-state prediction (where an object will move or should move)
- (S3) Visual Trace Prediction. Predict a continuous trajectory in the image to accomplish a task:
  - Draw a motion trace to open/close objects
PIO currently contains ~600 human-curated questions:
- S1: ~230 samples
- S2: ~270 samples
- S3: ~100 samples
For S1 / S2:
- We provide human-annotated ground truth as segmentation masks.
For S3:
- We provide a VLM-based judging prompt to evaluate the quality of predicted visual traces (and recommend human evaluation for highest reliability).
```python
import json

# Load the S1/S2 annotations.
s1s2_data = json.load(open('data/s1s2.json', 'r'))
for sample in s1s2_data:
    polygon = sample['polygon']          # ground-truth seg mask (COCO-style polygon)
    image_path = sample['image_path']    # relative image path
    height, width = sample['height'], sample['width']
    prompt = sample['lang']              # task prompt
    s1_or_s2 = sample['s1_or_s2']        # 's1' or 's2'
    subclass = sample['subclasses']      # subclass label, e.g. "recommendation"
```

- The root of `image_path` is: `data/images_s1s2/`
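For reference, the COCO-style polygon can be rasterized into a binary mask, e.g. with Pillow. This is only an illustrative sketch, assuming `polygon` is a flat `[x1, y1, x2, y2, ...]` coordinate list as in the COCO convention; check the actual field layout in `data/s1s2.json` before relying on it.

```python
import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(polygon, height, width):
    """Rasterize a flat COCO-style polygon [x1, y1, x2, y2, ...] into a boolean mask.

    Assumes a single polygon per sample; adapt if the field stores a list of polygons.
    """
    mask_img = Image.new('L', (width, height), 0)
    points = [(polygon[i], polygon[i + 1]) for i in range(0, len(polygon), 2)]
    ImageDraw.Draw(mask_img).polygon(points, outline=1, fill=1)
    return np.array(mask_img, dtype=bool)
```

A predicted point `(x, y)` can then be checked against the ground truth with `mask[int(y), int(x)]`.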
```python
import json

# Load the S3 annotations.
s3_data = json.load(open('data/s3.json', 'r'))
for sample in s3_data:
    image_path = sample['image_path']    # relative image path
    prompt = sample['lang']              # task prompt
```

- The root of `image_path` is: `data/images_s3/`
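A quick way to inspect an S3 sample is to resolve the image path against this root and overlay a trace as a polyline. A minimal sketch; the `trace` list here is a placeholder for whatever (x, y) waypoints your model predicts, not a field in `s3.json`.

```python
import json
import os
from PIL import Image, ImageDraw

s3_data = json.load(open('data/s3.json', 'r'))
sample = s3_data[0]
img = Image.open(os.path.join('data/images_s3', sample['image_path'])).convert('RGB')

# Placeholder trace: replace with the (x, y) waypoints your model predicts for sample['lang'].
trace = [(100, 200), (150, 180), (220, 160)]
ImageDraw.Draw(img).line(trace, fill=(255, 0, 0), width=4)
img.save('trace_vis.png')
```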
We provide ready-to-run demo scripts to perform inference on all three stages (S1/S2/S3) using Gemini-2.5-flash (or any other VLM defined in code/vlms/).
Simply run:
For S1/S2:

```bash
EXP_NAME="demo"
MAX_CASES=5   # set to -1 to run on all cases

python code/test_s1s2.py \
    --test_models gemini-2.5-flash \
    --max_cases ${MAX_CASES} \
    --exp_name ${EXP_NAME} \
    --which_s both \
    --save_path results \
    --s1s2_path data/s1s2.json
```

For S3:

```bash
EXP_NAME="demo"
MAX_CASES=5   # set to -1 to run on all cases

python code/test_s3.py \
    --test_model gemini-2.5-flash \
    --max_cases ${MAX_CASES} \
    --exp_name ${EXP_NAME} \
    --json_path data/s3.json \
    --image_root data/images_s3 \
    --save_path results
```

After running these scripts, you will find:
- Visualization: `vis.png` (or similar) containing:
  - GT segmentation mask / polygon (S1/S2)
  - Model prediction (bbox / points / trajectory)
- Raw predictions: `info.npy` files containing model outputs and metadata (see the loading sketch below).
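If you want to post-process the raw outputs programmatically, the `info.npy` files can typically be loaded back with NumPy. A minimal sketch, assuming each file stores a pickled Python object (e.g. a dict of predictions and metadata); inspect one file first to confirm its layout.

```python
import glob
import numpy as np

# Adjust the glob pattern to your save_path / exp_name layout.
for path in glob.glob('results/**/*.npy', recursive=True):
    info = np.load(path, allow_pickle=True)
    if info.ndim == 0:    # a pickled dict is stored as a 0-d object array
        info = info.item()
    print(path, list(info.keys()) if isinstance(info, dict) else type(info))
```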
Example visualizations (top row: S1, middle row: S2, bottom row: S3):
The VLM zoo is defined in `code/vlms/__init__.py`.

To plug in a new VLM:

- Register the model in `get_vlm()`: add your `<model_name>` and the corresponding `<ModelClass>`.
- Implement your model class under `code/vlms/`: create `new_model.py` with a class implementing:
  - `__call__(...)` – used by default for S1/S2.
  - `preprocess_image(...)` – optional helper for image processing.
  - `s3(...)` – used for S3 trajectory prediction.
- Add prompts: define your model's prompt template and set the path via `self.question_template = "prompts/your_model_prompt.txt"`.
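As a starting point, a new model file might look roughly like the skeleton below. This is a hypothetical sketch of the interface described above, not the repo's exact base class; match the method signatures and return formats against an existing implementation in `code/vlms/` (e.g. the Gemini wrapper).

```python
# code/vlms/new_model.py (illustrative skeleton only)
from PIL import Image

class NewModel:
    def __init__(self):
        # Prompt template used to format S1/S2 questions (path is an example).
        self.question_template = "prompts/your_model_prompt.txt"

    def preprocess_image(self, image_path):
        # Optional helper: load / resize the image before sending it to the model.
        return Image.open(image_path).convert("RGB")

    def __call__(self, prompt, image_path):
        # S1/S2: return the grounded prediction (e.g. a point or bbox) for `prompt`.
        image = self.preprocess_image(image_path)
        raise NotImplementedError("call your VLM API here")

    def s3(self, prompt, image_path):
        # S3: return a list of (x, y) waypoints tracing the predicted motion.
        raise NotImplementedError("call your VLM API here")
```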
We include demo implementations for:
- GPT series
- Gemini series
- MoLMO
- RoboRefer
- RoboBrain
along with their prompts in the prompts/ directory.
For S3 (visual trace prediction), we provide an automatic VLM-based evaluation prompt in:
prompts/s3_auto.txt
However, we strongly recommend human evaluation for rigorous assessment, especially for safety- and planning-critical tasks.
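If you do use the automatic judge, the flow is roughly: render the predicted trace on the image, then send it together with `prompts/s3_auto.txt` to a judge VLM. A rough sketch under stated assumptions: the import path and the call signature of `get_vlm(...)` / the returned model are guesses to be verified against `code/vlms/__init__.py`, and the judge's free-form reply still needs to be parsed into a score.

```python
from code.vlms import get_vlm  # assumed import path; verify against the repo layout

judge = get_vlm('gemini-2.5-flash')
judge_prompt = open('prompts/s3_auto.txt').read()

# 'trace_vis.png' is an image with the predicted trajectory drawn on top (see the S3 sketch above).
reply = judge(judge_prompt, 'trace_vis.png')
print(reply)  # parse the score / verdict out of the judge's answer
```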
For S1 / S2, we provide an evaluation script:
code/eval_s1s2.py
This script:
- Loads predictions from `results/` (output by `test_s1s2.py`)
- Computes IoU-based metrics against ground-truth segmentation
- Prints summary scores (S1, S2, and subclasses) to the console
You can use it to compare different models on the exact same benchmark.
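If you want to reproduce the metric outside the script (or score a model that isn't wired into the repo), the core computation is an overlap check between the ground-truth mask and the prediction. A minimal sketch using the `polygon_to_mask` helper from the S1/S2 section above; whether `eval_s1s2.py` scores points, boxes, or masks exactly this way is an assumption, so treat the official script as the reference.

```python
import numpy as np

def point_hit(mask, x, y):
    """1.0 if the predicted point falls inside the GT mask, else 0.0."""
    h, w = mask.shape
    return float(0 <= int(y) < h and 0 <= int(x) < w and mask[int(y), int(x)])

def bbox_iou(mask, x1, y1, x2, y2):
    """IoU between the GT mask and a predicted axis-aligned box."""
    pred = np.zeros_like(mask)
    pred[int(y1):int(y2), int(x1):int(x2)] = True
    inter = np.logical_and(mask, pred).sum()
    union = np.logical_or(mask, pred).sum()
    return inter / union if union > 0 else 0.0
```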
If you find PIO useful, please consider starring the repo and citing the paper 💫
```bibtex
@article{xue2025point,
  title={Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding},
  author={Xue, Haotian and Ge, Yunhao and Zeng, Yu and Li, Zhaoshuo and Liu, Ming-Yu and Chen, Yongxin and Fan, Jiaojiao},
  journal={arXiv preprint arXiv:2509.25794},
  year={2025}
}
```