Progress in visual reasoning has been slower than in text-based reasoning.
The progress of LLMs in text-based reasoning in 2024-2025 has been remarkable. In the span of a year, LLMs went from general-purpose chatbots to solving competition-level math problems, writing sophisticated code in large repositories, and answering expert-level questions in the sciences. This rate of progress is apparent when comparing benchmark results across model releases. For example, GPT-4o achieved just 15.0% on the math problems of AIME 2025, while OpenAI's later GPT-5-mini reaches a staggering 94.0%. Similar gains appear on the coding benchmark LiveCodeBench and the science QA benchmark GPQA.
Unfortunately, progress in visual reasoning has been significantly slower. Comparing the same two models, the improvements across benchmarks are much more modest: +7.3% on BLINK, +3.5% on VSR, and a slight decrease on CountBenchQA. Note, however, that in absolute terms most of the visual benchmarks are easier: GPT-4o already attains relatively high accuracy on BLINK, VSR, and CountBenchQA, whereas its performance on the text-only benchmarks is substantially lower. On the harder 3D spatial reasoning benchmark Omni3D-Bench, accuracy remains quite low: GPT-5-mini reaches 40.9%, only 5.9% above GPT-4o. This discrepancy is not surprising; visual reasoning requires precise object grounding and an understanding of complex spatial relationships, both of which remain challenging for current models.
Methods to improve visual reasoning broadly fall into two categories. The first integrates grounding with language reasoning, where vision-language models (VLMs) generate chain-of-thought explanations in text. Examples include Thinking with Images [1], GRIT [2], and Visually Grounded RL [3]. These methods can handle simple spatial relations, but suffer from weak visual understanding and logical errors. For instance, in the example above, GPT-5-Thinking ignores real-world 3D object sizes and reasons only over pixel dimensions, incorrectly concluding that the coffee table is six times shorter than the sofa. These methods are also data-hungry, requiring extensive supervision.
Another line of work uses LLMs for program synthesis with vision specialists. Examples include VADAR [4], VisProg [5], and ViperGPT [6]. These training-free approaches rely on proprietary LLMs and pre-trained specialists that are poorly aligned with visual and spatial reasoning.
The VALOR Framework
We introduce VALOR, a scalable, annotation-free training framework that tackles spatial reasoning from images by combining LLM-powered reasoning with specialized tool use. VALOR employs an LLM to generate plans and executable programs and invokes vision specialists for execution. Both the reasoning model and the visual grounding model are tuned for the task via a label-free training paradigm. This is achieved by leveraging multimodal verifiers that critique model outputs. Their feedback serves as a learning signal to improve both components: the LLM responsible for logic and the vision specialists responsible for grounding. We name our approach VALOR as it integrates Verifiers for Annotation-free LOgic and Reasoning.
Plan and Code Generation. Given a query, the LLM generates a natural language plan followed by a corresponding program in Python. The LLM has access to an API of three function calls:
- `GD_DETECT`: returns the bounding boxes of all object instances matching a noun description, e.g., `GD_DETECT("CAR")`, using a GroundingDINO model [7].
- `DEPTH`: returns the depth of a pixel in the image, e.g., `DEPTH(IMAGE, X, Y)`, using MoGe-2 [8].
- `VQA`: returns an object's attribute (e.g., color) from an image crop around the object, e.g., `VQA(IMAGE_CROP, "WHAT IS THE COLOR OF THE OBJECT IN THE IMAGE?")`, using GPT-5-mini [9].
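To make the interface concrete, below is a hypothetical sketch of the kind of plan and program the LLM might generate for a query such as "Is the red mug closer to the camera than the laptop?". The query, the crop_image helper, and all variable names are our own illustrative assumptions; only GD_DETECT, DEPTH, and VQA correspond to the API above, and they are assumed to be provided by the execution environment.

```python
# Hypothetical example of an LLM-generated plan and program for the query
# "Is the red mug closer to the camera than the laptop?".
# GD_DETECT, DEPTH, and VQA are supplied by the execution environment;
# crop_image and the variable names are illustrative assumptions.

# <plan>
# 1. Detect all mugs and keep the red one (checked with VQA on its crop).
# 2. Detect the laptop.
# 3. Query the depth at each box center and compare the two values.
# </plan>

def box_center(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2

def solve(image):
    mug_boxes = GD_DETECT("mug")        # list of (x1, y1, x2, y2) boxes
    laptop_boxes = GD_DETECT("laptop")

    # Keep only mugs whose crop the VQA model describes as red.
    red_mugs = [
        b for b in mug_boxes
        if "red" in VQA(crop_image(image, b),
                        "What is the color of the object in the image?").lower()
    ]
    if not red_mugs or not laptop_boxes:
        return "unknown"

    mug_depth = DEPTH(image, *box_center(red_mugs[0]))
    laptop_depth = DEPTH(image, *box_center(laptop_boxes[0]))
    return "yes" if mug_depth < laptop_depth else "no"
```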
Improving Reasoning with LLM Verifiers
VALOR leverages LLM verifiers as a reward signal to improve reasoning via reinforcement learning. LLM verifiers critique model outputs across a rubric of six criteria targeting specific aspects of spatial reasoning.
Our reward is composed of six binary rewards:
- Format Reward ($r_{\mathrm{fmt}}$): Ensures model outputs are properly formatted. The format reward is 1 if the model output contains the proper `<plan>...</plan>` and `<answer>...</answer>` tags, and 0 otherwise.
- Syntax Reward ($r_{\mathrm{sn}}$): Evaluates whether the predicted program executes without Python errors. The syntax reward is 1 if the program executes properly with placeholder variables, 0 otherwise.
- Spatial Reward ($r_{\mathrm{sp}}$): An LLM verifies that the predicted plan addresses all spatial relationships in the query (above, behind, left of, etc.). The verifier returns 1 if all spatial relationships are addressed, 0 otherwise.
- Attribute Reward ($r_{\mathrm{att}}$): An LLM verifier assesses whether the plan specifies explicitly and correctly how to compute all relevant attributes (height, color, etc.) in the query. The LLM returns 1 if all attributes are computed correctly, otherwise 0.
- Logic Reward ($r_{\mathrm{log}}$): An LLM verifier is given the query and the predicted plan; it returns 1 if it considers the plan reasonable and coherent for the given query, 0 otherwise.
- Adherence Reward ($r_{\mathrm{ad}}$): The predicted plan and code are given to an LLM verifier, which returns 1 if the code faithfully implements the plan without deviations, 0 otherwise.
Our final reward is:
\[ R(q,p,c) = r_{\mathrm{fmt}}(p,c) \cdot \big[ \lambda_{\mathrm{sn}}\, r_{\mathrm{sn}}(c) + \lambda_{\mathrm{log}}\, r_{\mathrm{log}}(q,p) + \lambda_{\mathrm{att}}\, r_{\mathrm{att}}(q,p) + \lambda_{\mathrm{sp}}\, r_{\mathrm{sp}}(q,p) + \lambda_{\mathrm{ad}}\, r_{\mathrm{ad}}(p,c) \big] \]

The format reward $r_{\mathrm{fmt}}$ acts as a hard constraint and is applied as a multiplier, while the weighted sum of the remaining rewards evaluates content quality. All $r_k \in \{0, 1\}$ and $\sum_k \lambda_k = 1.0$.
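As a concrete illustration, the reward above can be computed as in the following sketch. The weight values are placeholders (any set of $\lambda_k$ summing to 1.0 fits the formulation), not the values used in training.

```python
# Minimal sketch of the reward defined above. All component rewards are
# binary; the lambda weights are illustrative placeholders, not the values
# used in training.

def valor_reward(r_fmt, r_syn, r_log, r_att, r_sp, r_ad,
                 weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    lam_syn, lam_log, lam_att, lam_sp, lam_ad = weights
    assert abs(sum(weights) - 1.0) < 1e-6  # weights must sum to 1.0
    content = (lam_syn * r_syn + lam_log * r_log + lam_att * r_att
               + lam_sp * r_sp + lam_ad * r_ad)
    # The format reward is a hard gate: without proper <plan>/<answer> tags
    # the total reward is zero regardless of content quality.
    return r_fmt * content

# Example: well-formatted output with correct syntax, logic, and adherence,
# but the plan misses a spatial relationship and an attribute computation.
print(round(valor_reward(r_fmt=1, r_syn=1, r_log=1, r_att=0, r_sp=0, r_ad=1), 3))  # 0.6
```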
Improving Visual Grounding with VLM Verifiers
In addition to logic, visual reasoning relies on accurate grounding. Modern detectors like GroundingDINO, trained on web data, are error-prone and struggle to generalize beyond their training domains. Fine-tuning with domain-specific labels can mitigate these issues, but collecting such annotations is labor-intensive. We propose an alternative: improving visual grounding through VLM verifiers. Vision specialists make predictions, VLM verifiers evaluate them, and the verified outputs augment the specialists' training set. This approach requires no manual annotations and scales across domains without additional labels.
Our approach for verifier-improved visual grounding relies on image-query pairs $\{(I_j, q_j)\}_{j=1}^{M}$. For each query $q_j$, our LLM reasoning model generates a plan and code, $(p_j, c_j)$. From code $c_j$, we parse all grounding queries (e.g., `GD_DETECT("HELMET")`) and execute them with a pre-trained detector. To ensure high recall, we lower the detector's confidence threshold. This leads to overprediction, which we validate with a frozen VLM verifier in three steps:
- Coarse Filtering: The input image with all candidate detections is passed to the verifier, which is prompted to discard all boxes where the object does not match the box label.
- Per-crop Object Check: The verifier is given a cropped image of each remaining box and asked to verify if the object visible in the crop matches the predicted label. All incorrect boxes are discarded.
- Deduplication: The input image with all remaining detections is shown to the verifier, which is tasked with discarding all duplicate predictions. For each set of duplicates, the VLM is asked to retain the most correct box.
Confirmed detections form a new training set, which we use to fine-tune a pre-trained GroundingDINO detector.
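A schematic sketch of this three-step verification is shown below. The verifier functions are hypothetical stand-ins for prompting a frozen VLM such as GPT-5-mini; only the overall three-step structure follows the description above.

```python
# Schematic sketch of the three-step VLM verification used to build the
# grounding training set. The vlm_* callables are hypothetical stand-ins
# for prompting a frozen VLM verifier; crop_image is an assumed helper.

def verify_detections(image, boxes, label,
                      vlm_coarse_filter, vlm_check_crop, vlm_deduplicate,
                      crop_image):
    # Step 1 -- Coarse filtering: show the image with all candidate boxes
    # and drop those whose content does not match the box label.
    boxes = vlm_coarse_filter(image, boxes, label)

    # Step 2 -- Per-crop object check: verify each remaining box individually
    # and discard boxes whose crop does not show the labeled object.
    boxes = [b for b in boxes if vlm_check_crop(crop_image(image, b), label)]

    # Step 3 -- Deduplication: for each set of duplicate detections, keep
    # only the box the verifier judges most accurate.
    boxes = vlm_deduplicate(image, boxes, label)

    return boxes  # confirmed detections, added to the fine-tuning set
```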
Inference
The predicted Python programs, which invoke our vision-specialist APIs, are executed to produce answers to visual reasoning queries.
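As a sketch of one possible implementation (not necessarily how VALOR's executor is built), a generated program can be run in a namespace that exposes the vision-specialist API:

```python
# One possible way to execute a generated program: the program text is
# exec'd in a namespace exposing the vision-specialist API, then its
# solve() function is called. This is a sketch under assumptions, not
# VALOR's actual executor.

def run_program(program_text, image, api):
    # `api` maps names like "GD_DETECT", "DEPTH", "VQA" (plus any shared
    # helpers such as a cropping function) to callables wrapping the
    # underlying specialists (GroundingDINO, MoGe-2, GPT-5-mini).
    namespace = dict(api)
    exec(program_text, namespace)      # defines solve(image)
    return namespace["solve"](image)
```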
Aside: can trained models ever outperform the verifiers?
A natural question is whether VALOR can ever outperform the verifiers it uses during training. To answer this, we first note that VALOR uses multimodal verifiers to select and critique data, not to generate it. Thus, VALOR is not bound by the generation abilities of an LLM/VLM, but by its verification abilities. This distinction is important, as we find there are tasks where VLMs are better verifiers than generators. As a concrete example, VALOR uses GPT-5-mini as the VLM verifier for improving the visual grounding module. Although highly effective at evaluating object detections, GPT-5-mini often struggles to generate bounding boxes itself. In the figure above, it frequently outputs misaligned or overly large boxes, failing to localize objects that VALOR (trained with GPT-5-mini as a verifier) correctly detects. In short, a VLM can provide reliable binary judgments about correctness even when its own grounding predictions are imperfect.
Model Comparisons
We compare VALOR to a series of models on visual reasoning benchmarks below.
Benchmark Evaluations
We evaluate a series of open-source models, as well as VALOR, across a wide range of spatial reasoning benchmarks. Each LLM is used in a language-only setting and is prompted to generate Python programs that invoke an API of vision-specialist models (detection, depth estimation, VQA), as described above. We execute the generated programs to determine accuracy on each benchmark.
How do open-source models perform?
Among the open-source models we evaluate (Llama3.2-11B, Gemma3-12B, and Qwen3-8B), Qwen3 consistently performs best. Despite our use of the instruction-tuned variants, Gemma3 and Llama3.2 routinely ignore our system prompts. For example, both models frequently overwrite the input image path, define "placeholder" values, or argue that the query is impossible and refuse to answer altogether. In contrast, Qwen3 consistently produces reasonable programs, but it mishandles nuanced details in the query and fails to use tools effectively. We believe these issues can be addressed via post-training, so we build VALOR on the capable Qwen3 model.
Qwen3 vs VALOR-RL: Training with verifiers improves model reasoning.
We compare VALOR-RL with Qwen3 to isolate the impact of verifier-improved reasoning. VALOR-RL uses a verifier-trained Qwen3 model with the same vision specialist models. Thus, any improvements from Qwen3 to VALOR-RL stem from our LLM-verifier-guided training. VALOR-RL shows gains over Qwen3: +3.4% on BLINK, +2.1% on VSR, and +1.3% on RoboSpatial. Most notably, VALOR-RL greatly improves on Omni3D-Bench (+6.4%), our most reasoning-intensive benchmark. On the counting tasks TallyQA and CountBenchQA, where reasoning is less critical, VALOR-RL matches Qwen3.
VALOR-RL vs VALOR: Training with verifiers improves visual grounding.
In the plot above we compare VALOR, our final method, to VALOR-RL. The two variants execute identical programs, but VALOR uses the verifier-improved visual grounding module. VALOR yields strong gains across the board, particularly on grounding-focused benchmarks: +8.3% on CountBenchQA, +7.7% on RoboSpatial, and +5.3% on VSR. Improvements on Omni3D-Bench are smaller, as its complex queries make reasoning the main challenge for smaller LLMs. Notably, improving visual grounding for spatial reasoning does not harm general object detection; our training slightly boosts performance on the COCO validation set, from 48.4% to 48.7% mAP.
Conclusion
We introduce VALOR, an annotation-free training paradigm for visual reasoning that leverages multimodal verifiers to improve LLM reasoning and visual grounding, leading to significant improvements on a wide range of spatial reasoning benchmarks. We find that VLMs/LLMs are increasingly capable verifiers, not merely generators; in fact, there are tasks where they are excellent verifiers but poor generators (e.g., object detection). This suggests an alternative path to improving reasoning in the visual domain: leveraging the multimodal verification capabilities of these models to enable training in domains where ground truth is unavailable.
Acknowledgements
We thank Aadarsh Sahoo, Ilona Demler, and Ziqi Ma for their feedback on the project. The project is funded by Meta through the LLM evaluation research grant and partly through Caltech's CAST program. We also thank Google's Gemma Academic program for granting us API credits for their LLMs.
References
- OpenAI. Thinking with Images. URL: https://openai.com/index/thinking-with-images/
- Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. GRIT: Teaching MLLMs to think with images. In NeurIPS, 2025.
- Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning. In NeurIPS, 2025.
- Damiano Marsili, Rohun Agrawal, Yisong Yue, and Georgia Gkioxari. Visual agentic AI for spatial reasoning with a dynamic API. In CVPR, 2025.
- Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In CVPR, 2023.
- Didac Suris, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In ICCV, 2023.
- Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. In CVPR, 2025.
- OpenAI. GPT-5-mini. URL: https://openai.com/index/introducing-gpt-5/
BibTeX
@misc{marsili2025labelsproblemtrainingvisual,
title={No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers},
author={Damiano Marsili and Georgia Gkioxari},
year={2025},
eprint={2512.08889},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.08889},
}