
GTA: A Benchmark for General Tool Agents

Jize Wang¹,², Zerun Ma², Yining Li², Songyang Zhang², Cailian Chen¹, Kai Chen²*, Xinyi Le¹*

¹ Shanghai Jiao Tong University, ² Shanghai AI Laboratory
* Corresponding authors


Abstract

In developing general-purpose agents, significant focus has been placed on integrating large language models (LLMs) with various tools, which places heavy demands on the tool-use capabilities of LLMs. However, there are evident gaps between existing tool evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only inputs, which fail to reveal the agents' real-world problem-solving abilities. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about which tools are suitable and to plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across the perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as query contexts to align closely with real-world scenarios. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios and provides direction for the advancement of general-purpose tool agents.

GTA Design

GTA is a benchmark to evaluate the tool-use capability of LLM-based agents in real-world scenarios. It features three main aspects:

Real user queries. The benchmark contains 229 human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about which tools are suitable and to plan the solution steps.
Real deployed tools. GTA provides an evaluation platform equipped with tools across the perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance.
Real multimodal inputs. Each query comes with authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, which serve as the query context and keep the tasks close to real-world scenarios.
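To make the three aspects concrete, the sketch below models one task as a small Python record. It is only an illustration: the field names (`query`, `image_paths`, `tool_chain`), the tool names `OCR` and `Calculator`, and the sample query are all assumptions made up for this example and are not the GTA dataset schema. Only the four tool categories are taken from the description above.

```python
from dataclasses import dataclass, field

# The four tool categories named in the GTA description.
TOOL_CATEGORIES = {"perception", "operation", "logic", "creativity"}


@dataclass
class ToolCall:
    """One step of an annotated tool chain (hypothetical layout)."""
    tool: str      # hypothetical tool name, e.g. "OCR"
    category: str  # must be one of TOOL_CATEGORIES

    def __post_init__(self):
        if self.category not in TOOL_CATEGORIES:
            raise ValueError(f"unknown tool category: {self.category}")


@dataclass
class Task:
    """A human-written query, its image context, and its tool chain."""
    query: str
    image_paths: list[str]
    tool_chain: list[ToolCall] = field(default_factory=list)

    def is_multimodal(self) -> bool:
        # Real multimodal inputs: at least one image accompanies the query.
        return len(self.image_paths) > 0

    def is_multi_step(self) -> bool:
        # More than one tool call is needed to reach the answer.
        return len(self.tool_chain) > 1

    def is_tool_implicit(self) -> bool:
        # Crude heuristic: the query text never names a tool it needs.
        text = self.query.lower()
        return all(call.tool.lower() not in text for call in self.tool_chain)


# Invented example task, not taken from the dataset.
example = Task(
    query="How much should I pay in total for the items on this receipt?",
    image_paths=["receipt.jpg"],
    tool_chain=[ToolCall("OCR", "perception"), ToolCall("Calculator", "logic")],
)
```

In this sketch the query states a plain goal, while the need to read the image and then do arithmetic is left for the agent to infer, which is the sense of "implicit tool use" used above.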

The questions in GTA are all created by humans. The comparison of GTA queries with AI-generated queries is shown in the table. In ToolBench and m&m's, the steps and tool types are explicitly stated in the queries, as marked in red and blue. The queries in APIBench are simple and contain only one step. GTA's queries are both step-implicit and tool-implicit, and they are based on real-world scenarios.

Question Examples

Here are some example questions from GTA. All questions are tool-implicit and step-implicit, and all contain multimodal context inputs. They are easy-to-understand questions with clear goals, based on real-world scenarios: helpful for humans, yet complex for AI assistants to solve. A data example in JSON format is available on Hugging Face.

Dataset Construction

Two steps are performed in the dataset construction pipeline.

Query construction. Initial exemplars and instruction documents are designed by experts and given to human annotators. Annotators brainstorm and design more samples based on the exemplars.
Tool chain construction. Annotators manually call the deployed tools to check the executability of each query in the query set. Then they annotate the ground truth tool chains for each query.
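The executability check in the tool chain construction step can be pictured with a short Python sketch. It assumes a simple linear chain in which each tool receives the previous tool's output. The tools `fake_ocr` and `fake_sum`, the `TOOLS` registry, and the function `run_tool_chain` are hypothetical stand-ins invented for this example. In GTA the annotators call the deployed tools manually, and this code is not part of the benchmark.

```python
from typing import Any, Callable


# Toy stand-ins for deployed tools (hypothetical; real tools would run
# models or external services instead of returning fixed values).
def fake_ocr(image_path: str) -> str:
    return "3.50 4.25"


def fake_sum(text: str) -> float:
    return sum(float(token) for token in text.split())


TOOLS: dict[str, Callable[[Any], Any]] = {"OCR": fake_ocr, "Sum": fake_sum}


def run_tool_chain(chain: list[str], first_input: Any) -> tuple[bool, Any]:
    """Run the named tools in order, feeding each output to the next tool.

    Returns (executable, final output or error message).
    """
    value = first_input
    for name in chain:
        tool = TOOLS.get(name)
        if tool is None:
            return False, f"tool not deployed: {name}"
        try:
            value = tool(value)
        except Exception as exc:
            return False, f"{name} failed: {exc}"
    return True, value


ok, answer = run_tool_chain(["OCR", "Sum"], "receipt.jpg")
```

A query would be kept only if its chain runs to completion. The chain that succeeded can then be recorded as the ground truth annotation for that query.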

🏆 GTA Leaderboard

Notes

Models labeled with 🔶 are API-based models, while others are open-source models.
Refer to GitHub to evaluate models on GTA.

BibTeX

@misc{wang2024gtabenchmarkgeneraltool,
        title={GTA: A Benchmark for General Tool Agents}, 
        author={Jize Wang and Zerun Ma and Yining Li and Songyang Zhang and Cailian Chen and Kai Chen and Xinyi Le},
        year={2024},
        eprint={2407.08713},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2407.08713}, 
  }

This page was built using the Academic Project Page Template, which was adopted from the Nerfies project page. You are free to borrow the source code of this website; we just ask that you link back to this page in the footer.
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
