Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Xinzhuang Xiong1, Hanchong Zhang2, Yuchen Mao1, Wenjing Hu1, Tianbao Xie1, Hongsheng Xu2,
Danyang Zhang1,2, Sida Wang, Ruoxi Sun3, Pengcheng Yin4, Caiming Xiong5, Ansong Ni6,
Qian Liu7, Victor Zhong8, Lu Chen2, Kai Yu2, Tao Yu1
4Google DeepMind, 5Salesforce Research, 6Yale University, 7Sea AI Lab, 8University of Waterloo
Email to: ruishengcao@gmail.com, tyu@cs.hku.hk
Abstract
Spider2-V Framework Infrastructure
Executable Environment
Task Demonstration
We take one task example (app Airbyte, uuid 66936a8e-5cbe-4638-a03a-3ae92eb81e6c) below to showcase:
1. the .json data format;
2. the two types of task instructions (abstract and verbose);
3. the environment setup methods;
4. the video recording and action trajectory to complete the task;
5. the task-specific evaluation metric.
Each task is stored as a .json file (detailed in [Data Format](https://github.com/xlang-ai/Spider2-V/tree/main/evaluation_examples#task-format)) with the following fields:
- instruction: the task instruction, user intent, or task goal
- config: a list of functions to initialize or reset the environment in the virtual machine. Each function is represented by a JSON dict, where the type field indicates the function name and the parameters field indicates the parameters of the function
- evaluator: the evaluation function to check the agent's output. Concretely, the func field indicates the function name, the result field indicates how to obtain the predicted result from the agent, and the expected field indicates the golden result of the current task

{
"id": "66936a8e-5cbe-4638-a03a-3ae92eb81e6c",
"snapshot": "airbyte",
"instruction": "I have established a connection from Faker to local .csv file. Could you help me change the running schedule? I hope it can be replicated at 6:00 pm every day.",
"source": [
"https://docs.airbyte.com/using-airbyte/core-concepts/sync-schedules"
],
"related_apps": [
"chromium",
"airbyte",
"docker"
],
"tags": [
"gui",
"data_ingestion_and_integration",
"abstract"
],
"action_number": 6,
"config": [
{
"type": "copyfile_from_host_to_guest",
"parameters": {
"src": "evaluation_examples/examples/airbyte/66936a8e-5cbe-4638-a03a-3ae92eb81e6c/connection.json",
"dest": "/home/user/connection.json"
}
},
{
"type": "script_and_execute",
"parameters": {
"src": "evaluation_examples/examples/airbyte/66936a8e-5cbe-4638-a03a-3ae92eb81e6c/init.sh",
"dest": "/home/user/init.sh"
}
},
{
"type": "google_chrome_browser",
"parameters": {
"debugging_port": 1337,
"listening_port": 9222,
"urls": [
"https://www.bing.com/"
]
}
},
{
"type": "airbyte_webui_init",
"parameters": {
"listening_port": 9222,
"url": "https://localhost:8000",
"actions": [
{
"type": "login",
"email": "anonym@gmail.com",
"company": "ANONYM"
}
]
}
}
],
"evaluator": {
"postconfig": [],
"func": "check_include_exclude",
"result": {
"type": "vm_script_output",
"src": "evaluation_examples/examples/airbyte/66936a8e-5cbe-4638-a03a-3ae92eb81e6c/eval.sh",
"dest": "/home/user/eval.sh"
},
"expected": {
"type": "rule",
"rules": {
"include": [
"succeed"
],
"exclude": [
"failed"
]
}
}
},
"counterpart": "7657611f-2e32-47a1-89c9-3b887d803bc5"
}
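As a minimal sketch, the required fields of such a task file can be validated with a few lines of Python. Note that `load_task` and the `REQUIRED` tuple are hypothetical helpers for illustration, not part of the official Spider2-V codebase:

```python
import json

# Fields every task .json is expected to carry, per the format described above
REQUIRED = ("id", "instruction", "config", "evaluator")

def load_task(path):
    """Hypothetical loader: read a task file and check its required fields."""
    with open(path) as f:
        task = json.load(f)
    missing = [k for k in REQUIRED if k not in task]
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    return task
```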
The task above (app Airbyte, uuid 66936a8e-5cbe-4638-a03a-3ae92eb81e6c) has an abstract instruction, which means it only gives a brief, high-level description of the task without stepwise guidance.
Abstract Instruction
For its counterpart task (app Airbyte, uuid 7657611f-2e32-47a1-89c9-3b887d803bc5), the instruction is verbose, which means it also provides detailed step-by-step guidance on how to finish the task.
Verbose Instruction
1) Click the connection row whose name is "Sample Data (Faker) -> Local CSV" in the main panel;
2) Next, click the "Replication" item on the right of "Status" and "Job History";
3) We can see a panel named "Configuration". Click this panel; two rows called "Schedule type" and "Replication frequency" will appear;
4) To set the schedule as 6:00 p.m. every day, firstly we need to change the schedule type. In the drop-down options on the right, select the schedule type "Cron" instead of "Scheduled";
5) Then, input the value "0 0 18 * * ?" into the cron expression box. After that, the phrase "At 06:00 PM" should appear under the input box;
6) Finally, click the button called "Save changes" at the bottom right of this web page. The schedule is successfully altered.
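The cron expression in step 5 deserves a note: Airbyte's "Cron" schedule type uses the Quartz cron format, which has six or seven fields (second, minute, hour, day-of-month, month, day-of-week, and an optional year), unlike the five-field Unix cron. A tiny sketch makes the hour extraction concrete; `quartz_hour` is a hypothetical helper, not an Airbyte API:

```python
# Quartz cron fields: second minute hour day-of-month month day-of-week [year]
def quartz_hour(expr):
    """Hypothetical helper: return the hour field of a Quartz cron expression."""
    fields = expr.split()
    if len(fields) not in (6, 7):
        raise ValueError("Quartz cron expects 6 or 7 fields")
    return int(fields[2])

print(quartz_hour("0 0 18 * * ?"))  # 18, i.e. 6:00 PM daily
```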
For the Airbyte task with uuid 66936a8e-5cbe-4638-a03a-3ae92eb81e6c, we invoke the following environment setup functions sequentially:
{
"type": "copyfile_from_host_to_guest",
"parameters": {
"src": "evaluation_examples/examples/airbyte/66936a8e-5cbe-4638-a03a-3ae92eb81e6c/connection.json",
"dest": "/home/user/connection.json"
}
}
{
"type": "script_and_execute",
"parameters": {
"src": "evaluation_examples/examples/airbyte/66936a8e-5cbe-4638-a03a-3ae92eb81e6c/init.sh",
"dest": "/home/user/init.sh"
}
}
{
"type": "google_chrome_browser",
"parameters": {
"debugging_port": 1337,
"listening_port": 9222
}
}
{
"type": "airbyte_webui_init",
"parameters": {
"listening_port": 9222,
"url": "https://localhost:8000",
"actions": [
{
"type": "login",
"email": "anonym@gmail.com",
"company": "ANONYM"
}
]
}
}
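A natural way to execute such a config list is a small dispatcher that maps each entry's type field to a handler and passes parameters as keyword arguments. The sketch below is a hypothetical illustration of that pattern (the handler names mirror the config entries above, but the registry and its behavior are assumptions, not the actual Spider2-V implementation):

```python
# Hypothetical dispatcher: "type" selects a handler, "parameters" become kwargs.
HANDLERS = {}

def register(name):
    def deco(fn):
        HANDLERS[name] = fn
        return fn
    return deco

@register("copyfile_from_host_to_guest")
def copyfile_from_host_to_guest(src, dest):
    # In the real environment this would copy src on the host to dest in the VM.
    return f"copied {src} -> {dest}"

def run_config(config):
    """Run each setup function in order, as the benchmark does before a task."""
    return [HANDLERS[entry["type"]](**entry["parameters"]) for entry in config]
```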
For a quick glance at more task examples, please refer to the [Task Viewer](explorer.html) page.
Task Instruction
Video Recording
Action Trajectory
import pyautogui
import time

## Action 1
index_80 = (417, 288)
pyautogui.click(index_80)
time.sleep(1)
## Action 2
index_83 = (502, 307)
pyautogui.click(index_83)
time.sleep(1)
## Action 3
index_91 = (883, 404)
pyautogui.click(index_91)
time.sleep(1)
## Action 4
index_102 = (1130, 481)
pyautogui.click(index_102)
time.sleep(1)
## Action 5
index_121 = (1130, 782)
pyautogui.click(index_121)
time.sleep(1)
## Action 6
index_98 = (1130, 430)
pyautogui.click(index_98)
time.sleep(1)
## Action 7
index_105 = (1130, 560)
pyautogui.click(index_105)
time.sleep(1)
## Action 8
index_103 = (1050, 481)
# Clear the current cron expression
pyautogui.click(index_103)
pyautogui.hotkey('ctrl', 'a')
pyautogui.press('backspace')
time.sleep(1)
# Enter the new cron expression
pyautogui.typewrite('0 18 * * *')
time.sleep(1)
## Action 9
index_103 = (1050, 481)
# Clear the current cron expression
pyautogui.click(index_103)
pyautogui.hotkey('ctrl', 'a')
pyautogui.press('backspace')
time.sleep(1)
# Enter the new Quartz cron expression
pyautogui.typewrite('0 0 18 * * ?')
time.sleep(1)
## Action 10
index_134 = (1426, 834)
pyautogui.click(index_134)
time.sleep(1)
## Action 11
DONE
In the current task, we adopt the information-based metric to check whether the schedule is correctly altered to "0 0 18 * * *" or "0 0 18 * * ?".
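Concretely, the evaluator runs eval.sh inside the VM and applies the include/exclude rules from the task file to the script's output. The sketch below illustrates that check; the function name matches the func field above, but the exact signature is an assumption for illustration:

```python
def check_include_exclude(output, rules):
    """Score 1.0 iff every 'include' phrase occurs in the eval script's output
    and no 'exclude' phrase does. (Signature is a sketch, not the exact API.)"""
    ok = (all(p in output for p in rules.get("include", []))
          and not any(p in output for p in rules.get("exclude", [])))
    return 1.0 if ok else 0.0

# eval.sh is expected to print "succeed" when the schedule was altered correctly
print(check_include_exclude("succeed", {"include": ["succeed"], "exclude": ["failed"]}))  # 1.0
```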
Data Statistics and Comparison
"Verbose" means a step-by-step guideline on how to complete the task is included in the instruction.
Key statistics of Spider2-V.
Tasks are organized by task category and professional application to showcase the content intuitively.
Distribution of tasks in Spider2-V
**The headers indicate:** the research field (Field), whether an executable environment is provided (Exec. Env.?), whether enterprise service is utilized (Enter. Serv.?), whether GUI actions are supported (GUI Support?) and some other statistics (e.g., number of involved applications or websites, number of execution-based evaluation functions).
| | Spider2-V |
|---|---|
| Field | Data Science & Engineering |
| # Tasks | 494 |
| Exec. Env.? | ✓ |
| Enter. Serv.? | ✓ |
| GUI Support? | ✓ |
| # Apps/Sites | 20 |
| # Exec. Eval. Func. | 151 |
| | Spider1.0 | DS1000 | Arcade | Intercode | SheetCopilot | MLAgentBench | SWEBench | Mind2Web | WEBLINX | GAIA | WebArena | WorkArena | OSWorld | AitW | AndroidWorld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Field | Text-to-SQL | Data Science | Data Science | Data Science | Sheet Coding | Machine Learning | Software Engineering | Web | Web | Web | Web | Web | Computer Control | Android | Android |
| # Tasks | 1034 | 1000 | 1082 | 1350 | 221 | 13 | 2294 | 2000 | 2337 | 466 | 812 | 29 | 369 | 30k | 116 |
| Exec. Env.? | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ |
| Enter. Serv.? | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| GUI Support? | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| # Apps/Sites | 1 | 1 | 1 | 3 | 1 | 4 | 12 | 137 | 155 | n/a | 5 | 1 | 9 | 357 | 20 |
| # Exec. Eval. Func. | 0 | 0 | 0 | 3 | 0 | 13 | 1 | 0 | 0 | 0 | 5 | 7 | 134 | 0 | 6 |
Benchmarking
| Rank | Model | Details | Score |
|---|---|---|---|
| 1 (Jan 16, 2025) | Learn-by-interact (Google Cloud) | SoM + EF + RAG, t=1.0, top-p=0.9, len=200k | 16.6 |
| 2 (Jun 3, 2024) | GPT-4V (1106) (OpenAI, '23) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 14.0 |
| 3 (Jun 2, 2024) | GPT-4o (0513) (OpenAI, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 13.8 |
| 4 (Jun 5, 2024) | Gemini-Pro-1.5 | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 9.1 |
| 5 (June 6, 2024) | Claude-3-Opus (Anthropic, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=200k | 8.1 |
| 6 (June 6, 2024) | Llama-3-70B (Meta, '24) | a11ytree + EF + RAG, t=1.0, top-p=0.9, len=32k | 2.0 |
| 7 (June 6, 2024) | Mixtral-8x7B (Jiang et al., '24) | a11ytree + EF + RAG, t=1.0, top-p=0.9, len=32k | 0.8 |
| 8 (June 6, 2024) | Qwen-Max (Qwen Team, '24) | a11ytree + EF + RAG, t=1.0, top-p=0.9, len=32k | 0.6 |
"Abstract" means the instruction only gives the high-level goal of the task without detailed steps.
| Rank | Model | Details | Score |
|---|---|---|---|
| 1 (Jun 3, 2024) | GPT-4V (1106) (OpenAI, '23) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 11.3 |
| 1 (Jun 2, 2024) | GPT-4o (0513) (OpenAI, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 11.3 |
| 3 (Jun 5, 2024) | Gemini-Pro-1.5 | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 6.1 |
| 4 (June 6, 2024) | Claude-3-Opus (Anthropic, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=200k | 5.3 |
"Verbose" means the instruction also gives detailed step-by-step guidance on how to finish the task.
| Rank | Model | Details | Score |
|---|---|---|---|
| 1 (Jun 3, 2024) | GPT-4V (1106) (OpenAI, '23) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 16.6 |
| 2 (Jun 2, 2024) | GPT-4o (0513) (OpenAI, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 16.2 |
| 3 (Jun 5, 2024) | Gemini-Pro-1.5 | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 12.1 |
| 4 (June 6, 2024) | Claude-3-Opus (Anthropic, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=200k | 10.9 |
"Account" means authentic user accounts (e.g., BigQuery, Snowflake) are needed to finish tasks in this split.
| Rank | Model | Details | Score |
|---|---|---|---|
| 1 (Jun 3, 2024) | GPT-4V (1106) (OpenAI, '23) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 11.2 |
| 2 (Jun 2, 2024) | GPT-4o (0513) (OpenAI, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 10.6 |
| 3 (Jun 5, 2024) | Gemini-Pro-1.5 | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 8.8 |
| 4 (June 6, 2024) | Claude-3-Opus (Anthropic, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=200k | 5.9 |
"Non-account" means authentic user accounts are not needed, i.e., tasks in this split can be completed on the local host.
| Rank | Model | Details | Score |
|---|---|---|---|
| 1 (Jun 2, 2024) | GPT-4o (0513) (OpenAI, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 15.6 |
| 2 (Jun 3, 2024) | GPT-4V (1106) (OpenAI, '23) | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 15.4 |
| 3 (Jun 5, 2024) | Gemini-Pro-1.5 | SoM + EF + RAG, t=1.0, top-p=0.9, len=128k | 9.3 |
| 3 (June 6, 2024) | Claude-3-Opus (Anthropic, '24) | SoM + EF + RAG, t=1.0, top-p=0.9, len=200k | 9.3 |
Analysis
Acknowledgement
We thank Yiheng Xu, Hongjin Su, Xiaochuan Li, and Toh Jing Hua for their helpful assistance and feedback on this work.
FAQ
Where to download the resources?
The GitHub repository, virtual machine snapshots, and crawled documents can be downloaded from:
- GitHub repository: Spider2-V (including environment and task examples)
- VM snapshots: ubuntu-arm.zip or ubuntu-x86.zip
- Crawled documents: docs.zip
What is the username and password for the virtual machines?
The username and password for the virtual machines are as follows:
- Username: user
- Password: password
How to tackle task examples requiring accounts?
See Account Guideline.
How can I configure a proxy for the VM if I'm behind a GFW?
See Proxy Guideline.
I still have problems when using Spider2-V, where can I find support?
You can open an issue on the GitHub repository or email ruishengcao@gmail.com or tyu@cs.hku.hk.
BibTeX
@article{2024-spider2v,
title={Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?},
author={Ruisheng Cao and Fangyu Lei and Haoyuan Wu and Jixuan Chen and Yeqiao Fu and Hongcheng Gao and Xinzhuang Xiong and Hanchong Zhang and Yuchen Mao and Wenjing Hu and Tianbao Xie and Hongshen Xu and Danyang Zhang and Sida Wang and Ruoxi Sun and Pengcheng Yin and Caiming Xiong and Ansong Ni and Qian Liu and Victor Zhong and Lu Chen and Kai Yu and Tao Yu},
year={2024},
journal={CoRR},
volume={abs/2407.10956},
eprint={2407.10956},
eprinttype={arXiv},
url={https://arxiv.org/abs/2407.10956}
}