| 3 |
Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning |
Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Tong Che, Dimitris N. Metaxas |
| 4 |
Exploring Personality Trait Change of LLM-Based AI Systems |
Yuhan Ma, Junjie Wang |
| 5 |
All Life is Problem Creation: Learning to Generate Environments that Maximize Performance Gain |
Titas Anciukevičius, Yuhui Wang, Piotr Piękos, Li Nanbo, Wenyi Wang, Jürgen Schmidhuber |
| 7 |
UserBench: An Interactive Gym Environment for User-Centric Agents |
Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran
Yao, Shelby Heinecke, Silvio Savarese, Huan Wang |
| 8 |
Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs |
Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alexander F Spies, Alessandra Russo, Michael
D Dennis |
| 11 |
RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users |
Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu |
| 12 |
Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI |
Christopher Lohse, Adrian Selk, Amadou Ba, Jonas Wahl, Marco Ruffini |
| 13 |
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments |
Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz |
| 18 |
PrivacyMAS: A Privacy-Preserving Multi-Agent System Framework |
Maryam Fatima |
| 20 |
Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models |
Brennen Hill, Mant Koh En Wei, Jishnuanandh Thangavel |
| 21 |
Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge
Injection in Mobile Automation |
Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang |
| 22 |
What Limits Agentic Systems Efficiency? |
Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman |
| 26 |
Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula |
Brennen Hill |
| 28 |
When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM
Coding Agents |
Matous Kozak, Roshanak Zilouchian Moghaddam, Kalpathy Sivaraman |
| 32 |
The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated
Curriculum |
Brennen Hill |
| 33 |
DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates |
Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia
Yang, Dhavan V. Shah, Robert D. Hawkins, Junjie Hu, Timothy T. Rogers |
| 34 |
MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization |
Yichen Han, Bojun Liu, Zhengpeng Zhou, Guanyu Liu, Zeng Zhang, Yang Yang, Wenli Wang, Isaac N Shi,
Yunyan, Lewei He, TIANYU SHI |
| 36 |
BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair |
Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen |
| 37 |
TutorTest: Evaluating Language Model-based Tutoring Policies Using Surrogate Tasks |
Aishwarya Mandyam |
| 43 |
You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation |
Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang,
Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu |
| 44 |
Paper2Video: Automatic Video Generation from Scientific Papers |
Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou |
| 45 |
On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems |
Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong |
| 46 |
SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models |
Xinxing Ren, Qianbo Zang, Zekun Guo |
| 48 |
SEDM: Scalable Self-Evolving Distributed Memory for Agents |
Haoran Xu, Jiacong Hu, ZHANG Ke, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, TIANYU SHI |
| 49 |
Similar: A Step-Wise, Multi-Dimensional Reward Model for Virtual Agent Learning and Reasoning |
Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang,
Tat-Seng Chua, Juncheng Li |
| 50 |
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery |
Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Ying Li, Aditi Bhaskar, Mohammed Zaman,
Noah Goodman |
| 53 |
GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge |
Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Jiaoyan Chen, Steffen Staab, Yuan He,
Evgeny Kharlamov |
| 54 |
A Multi-agent Reasoning Framework for Video Question Answering |
Abhi Kamboj, Gaurav Kumar, Krista Holden, Madhumitha Saravanan, Pradyumna Narayana |
| 59 |
LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra |
Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin |
| 60 |
Ludax: A GPU-Accelerated Domain Specific Language for Board Games |
Graham Todd, Alexander George Padula, Dennis J. N. J. Soemers, Julian Togelius |
| 61 |
Survival of the Useful: Evolutionary Boids as a Sandbox for Agent Societies |
Xisen Wang, Qi Zhang |
| 62 |
EVOLVE-MEM: A Self-Adaptive Hierarchical Memory Architecture for Next-Generation Agentic AI
Systems |
Rishi Ashish Shah, Ujjwal Kakar, Shashvat Singhal, Dinesh K Vishwakarma |
| 67 |
ReMAC: Large Language Model-Driven Reward Design for Multi-Agent Manipulation Collaboration |
Pengyi Li, Hongyao Tang, Yifu Yuan, Jianye HAO |
| 68 |
Revisiting Uncertainty Estimation and Calibration of Large Language Models |
Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu |
| 70 |
Vision-Language Models Unlock Task-Centric Latent Actions |
Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin, Andrei
Polubarov, Lyubaykin Nikita, Vladislav Kurenkov |
| 71 |
AgentyxCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration |
Harish Karthikeyan, Yue Guo, Udari Madhushani Sehwag, Leo de Castro, Antigoni Polychroniadou, Leo
Ardon, Sumitra Ganesh |
| 77 |
Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents |
Jacopo Teneggi, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash
Golkar |
| 79 |
Zephyrus: An Agentic Framework for Weather Science |
Sumanth Varambally, Marshall Fisher, Jas Thakker, Yiwei Chen, Zhirui Xia, Ruijia Niu, Yasaman
Jafari, Veeramakali Vignesh Manivannan, Zachary Novack, Luyu Han, Srikar Eranky, Salva Rühling
Cachay, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, Rose Yu |
| 80 |
Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation |
Maria Emilia Mazzolenis, Ruirui Zhang |
| 82 |
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training |
Yiming Wang, Da Yin, Yuedong Cui, Zhiqian Li, Ruichen Zheng, Zongyu Lin, Di Wu, Xueqing Wu,
Chenchen Ye, Yu Zhou, Kai-Wei Chang |
| 83 |
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments |
Chiyu Zhang, Marc-Alexandre Côté, Michael Albada, Anush Sankaran, Jack W Stokes, Tong Wang, Amir
H. Abdi, William Blum, Muhammad Abdul-Mageed |
| 84 |
ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning |
Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua
Dong, Lan-Zhe Guo, Yu-Feng Li |
| 86 |
Characterizing Deep Research: A Benchmark and Formal Definition |
Abhinav Java, Ashmit Khandelwal, Sukruta Prakash Midigeshi, Aaron Halfaker, Amit Deshpande, Navin
Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma |
| 88 |
Examining the Vulnerability of Multi-Agent Medical Systems to Human Interventions for Clinical
Reasoning |
Benjamin Liu, Dillon Mehta, Rishi Malhotra, Adam Zobian, Yong Ying Tan, Samir Chopra, Daniella
Rand, Natalie Pang, Abhiram Gudimella, Raghav Thallapragada, Derek Jiu, Prisha Shah, Kevin Zhu |
| 89 |
IndusGCC: A Data Benchmark and Evaluation Framework for GUI-Based General Computer Control in
Industrial Automation |
Xiaoran Yang, Yuyang Du, Kexin Chen, Soung Chang Liew, Jiamin Lu, Ziyu Guo, Xiaoyan Liu, Qun Yang,
Shiqi XU, Xingyu Fan, Yuchen Pan, Taoyong Cui, Hongyu Deng, Boris Düdder, Jianzhang Pan, Qun Fang,
Pheng-Ann Heng |
| 90 |
Faithful Simulation of User–Agent–Environment Interactions for Scalable LLM Agent Evaluation |
Aleksei Kudrinskii, Saibo Geng, Luca Beurer-Kellner, Marc Fischer |
| 91 |
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers |
Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo,
Silvio Savarese, Caiming Xiong, Junnan Li |
| 94 |
Code2MCP: Transforming Code Repositories into MCP Services |
Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Shaowu Pan, Min-Ling Zhang |
| 98 |
WebArena Verified: Reliable Evaluation for Web Agents |
Amine El hattami, Megh Thakkar, Nicolas Chapados, Christopher Pal |
| 99 |
See, Think, Act: Online Shopper Behavior Simulation with VLM Agents |
Yimeng Zhang, Ziyi Wang, Yuxuan Lu, Simon Sinong Zhan, Jing Huang, Dakuo Wang |
| 101 |
Agent Context Protocols Enhance Collective Inference |
Arjun Beniwal, Devansh Bhardwaj, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R
Narasimhan, Ameet Deshpande, Vishvak Murahari |
| 102 |
Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured
Reasoning |
Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea Mørch
Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Märtens, Julien Fauqueur |
| 103 |
Enabling User-Created Multi-Agent Simulations: Interactive and Customizable 2D Environments to
Study Team Dynamics with LLM Agents |
Mohammed Almutairi, Charles Chiang, Haoze Guo, Nandini Banerjee, Maria Milkowski, Daniel Nguyen,
Michael G Yankoski, Tim Weninger, Svitlana Volkova, Trenton W. Ford, Diego Gomez-Zara |
| 105 |
The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents |
Mariana Meireles, Rupali Bhati, Niklas Lauffer, Cameron Allen |
| 108 |
Go-Browse: Training Web Agents with Structured Exploration |
Apurva Gandhi, Graham Neubig |
| 110 |
VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills |
Erik M. Lintunen |
| 112 |
OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human
Online Shopping Behavior Simulation |
Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian
Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang |
| 113 |
GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning |
Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Guohao Li, Zhen Han, Volker Tresp |
| 114 |
When Agents go Astray: Course-Correcting SWE Agents with PRMs |
Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, Yara Rizk |
| 119 |
Natural Language Grounded Reinforcement Learning for Clinical Decision-Making in Virtual Patient
Simulations |
Niyel Hassan, Benjamin Liu, Jason Tsai, Jeffrey K Jopling, Dana Lin, Edward Melcer, Cara Liebert |
| 120 |
Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey |
Yuchen Huang, Sijia Li, Minghao LIU, Wei Liu, Zhiyuan Fan, Yi R. Fung |
| 121 |
RAISE: Reliable Agent Improvement via Simulated Experience |
Sahar Omidi Shayegan, Joshua Meyer, Victor Shih, Sebastian Sosa, Tianyi Peng, Kostis Kaffes,
Eugene Wu, Andi Partovi, Mehdi Jamei |
| 122 |
Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties |
Philipp J. Schneider, LIN TIAN, Marian-Andrei Rizoiu |
| 123 |
CoLLAB: A Framework for Designing Scalable Benchmarks for Agentic LLMs |
Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein |
| 125 |
Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications |
Aditi Tiwari, Akshit Bhalla |
| 130 |
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent
Environments |
Darshan Girish Deshpande, Varun Prashant Gangal, Hersh Mehta, Jędrzej Rosłaniec, Anand Kannappan,
Rebecca Qian, Peng Wang |
| 131 |
Enabling multi-agent collaboration in knowledge graph environments |
Iñaki Arango, Ayush Noori, Lucas Vittor, Joaquin Polonuer, Marinka Zitnik |
| 135 |
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents |
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song |
| 136 |
MIRAI: Evaluating LLM Agents for International Event Forecasting |
Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang |
| 137 |
PuzzleJAX: A Benchmark for Reasoning and Learning |
Sam Earle, Graham Todd, Yuchen Li, Ahmed Khalifa, Zehua Jiang, Muhammad Umair Nasir, Andrzej
Banburski-Fahey, Julian Togelius |
| 141 |
Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive
Scenarios |
Hareeshwar Karthikeyan |
| 146 |
Are LLMs Generalist Hanabi Agents? |
Mahesh Ramesh, Aswinkumar Ramkumar, Pavan Thodima, Kaousheik Jayakumar, Aniket Rege |
| 147 |
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers |
Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza
Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow |
| 150 |
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards |
Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark |
| 152 |
UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs |
Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach |
| 153 |
Steering Diffusion Policies with Value-Guided Denoising |
Hanming Ye |
| 154 |
CUBE: Collaborative Multi-Agent Block-Pushing Environment for Collective Planning with LLM Agents |
Hanqing Yang, Narjes Nourzad, Shiyu Chen, Carlee Joe-Wong |
| 156 |
Player-Coach Teamwork: Multi-agent Collaboration for Improving LLM Reasoning |
Heewon Park, Minhae Kwon |
| 163 |
Automated Specialization of Stateful Agent Systems |
Myan Vu, Harrish Ayyanar, PANG JIANG, Anwiketh Reddy, Mayank Goel, Kevin Zhu |
| 164 |
Scaling Open-Ended Reasoning to Predict the Future |
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping |
| 165 |
Verifiable Chemical Reasoning through Tool-Calling Agentic Workflow |
Gabrielle Gaudeau, Shinnosuke Tanaka, Defne Circi, Ian W Kennedy, Movina Moses, Mohab Elkaref |
| 166 |
Fathom-Search-4B: Scaling DeepSearch Reasoning Capabilities via RL |
Shreyas Singh, Kunal Singh, Pradeep Moturi |
| 167 |
SEA: Stateful Execution Environment for Conversational Big Data Analytics |
Rohit Kumar, Ajay Anil Kumar |