SPIN-Bench
How Well Do LLMs Plan Strategically and Reason Socially?
Introduction
We introduce Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a comprehensive framework for evaluating long-horizon strategic planning and social intelligence in Large Language Models (LLMs). Unlike prior work that confines itself to narrow planning or isolated single-agent tasks, SPIN-Bench combines formal PDDL challenges, competitive board games, cooperative card games, and multi-agent negotiation scenarios within a single evaluation.
By systematically varying action spaces, state complexity, and the number of interacting agents, SPIN-Bench tests not only methodical, step-wise decision-making but also conceptual inference about hidden information and adversarial or cooperative strategies. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty.
In particular, we find that strong models (e.g., o1) can still struggle with extended-horizon planning when multiple agents and hidden intentions are introduced, and that extensive social interaction can sometimes degrade chain-of-thought coherence. These insights highlight persistent gaps in multi-agent negotiation, alliance formation, and perspective-taking, underscoring where further advances in LLM architectures and training might be needed.
By drawing on both human baselines and domain-specific solvers, our results shed light on the real-world potential and current shortcomings of LLMs for strategic, multi-agent settings. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human–AI teaming.
Task Taxonomy and Environments
The SPIN-Bench framework integrates four distinct environment types:
- PDDL Tasks: Classical planning problems across 21 domains (1,280 tasks) spanning factual retrieval, spatial reasoning, and multi-step planning over increasingly large state spaces (a toy sketch of this task format appears after this overview).
- Competitive Games: Turn-based board games of escalating complexity (Tic-tac-toe, Connect Four, Chess) that test adversarial reasoning from short-range tactics to deeper strategic thinking.
- Cooperative Games: Featuring Hanabi, a card game where players see others' cards but not their own, requiring trust-building, inference about hidden states, and coordinated actions.
- Strategic Games: Incorporating Diplomacy, where negotiation, alliance formation, and strategic betrayal are integral, testing both planning capabilities and social intelligence.
This structured progression allows us to systematically pinpoint where LLM reasoning breaks down—whether in state tracking, partial-order reasoning, chain-of-thought coherence, or dynamic social interaction. By combining these environments within a unified evaluation framework, SPIN-Bench provides unprecedented insight into how LLMs transition from basic planning to complex multi-agent reasoning.
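To make the classical-planning task format concrete, here is a minimal Python sketch: a toy problem expressed as an initial state, a set of ground actions with preconditions and effects, and a goal, solved by breadth-first search. The facts and action names are illustrative and do not come from the benchmark's 21 domains.

```python
from collections import deque

# A toy ground planning problem in the spirit of PDDL: states are sets of facts,
# and each action has preconditions, an add list, and a delete list.
# The "gripper"-style facts below are illustrative only.
ACTIONS = {
    "pick(ball, roomA)":  (frozenset({"at(ball, roomA)", "robot-at(roomA)", "hand-empty"}),
                           frozenset({"holding(ball)"}),
                           frozenset({"at(ball, roomA)", "hand-empty"})),
    "move(roomA, roomB)": (frozenset({"robot-at(roomA)"}),
                           frozenset({"robot-at(roomB)"}),
                           frozenset({"robot-at(roomA)"})),
    "drop(ball, roomB)":  (frozenset({"holding(ball)", "robot-at(roomB)"}),
                           frozenset({"at(ball, roomB)", "hand-empty"}),
                           frozenset({"holding(ball)"})),
}

INITIAL = frozenset({"at(ball, roomA)", "robot-at(roomA)", "hand-empty"})
GOAL = frozenset({"at(ball, roomB)"})


def bfs_plan(initial, goal, actions):
    """Return a shortest action sequence reaching a state that satisfies the goal."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                      # every goal fact holds
            return plan
        for name, (pre, add, delete) in actions.items():
            if pre <= state:                   # preconditions satisfied
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None


if __name__ == "__main__":
    print(bfs_plan(INITIAL, GOAL, ACTIONS))
    # ['pick(ball, roomA)', 'move(roomA, roomB)', 'drop(ball, roomB)']
```

SPIN-Bench's PDDL tasks follow the same initial-state / action-space / goal-state structure, but over far larger state spaces than this toy example.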
Model Rankings
We evaluate the performance of various large language models across planning tasks, competitive games, and collaborative scenarios. Here are the top-performing models based on our comprehensive evaluation:
| Rank | Model | Planning avg | Competitive avg | Collaborative avg | Average Score |
|---|---|---|---|---|---|
Game Trajectory Visualization
Our benchmark includes a diverse set of games and tasks that test strategic planning and social reasoning. Here are some examples of the game trajectories and tasks that we include in our benchmark:
🏁 PDDL
Classical planning tasks across 21 domains with varying complexity. Given an initial state, an action space, and a goal state, the model must produce a plan that reaches the goal.
⭕ Tic-tac-toe
A simple competitive game played on a 3×3 grid, evaluating LLMs' understanding of basic rules, turn-taking, and elementary strategic planning against solvers and other LLMs.
🔴 Connect Four
An intermediate strategy game with a 6×7 vertical grid where players drop colored discs, requiring foresight to align four discs while blocking opponents' attempts.
♟️ Chess
A complex strategic board game played on an 8×8 checkered board, testing advanced planning, deep calculation, pattern recognition, and sophisticated decision-making.
🎆 Hanabi
A cooperative card game where players see everyone else's cards but not their own, testing coordination with partial information across teams of 2-5 LLM agents.
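The sketch below illustrates this observation structure, assuming a simple list-of-hands representation rather than the benchmark's actual interface: each player's observation contains every other player's cards but only the count of their own.

```python
import random

COLORS = ["red", "green", "blue", "yellow", "white"]
RANKS = [1, 1, 1, 2, 2, 3, 3, 4, 4, 5]  # standard Hanabi rank multiset per color


def deal(num_players, hand_size, rng):
    """Shuffle the 50-card Hanabi deck and deal `hand_size` cards per player."""
    deck = [(c, r) for c in COLORS for r in RANKS]
    rng.shuffle(deck)
    return [[deck.pop() for _ in range(hand_size)] for _ in range(num_players)]


def observation(hands, player):
    """What `player` is allowed to see: everyone else's cards,
    but only the size of their own (hidden) hand."""
    return {
        "own_hand_size": len(hands[player]),
        "visible_hands": {i: hand for i, hand in enumerate(hands) if i != player},
    }


if __name__ == "__main__":
    rng = random.Random(0)
    hands = deal(num_players=3, hand_size=5, rng=rng)
    obs = observation(hands, player=0)
    print(obs["own_hand_size"])          # 5 -- player 0 never sees their own cards
    print(sorted(obs["visible_hands"]))  # [1, 2]
```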
🌍 Diplomacy
A grand strategy game featuring seven European powers, testing negotiation skills, alliance formation, spatial reasoning, and complex strategic planning in a multi-agent environment.
LLM vs Solver Game Trajectories
To establish rigorous baselines, we evaluate LLMs against optimal or near-optimal solvers. These matchups reveal how models perform against mathematically perfect play, highlighting their strategic reasoning capabilities and limitations:
In Tic-tac-toe, LLMs compete against a perfect Minimax solver that never loses. This tests basic game understanding and the ability to achieve draws through optimal play in a theoretically solved game.
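As a point of reference for what "perfect play" means here, the following is a minimal negamax sketch for Tic-tac-toe; it illustrates the idea and is not the benchmark's solver implementation.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, best_move) from `player`'s view: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    if "." not in board:
        return 0, None
    opponent = "O" if player == "X" else "X"
    best = (-2, None)
    for i, cell in enumerate(board):
        if cell == ".":
            child = board[:i] + player + board[i + 1:]
            score = -minimax(child, opponent)[0]   # opponent's best is our worst
            if score > best[0]:
                best = (score, i)
    return best


if __name__ == "__main__":
    print(minimax("." * 9, "X"))   # (0, 0): perfect play from the empty board is a draw
```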
In Connect Four, LLMs play against a solver that computes the optimal move for any board position, testing deeper tactical awareness and multi-step planning.
In Chess, LLMs face the Stockfish engine at several skill levels (0, 5, 10, 15, and 20). Even against reduced-strength settings, these games reveal significant gaps in deep calculation.
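For reference, a reduced-strength Stockfish opponent can be set up with the python-chess package roughly as follows; the binary path, time limit, and loop structure are assumptions for illustration, not the benchmark's actual harness.

```python
import random

import chess
import chess.engine

STOCKFISH_PATH = "stockfish"   # assumes a Stockfish binary is on PATH


def play_vs_stockfish(choose_llm_move, skill_level, time_per_move=0.1):
    """Play one game: `choose_llm_move(board)` supplies White's moves, while
    Stockfish, capped at `skill_level` (0-20), replies as Black."""
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    try:
        engine.configure({"Skill Level": skill_level})
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                board.push(choose_llm_move(board))
            else:
                result = engine.play(board, chess.engine.Limit(time=time_per_move))
                board.push(result.move)
    finally:
        engine.quit()
    return board.result()   # "1-0", "0-1", or "1/2-1/2"


if __name__ == "__main__":
    # Placeholder "LLM" that picks a random legal move, just to exercise the loop.
    print(play_vs_stockfish(lambda b: random.choice(list(b.legal_moves)), skill_level=5))
```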
Game Settings and Evaluation Metrics
The SPIN-Bench Framework
Building on the motivations outlined in our introduction, SPIN-Bench's architecture is organized around three progressively complex problem settings for automated action selection: Classical Planning (single-agent, deterministic), Multi-Agent Games (cooperative or competitive), and Strategic Games (mixed cooperation, competition, and negotiation). Each setting introduces additional layers of complexity, requiring increasingly sophisticated reasoning capabilities.
The framework consists of two core components: (1) the Game Agent, which encompasses the LLMs and their adaptive prompting, and (2) the Environment and Evaluation subsystem, which manages game logic, tracks interactions, and quantifies performance. Our flexible interface feeds models the current state description, relevant history, and legal actions, enabling standardized evaluation across diverse scenarios while maintaining game-specific requirements.
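As a rough illustration of this interface (not the benchmark's actual API; all class, method, and parameter names here are hypothetical), the interaction loop might look like the following sketch:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    state_description: str       # textual description of the current game state
    history: List[str]           # relevant prior moves and messages
    legal_actions: List[str]     # actions the agent may choose from


@dataclass
class GameAgent:
    """Wraps an LLM behind a simple observation -> action interface."""
    llm: Callable[[str], str]    # any text-in / text-out model call
    name: str = "agent"

    def act(self, obs: Observation) -> str:
        prompt = (
            "State:\n" + obs.state_description + "\n\n"
            "History:\n" + "\n".join(obs.history[-10:]) + "\n\n"
            "Legal actions: " + ", ".join(obs.legal_actions) + "\n"
            "Reply with exactly one legal action."
        )
        reply = self.llm(prompt).strip()
        # Fall back to the first legal action if the reply is not a legal action.
        return reply if reply in obs.legal_actions else obs.legal_actions[0]


def run_episode(env, agents: List[GameAgent]) -> dict:
    """Generic turn loop: the environment reports whose turn it is, builds the
    observation, applies the chosen action, and returns final scores."""
    while not env.done():
        player = env.current_player()
        action = agents[player].act(env.observe(player))
        env.step(player, action)
    return env.scores()
```

The key design point, as described above, is that the environment (not the model) supplies the state description, history, and legal actions, which keeps evaluation standardized across games.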
For evaluation, we employ multiple metrics tailored to each environment type. Our rule-based metrics include accuracy and N-Step Look Ahead for planning tasks, move quality comparison against solvers for competitive games, and final scores for cooperative scenarios. We maintain leaderboard-based comparisons with internal Elo ratings to gauge relative performance across models and against human baselines. For negotiation-heavy settings, we utilize six fine-grained, LLM-assisted negotiation metrics that analyze message-strategy alignment, proposal acceptance, deal equity, conflict tendencies, perspective-taking, and conditional negotiation abilities.
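For the internal Elo ratings used in the leaderboard comparisons, a standard rating update looks like the sketch below; the K-factor and starting rating are conventional defaults, not values specified by SPIN-Bench.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo (logistic) model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two models start at 1500; the first wins one game.
    print(update_elo(1500.0, 1500.0, score_a=1.0))   # (1516.0, 1484.0)
```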
Experimental Results
To investigate whether LLMs' planning deficits stem from weaker spatial understanding, we designed tasks requiring each model to track positions across sequences of relative movements. This figure plots the accuracy of each model against the length of the movement trajectory. Notably, o1-mini and GPT-4o exhibit declining performance as the number of steps increases, whereas o1 sustains perfect accuracy (100%) up to 29 steps.
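As an illustration of this task design (the grid conventions and wording are assumptions, not the benchmark's exact format), such a relative-movement query can be generated as follows:

```python
import random

# Unit offsets for relative moves on a grid (x grows right, y grows up).
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def make_tracking_task(num_steps: int, rng: random.Random):
    """Build a relative-movement trajectory and the position it ends at."""
    x, y = 0, 0
    steps = []
    for _ in range(num_steps):
        name = rng.choice(list(MOVES))
        dx, dy = MOVES[name]
        x, y = x + dx, y + dy
        steps.append(name)
    question = (
        "You start at (0, 0). You move " + ", then ".join(steps) +
        ". Where are you now?"
    )
    return question, (x, y)


if __name__ == "__main__":
    rng = random.Random(0)
    q, answer = make_tracking_task(num_steps=5, rng=rng)
    print(q)
    print("ground truth:", answer)   # compare against the model's reply
```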
Here, we investigate whether LLMs can reliably retrieve key facts from a planning trajectory. This figure illustrates how retrieval accuracy varies with trajectory length. Notably, o1 performs most consistently, confirming that it "reads" multi-step expansions more accurately than either GPT-4o or o1-mini.
In Diplomacy, we design several factual queries and categorize them as one-hop vs. multi-hop to further probe models' factual retrieval in a highly strategic environment. The figure shows that nearly all LLMs do well on basic location or adjacency checks but degrade sharply on "Attackable" and "Attack Analysis," which demand deeper, multi-hop inference. Again, o1 and o1-preview lead, but still exhibit significant drops compared to simpler tasks.
Complete draw rates for LLMs playing against solvers in Tic-tac-toe, Connect Four, and Chess. The solvers win or draw every game, never losing a single match. In Tic-tac-toe, advanced LLMs (e.g., o1, GPT-4-turbo, Claude 3.5 Sonnet) achieve draws some of the time but lose the remaining games. In Connect Four and Chess, the gap widens: the solver and the Stockfish engines maintain a 100% win rate across all tested LLMs.
The Top Move distribution shows that while LLMs sometimes pick optimal moves in Connect Four, their accuracy drops drastically in Chess, underscoring how deeper tactics and branching expansions are beyond current LLMs' capacity.
Diplomacy also allows variable numbers of participating powers. Detailed results of more multi-agent settings are shown here. As the agent count grows (beyond 2-3 test seats for LLMs), we observe decreasing order accuracy, fewer successful attacks, and minimal supply-center gains. Ultimately, LLMs lose traction in highly interactive scenarios, underscoring how partial observability and shifting alliances further intensify the multi-agent complexity.
We collected 54,977 human-played Hanabi games from BoardGameGeek, spanning 2- to 5-player settings. This figure plots the human score distribution, highlighting quartiles (Q1–Q4) around a typical range of 15–25 points. While some LLMs do show patterns of declining performance with more agents, none approach even the first quartile of human scores. This underscores the significant gap in cooperative planning under hidden-information constraints, despite Hanabi's narrower branching factor relative to some competitive games.
BibTeX
@misc{yao2025spinbenchllmsplanstrategically,
title={SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?},
author={Jianzhu Yao and Kevin Wang and Ryan Hsieh and Haisu Zhou and Tianqing Zou and Zerui Cheng and Zhangyang Wang and Pramod Viswanath},
year={2025},
eprint={2503.12349},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.12349},
}