MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
To appear at ICLR 2024
Abstract
To solve complex tasks, large language models (LLMs) often require multiple rounds of
interactions with the user, sometimes assisted by external tools.
However, current evaluation protocols often emphasize benchmark performance with single-turn
exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools,
while also underestimating the importance of natural language feedback from users. These
oversights contribute to discrepancies between research benchmark evaluations and real-world
use cases.
We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn
interactions by (1) using tools and (2) leveraging natural language feedback.
To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by
executing Python code and receive users' natural language feedback simulated by GPT-4.
We repurpose a diverse set of established evaluation datasets focusing on reasoning, coding,
and decision-making and carefully curate them into a compact subset for efficient
evaluation.
Our analysis of 20 open- and closed-source LLMs offers intriguing findings.
- (a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural language feedback.
- (b) Better single-turn performance does not guarantee better multi-turn performance.
- (c) Surprisingly, on the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
We expect MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation can be less accessible compared to commercial LLMs with a larger user base.
Interaction Framework
MINT mirrors the real-world User-LLM-Tool collaborative problem-solving setting. To solve a problem, the LLM can (1) use external tools by generating and executing Python programs and/or (2) collect natural language feedback to refine its solutions; the feedback is provided by GPT-4, aiming to simulate human users in a reproducible and scalable way.
- We measure LLMs' tool-augmented task-solving capability by analyzing their performance gain with an increasing number of turns, without language feedback (i.e., no red dotted box in the figure below).
- We quantify LLMs' ability to leverage natural language feedback by the performance gain upon receiving GPT-4-generated feedback (i.e., performance without vs. with the red dotted box in the figure below).
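Below is a minimal sketch of this interaction loop in Python. The helper callables (`call_llm`, `execute_python`, `simulate_feedback`) and the `FINAL ANSWER` stopping convention are illustrative assumptions for exposition, not the actual MINT implementation or prompt format.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    action: str          # LLM output: a Python program or a proposed final answer
    observation: str     # execution result returned by the tool
    feedback: str = ""   # GPT-4-simulated user feedback (empty when disabled)

def solve_task(
    task: str,
    call_llm: Callable[[str, List["Turn"]], str],           # proposes the next action
    execute_python: Callable[[str], str],                    # runs code, returns stdout/traceback
    simulate_feedback: Callable[[str, List["Turn"]], str],   # GPT-4 acting as the user
    max_turns: int = 5,
    use_feedback: bool = True,
) -> List["Turn"]:
    """Run up to `max_turns` of tool-augmented problem solving, optionally with feedback."""
    history: List[Turn] = []
    for _ in range(max_turns):
        action = call_llm(task, history)
        if action.strip().startswith("FINAL ANSWER"):   # model commits to an answer
            history.append(Turn(action, observation=""))
            break
        observation = execute_python(action)            # tool call: execute the generated code
        feedback = simulate_feedback(task, history) if use_feedback else ""
        history.append(Turn(action, observation, feedback))
    return history
```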
Evaluation
We evaluate 20 LLMs, of which 4 are closed-source and 16 are open-source. We cover different sizes and training techniques to better understand how they affect LLMs' multi-turn interaction capability. We consider three variants of training techniques:
- Base: Pre-trained model
- SIFT: Supervised Instruction-Finetuning
- RLHF: Reinforcement Learning from Human Feedback
Tool-Augmented Task-Solving Capabilities of LLMs
- We find all open-source models fall behind most commercial closed-source models in both success rate at k=5 and improvement rate (slope); a sketch of these two metrics follows this list.
- Absolute performance and improvement-per-turn (i.e., slope) scale with model size.
- SIFT on multi-turn data can potentially be helpful. Vicuna-v1.5 (7B), a SIFT variant of LLaMA-2 trained on ShareGPT conversations (most of which are multi-turn), exhibits stronger performance than LLaMA-2 (Base and RLHF). We observe a similar trend for Lemur-70b-chat-v1, which continues pre-training LLaMA-2 (70B) on code-intensive data, followed by SIFT on multi-turn data.
- We find RLHF hurts LLM-tool multi-turn interaction on the LLaMA-2 series. However, it is unclear whether RLHF is problematic overall, or whether the issue only arises when RLHF is applied primarily to single-turn data.
- We find some performance degradation in Vicuna-v1.5 (especially the 13B variant), potentially due to training artifacts. We refer to Section 3.5 of the paper for more details.
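As a rough illustration of the two quantities referenced above, the snippet below computes success rate at a turn budget k and treats the improvement rate as the least-squares slope of success rate over k. The per-task `solved_turn` records and the example numbers are hypothetical, not MINT results.

```python
import numpy as np

def success_rate_at_k(solved_turn, k):
    """Fraction of tasks solved within the first k turns (None = never solved)."""
    return sum(t is not None and t <= k for t in solved_turn) / len(solved_turn)

def improvement_rate(solved_turn, ks=(1, 2, 3, 4, 5)):
    """Least-squares slope of success rate as a function of the turn budget k."""
    rates = [success_rate_at_k(solved_turn, k) for k in ks]
    slope, _intercept = np.polyfit(ks, rates, deg=1)
    return slope

# Hypothetical example: the turn at which each of 10 tasks was first solved.
solved = [1, 2, None, 3, 1, None, 5, 2, 4, None]
print(success_rate_at_k(solved, k=5), improvement_rate(solved))
```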
LLMs' Ability to Leverage Natural Language Feedback
- We find no significant difference between open- and closed-source models in terms of Δfeedback (sketched after this list).
- Similar to the previous findings, SIFT and RLHF hurt models' ability to leverage feedback on CodeLLaMA (except 7B) and LLaMA-2: these variants all have lower Δfeedback and lower success rate (with feedback) than their base variants. The two exceptions are Vicuna and Lemur-v1; we speculate that using multi-turn conversations (ShareGPT) for SIFT contributes to these exceptions.
- Models hardly benefit from self-feedback. We find GPT-4-0613 using self-generated feedback gains little: only decision-making improves slightly.
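For concreteness, Δfeedback is simply the absolute gain in success rate from adding language feedback; the success rates below are made up for illustration.

```python
# Hypothetical success rates (fractions of tasks solved) with and without
# GPT-4-simulated feedback; Δfeedback is their difference.
success = {
    "model-A": {"no_feedback": 0.30, "with_feedback": 0.42},
    "model-B": {"no_feedback": 0.25, "with_feedback": 0.27},
}

delta_feedback = {
    name: rates["with_feedback"] - rates["no_feedback"]
    for name, rates in success.items()
}
print(delta_feedback)  # e.g. {'model-A': ~0.12, 'model-B': ~0.02}
```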
LLMs' Ability to Provide Natural Language Feedback
In this section, we fix the evaluated LLM (gpt-3.5-turbo-0613) and use different LLMs to provide language feedback. This allows us to measure different LLMs' effectiveness in providing feedback.
We find that task-solving ability could be orthogonal to feedback-providing ability: an LLM's higher task-solving performance does not necessarily translate to better feedback-providing capability, and vice versa.
For example, despite performing the worst in solving tasks, CodeLLaMA (34B, SIFT) can
provide feedback that improves the stronger GPT-3.5.
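A sketch of this protocol is below, assuming a hypothetical `run_benchmark(solver, feedback_provider)` helper that returns a success rate; it is not part of the released MINT code.

```python
from typing import Callable, Optional

def delta_feedback_by_provider(
    run_benchmark: Callable[[str, Optional[str]], float],  # (solver, feedback provider) -> success rate
    solver: str = "gpt-3.5-turbo-0613",
    providers: tuple = ("gpt-4-0613", "CodeLLaMA-34B-SIFT"),
) -> dict:
    """Δfeedback of a fixed task-solving LLM under different feedback-providing LLMs."""
    baseline = run_benchmark(solver, None)   # no language feedback
    return {p: run_benchmark(solver, p) - baseline for p in providers}
```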
BibTeX
@misc{wang2023mint,
title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
year={2023},
eprint={2309.10691},
archivePrefix={arXiv},
primaryClass={cs.CL}
}