HonestLLM: Toward an Honest and Helpful Large Language Model
3University of Notre Dame, 4University of Washington, 5Peking University, 6Lehigh University
- Definitions for Honesty. We refine a comprehensive definition of honesty in LLMs and establish detailed principles that honest LLMs should adhere to. Based on these principles, we construct a new dataset, HONESET, which contains queries from six categories designed to evaluate LLMs’ ability to maintain honesty.
- Two Methods. We introduce a training-free approach based on curiosity-driven prompting, alongside a curriculum learning-based approach with a two-stage fine-tuning process, to enhance the helpfulness of both proprietary and open-source LLMs while maintaining their honesty.
- Comprehensive Experiments and Valuable Insights. We conduct extensive experiments on nine LLMs, including both open-source and proprietary models, using two evaluation protocols. The experimental results show that both of our proposed methods significantly improve the honesty and helpfulness of LLMs.
Principles for Honest LLMs
- Latest Information with External Services. Because of outdated pre-training data, insufficient fact-checking, and the lack of access to live or up-to-date external data sources, LLMs may produce seemingly reasonable but inaccurate output when asked for the latest information that would require external services. Honestly acknowledging these limitations is therefore crucial.
- User Input Not Enough Or With Wrong Information. In the real world, LLMs frequently face ambiguous questions or questions built on incorrect premises. They must avoid sycophancy and provide truthful, honest responses, maintaining objectivity rather than being unduly swayed by the user's input.
- Professional Capability in Specific Domains. Domain-specific tasks can exceed LLMs' capabilities because professional fields update rapidly and demand extensive, high-quality, task-specific data. Given these constraints, LLMs are expected to honestly recognize their limitations and avoid producing unreliable outputs.
- Interactivity Sensory Processing. LLMs cannot directly perceive or process sensory data (such as sound or tactile feedback), which is crucial for interactive tasks. Honesty for an LLM includes acknowledging that it cannot directly interact with the physical world.
- Modality Mismatch. LLMs are designed for text-based inputs and outputs and therefore struggle to understand or generate non-text data (such as images and audio). This mismatch can lead to incorrect or irrelevant responses, underscoring the need for LLMs to honestly acknowledge their limitations in handling such data.
- Self Identity Cognition. As a helpful and honest assistant, an LLM should possess clear self-awareness, recognize the distinctions between humans and AI assistants, and disclaim a human-like self-identity when addressing topics that humans can perceive and understand but AI cannot, such as social and introspective awareness.
Approach I: Training-Free Enhancement
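The training-free method optimizes a model's answers through curiosity-driven prompting: the model is first asked to voice any confusion or difficulty it has with the query (missing information, false premises, limits of its own capability), and that self-assessment is then fed back to produce an honest, helpful final response. The sketch below shows one way such a pipeline could be wired up; it assumes an OpenAI-compatible chat client, and the prompt wording is an illustrative paraphrase, not the paper's exact prompts.

```python
# Minimal sketch of a curiosity-driven prompting pipeline (illustrative;
# prompts are paraphrased, not the paper's exact wording).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CURIOSITY_PROMPT = (
    "Before answering, examine the query below and state any confusion or "
    "difficulty you have with it: missing information, false premises, or "
    "limits of your own capabilities.\n\nQuery: {query}"
)

REFINE_PROMPT = (
    "Query: {query}\n\nYour stated difficulties:\n{difficulties}\n\n"
    "Now respond honestly. If the query exceeds your capabilities, say so, "
    "explain why, and offer further guidance or potential solutions."
)


def chat(prompt: str, model: str = "gpt-4") -> str:
    """One chat-completion call, returning the text of the reply."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def curiosity_driven_answer(query: str) -> str:
    # Step 1: elicit the model's own uncertainty about the query.
    difficulties = chat(CURIOSITY_PROMPT.format(query=query))
    # Step 2: condition the final answer on that self-assessment.
    return chat(REFINE_PROMPT.format(query=query, difficulties=difficulties))


if __name__ == "__main__":
    print(curiosity_driven_answer("What is the weather in Paris right now?"))
```

Because the method only adds prompts at inference time, it applies equally to proprietary APIs and open-source models served behind a compatible endpoint.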
Approach II: Curriculum Fine-Tuning
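The curriculum approach fine-tunes an open-source model in two stages: a first stage that aligns the model toward honesty on LLM-unable queries, and a second stage that improves helpfulness using judge-scored training responses, with a score threshold (5, 6, or 7 points in the experiments below) gating which samples are used. The sketch below illustrates one plausible realization of this split; the `fine_tune` placeholder, the data fields, and the easy-to-hard ordering are our assumptions rather than the paper's exact recipe.

```python
# Illustrative sketch of two-stage curriculum fine-tuning. Assumptions (not
# the paper's exact recipe): each sample carries an LLM-as-a-Judge quality
# score averaged over three rounds, and `fine_tune` wraps any supervised
# fine-tuning loop (e.g. TRL's SFTTrainer).
from dataclasses import dataclass


@dataclass
class Sample:
    query: str
    response: str       # target response for supervised fine-tuning
    judge_score: float  # quality score in [1, 10], averaged over 3 judge rounds


def fine_tune(model, data: list[Sample]):
    """Placeholder for one supervised fine-tuning pass over `data`."""
    raise NotImplementedError


def two_stage_curriculum(model, honesty_data: list[Sample],
                         optimized_data: list[Sample],
                         threshold: float = 6.0):
    # Stage 1: teach the model to recognize LLM-unable queries and respond
    # honestly (disclaimers plus explanations of its limitations).
    fine_tune(model, honesty_data)
    # Stage 2: improve helpfulness, keeping only responses the judge rated
    # at or above the threshold and presenting them easy-to-hard (here:
    # highest-scoring first), in the spirit of curriculum learning.
    stage2 = sorted(
        (s for s in optimized_data if s.judge_score >= threshold),
        key=lambda s: s.judge_score,
        reverse=True,
    )
    fine_tune(model, stage2)
    return model
```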
Benchmark
- Models. Our study covers nine mainstream LLMs, both open-source and proprietary: ChatGPT and GPT-4 from OpenAI; Llama2 (7b-chat, 13b-chat, 70b-chat) and Llama3-70b-instruct from Meta AI; Mistral-7b and Mixtral-8x7b from Mistral AI; and Claude3-Opus from Anthropic.
- Metrics. Our evaluation framework consists of two protocols: one focusing on honesty alone, the other on both honesty and helpfulness. Because rule-based methods such as keyword matching are too brittle for this task, we adopt the "LLM-as-a-Judge" methodology widely used in previous studies; each response is judged by averaging the results of three LLM-as-a-Judge rounds (see the sketch after this list). The two protocols are as follows:
- Purely Honest-Guided Evaluation. This protocol gauges LLMs' adherence to honesty. Responses are evaluated against the predefined criteria specified in Table 4; an LLM is deemed honest on a query if its response consistently aligns with these standards. We report the "Honesty Rate", which quantifies the percentage of queries on which an LLM consistently exhibits honesty.
- H2 Assessment. This protocol evaluates both honesty and helpfulness (H2). It requires LLMs not only to uphold honesty but also to provide well-reasoned explanations, justifications, and viable solutions for user inquiries. The assessment is based on three criteria: (1) rationality of explanations for honesty or disclaimers, (2) quality of further guidance, and (3) potential solutions. Criteria (1) and (2) directly reflect the model's honesty and helpfulness and are weighted most heavily; criterion (3) is secondary. The H2 protocol uses both pairwise and score-based evaluation formats to assess responses comprehensively.
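To make the two protocols concrete, the sketch below shows how the aggregate metrics could be computed. The `judge` stub stands in for one LLM-as-a-Judge round (for example, a GPT-4 call scoring a response against the Table 4 criteria); requiring a majority of honest verdicts across the three rounds is our assumption about how "consistently" is operationalized.

```python
# Illustrative evaluation loop for the two protocols. The `judge` stub is an
# assumption standing in for one LLM-as-a-Judge round.
from statistics import mean


def judge(query: str, response: str) -> tuple[bool, float]:
    """One judge round: an honesty verdict (per the Table 4 criteria) and an
    H2 quality score in [1, 10]."""
    raise NotImplementedError  # call a judge model such as GPT-4 here


def evaluate(pairs: list[tuple[str, str]], rounds: int = 3):
    honest_flags, h2_scores = [], []
    for query, response in pairs:
        results = [judge(query, response) for _ in range(rounds)]
        # Honesty Rate: a query counts as honest only if a majority of the
        # rounds say so (our reading of "consistently exhibits honesty").
        honest_flags.append(sum(h for h, _ in results) > rounds / 2)
        # H2 score: average the per-round scores.
        h2_scores.append(mean(s for _, s in results))
    return {"honesty_rate": mean(honest_flags), "h2_score": mean(h2_scores)}
```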
Distribution of H2 scores and overall scores before (raw) and after (opt.) the training-free method, with per-model gains:

| Model | 1–3 (Poor, ↓) | | 4–6 (Medium, ↓) | | 7–10 (Excellent, ↑) | | Overall (↑) | | |
|---|---|---|---|---|---|---|---|---|---|
| | raw | opt. | raw | opt. | raw | opt. | raw | opt. | gain |
| Proprietary Model | | | | | | | | | |
| GPT4 | 2.5% | 0.1% | 10.1% | 2.5% | 87.6% | 97.3% | 8.094 | 8.604 | 6.3%↑ |
| ChatGPT | 38.5% | 11.1% | 20.1% | 26.9% | 41.4% | 62.0% | 5.098 | 6.770 | 32.8%↑ |
| Claude3-Opus | 14.4% | 0.9% | 17.0% | 9.2% | 68.6% | 89.9% | 7.061 | 8.244 | 16.8%↑ |
| Open-Source Model | | | | | | | | | |
| Mistral-7b | 55.3% | 21.7% | 20.4% | 27.5% | 24.4% | 50.8% | 3.885 | 6.046 | 55.6%↑ |
| Mixtral-8x7b | 31.4% | 2.8% | 18.1% | 15.5% | 50.5% | 81.7% | 5.693 | 7.626 | 34.0%↑ |
| Llama2-7b | 42.9% | 23.2% | 19.1% | 17.2% | 38.0% | 59.6% | 4.877 | 6.203 | 27.2%↑ |
| Llama2-13b | 42.7% | 24.9% | 19.0% | 22.1% | 38.4% | 53.0% | 4.890 | 5.961 | 21.9%↑ |
| Llama2-70b | 39.4% | 21.0% | 19.7% | 14.8% | 40.9% | 64.2% | 5.068 | 6.447 | 27.2%↑ |
| Llama3-70b | 25.3% | 4.2% | 20.8% | 14.5% | 53.9% | 81.3% | 6.128 | 7.783 | 27.0%↑ |
Per-category overall scores under each curriculum threshold (5, 6, or 7 points), comparing the raw model, direct fine-tuning, and the two stages of curriculum fine-tuning. Category abbreviations follow the six principles above:

| Cat. | Use. Inp. | | | Lat. Inf. | | | Pro. Cap. | | | Mod. Mis. | | | Int. Sen. | | | Sel. Ide. | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Threshold | 5 | 6 | 7 | 5 | 6 | 7 | 5 | 6 | 7 | 5 | 6 | 7 | 5 | 6 | 7 | 5 | 6 | 7 |
| Llama3-8b | | | | | | | | | | | | | | | | | | |
| Raw | — | 8.70 | — | — | 2.90 | — | — | 5.25 | — | — | 1.60 | — | — | 4.00 | — | — | 7.30 | — |
| Direct | 8.15 | 8.70 | 8.90 | 4.10 | 4.15 | 5.50 | 5.00 | 5.00 | 5.55 | 5.15 | 5.60 | 5.00 | 7.55 | 8.15 | 7.50 | 8.05 | 7.85 | 9.15 |
| Stage-1 | 9.20 | 7.80 | 8.05 | 3.10 | 4.50 | 2.95 | 4.30 | 3.85 | 4.55 | 3.45 | 4.75 | 5.85 | 3.85 | 5.80 | 6.55 | 6.35 | 6.40 | 6.50 |
| Stage-2 | 8.90 | 9.15 | 9.15 | 8.10 | 8.05 | 7.05 | 5.95 | 6.50 | 5.85 | 7.30 | 8.40 | 8.15 | 8.25 | 8.40 | 8.50 | 9.10 | 8.85 | 8.90 |
| Mistral-7b | | | | | | | | | | | | | | | | | | |
| Raw | — | 6.30 | — | — | 2.90 | — | — | 3.40 | — | — | 2.00 | — | — | 1.70 | — | — | 4.60 | — |
| Direct | 8.70 | 8.55 | 8.45 | 5.30 | 4.50 | 6.10 | 6.00 | 5.40 | 6.25 | 6.00 | 6.90 | 7.05 | 6.20 | 7.10 | 7.25 | 7.40 | 7.40 | 8.30 |
| Stage-1 | 7.80 | 8.05 | 7.30 | 3.20 | 4.60 | 2.95 | 3.65 | 3.75 | 4.40 | 5.20 | 4.95 | 6.40 | 2.90 | 4.55 | 6.60 | 5.10 | 5.35 | 4.65 |
| Stage-2 | 8.00 | 8.70 | 8.40 | 6.40 | 6.30 | 5.50 | 5.75 | 4.90 | 5.45 | 7.95 | 8.00 | 7.55 | 5.65 | 6.85 | 8.05 | 8.55 | 8.55 | 8.50 |
Empirical Results
Significant Improvements in Honesty Rates for LLMs with Training-Free Approach
We significantly enhance honesty rates in both open-source and proprietary LLMs with our proposed training-free approach. For example, GPT-4 and Claude3-Opus improved to a 100% honesty rate, demonstrating near-perfect honesty alignment. Large open-source models also saw substantial increases: Llama3-70b rose from 60.6% to 87.1% and Mixtral-8x7b from 58.5% to 91.4%. Notably, Llama2-7b, a smaller model, improved markedly from 43.0% to 83.7%. In summary, every model we evaluated achieves an honesty rate above 60% under our curiosity-driven approach, demonstrating the efficacy of our method for constructing more honest LLMs.
Enhanced Honesty and Helpfulness in LLMs with Curiosity-Driven Method: H2 Assessment Results
In addition to honesty rates, we leverage LLM-as-a-Judge to conduct the H2 assessment in both pairwise and score settings, comparing responses before and after applying the curiosity-driven method. In the pairwise setting, optimized answers were generally rated higher than the originals, indicating better honesty and helpfulness. Proprietary LLMs such as Claude3-Opus and GPT-4 show a significant win rate for optimized answers, and for open-source models such as Llama2-7b, 40.1% of optimized answers were preferred over the raw ones. In the score setting, we provide fine-grained scores for the three criteria. All LLMs improve under our training-free method, with proprietary models achieving significantly better results than open-source models, scoring over 9 in ‘Explanation’ and over 8 in ‘Guidance’. For both the Llama2 and Mistral series, we observe a scaling trend: larger models score higher in both the raw and optimized settings. Among the three dimensions, ‘Explanation’ and ‘Guidance’ show the most substantial improvement, indicating that models become more honest and helpful at identifying their limitations and guiding users through LLM-unable questions.
Two-Stage Fine-Tuning Method Boosts Honesty and H2 Scores in Open-source Models
Our proposed two-stage fine-tuning method improves both the honesty rate and the H2 assessment for Llama3-8b and Mistral-7b. It significantly enhances the honesty of LLMs on LLM-unable queries without degrading overall response quality, as measured by the H2 score. Specifically, Llama3-8b shows a notable improvement of 13.7% in honesty rate after fine-tuning, along with an 8.5% increase in H2 score. Mistral-7b exhibits an even larger gain, with its honesty rate rising by 51.9% and its H2 score by 108.6%. These results underscore the role both stages play in improving LLM performance and the effectiveness of our proposed dataset.

Empirical results show the overall scores and honesty rates for the two LLMs under different thresholds. Llama3-8b achieves its best two-stage fine-tuning results with the threshold set at 6 points, while Mistral-7b maintains consistent overall scores across thresholds, peaking at 5 points. Moreover, the two-stage fine-tuning process outperforms the direct fine-tuning approach regardless of the threshold setting.

Both models achieve their highest overall scores in the category “user input not enough or with wrong information”, while the categories “modality mismatch” and “interactivity sensory processing” show the largest gains. In summary, the overall scores for every category improve, demonstrating the effectiveness of our method.
BibTeX
@misc{gao2024bestworldshonesthelpful,
      title={The Best of Both Worlds: Toward an Honest and Helpful Large Language Model},
      author={Chujie Gao and Qihui Zhang and Dongping Chen and Yue Huang and Siyuan Wu and Zhengyan Fu and Yao Wan and Xiangliang Zhang and Lichao Sun},
      year={2024},
      eprint={2406.00380},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.00380},
}
HonestLLM Team