
Point-in-Time Character Hallucination

An example of point-in-time character hallucination: (Right) The agent erroneously mentions a future event.


Point-in-time role-playing LLM agents should accurately reflect characters' knowledge boundaries, avoiding future events and correctly recalling past ones. While they often suffer from character hallucination, displaying knowledge inconsistent with the character's identity and historical context, evaluating character consistency and robustness against such hallucinations remains underexplored.

The TimeChara benchmark

Automated pipeline for constructing TimeChara.


To address this issue, we develop TimeChara, a benchmark of 11K test examples constructed with an automated pipeline. It evaluates point-in-time character hallucination using 14 characters selected from four renowned novel series: Harry Potter, The Lord of the Rings, Twilight, and The Hunger Games.

We organize our dataset in an interview format where an interviewer poses questions and the character responds. Specifically, we differentiate between fact-based and fake-based interview questions across four data types:

  • [Fact-based] Unawareness of the future (Future type): The character at the chosen time point should not know about future events (e.g., "Who is your wife?" to first-year Harry Potter).
  • [Fact-based] Memorization of the past & Awareness of absence (Past-Absence type): The character should recognize their absence from the event (e.g., "Did you see the moment when Ron Weasley took the enchanted car to Hogwarts?" to second-year Hermione Granger on Christmas).
  • [Fact-based] Memorization of the past & Awareness of presence (Past-Presence type): The character should acknowledge their presence at the event (e.g., "Did you see the moment when Ron Weasley took the enchanted car to Hogwarts?" to second-year Harry Potter on Christmas).
  • [Fact-based] Memorization of the overall knowledge of the past (Past-Only type): These questions assess the character’s overall understanding of past events, including relationships between characters (e.g., "Who is Dobby?" to second-year Harry Potter on Halloween). The term "only" indicates that these questions focus on the character’s memory of past information, not necessarily tied to their participation in events.

  • [Fake-based] Memorization of the overall knowledge of the past & Identification of the fake event (Past-Only type): The character should identify and correct errors in questions that include fake events (e.g., "How did you become Slytherin?" to first-year Harry Potter on September 1st; the correct answer is that he became a Gryffindor).
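The data types above can be illustrated with a minimal sketch of what a single test example might look like. This is purely hypothetical: the field names and schema are ours for illustration, not the benchmark's actual format.

```python
from dataclasses import dataclass

@dataclass
class TimeCharaExample:
    """Hypothetical schema for one TimeChara-style test example."""
    character: str       # e.g., "Harry Potter"
    time_point: str      # the point in the narrative the character is fixed at
    question_type: str   # "future", "past-absence", "past-presence", or "past-only"
    fact_based: bool     # True for fact-based, False for fake-based questions
    question: str
    gold_behavior: str   # the expected behavior of a consistent agent

# A "future type" example taken from the description above.
example = TimeCharaExample(
    character="Harry Potter",
    time_point="first year, September 1st",
    question_type="future",
    fact_based=True,
    question="Who is your wife?",
    gold_behavior="Refuse: the character cannot know about future events.",
)
```

Grouping the four type labels with a `fact_based` flag mirrors how the fake-based category reuses the Past-Only type while testing a different skill (error identification).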

Evaluation on TimeChara: Since manual evaluation of role-playing LLMs' responses is not scalable, we adopt the LLM-as-judge approach to assess two key dimensions:

  • Spatiotemporal Consistency (Primary metric): Evaluates if the agent accurately recalls a character's past experiences, including the character's unawareness of future events and awareness of presence/absence in past events. This metric is time-dependent, assessing responses based on the character's known history up to a specific point in time.
  • Personality Consistency (Secondary metric): Assesses if the agent emulates a character's personality, including their manner of thinking, speaking styles, tones, emotional responses, and reactions. This metric is time-independent and measures alignment with the character's enduring personal traits.

We use the "GPT-4 Turbo"-as-judges approach to score responses step-by-step in each dimension. For spatiotemporal consistency, responses are rated as 0 for inconsistency and 1 for alignment. Personality consistency is rated on a 1-7 Likert scale, where 1 indicates weak reflection and 7 indicates an exact match.
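Given per-response judge scores in the two scales described above (binary spatiotemporal consistency, 1–7 personality), aggregation is straightforward. The sketch below stubs out the judge itself (GPT-4 Turbo in the paper) and only shows the assumed aggregation: mean accuracy for the binary metric, mean Likert score for personality.

```python
def aggregate_scores(judgments):
    """Aggregate judge scores over a set of responses.

    judgments: list of (spatiotemporal, personality) pairs, where
    spatiotemporal is 0 (inconsistent) or 1 (aligned) and
    personality is a 1-7 Likert rating.
    """
    assert all(s in (0, 1) and 1 <= p <= 7 for s, p in judgments)
    n = len(judgments)
    spatiotemporal_acc = sum(s for s, _ in judgments) / n  # fraction aligned
    personality_mean = sum(p for _, p in judgments) / n    # mean Likert score
    return spatiotemporal_acc, personality_mean

acc, pers = aggregate_scores([(1, 6), (0, 5), (1, 7), (1, 4)])
# acc = 0.75, pers = 5.5
```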

Results

Role-playing LLMs Struggle with Point-in-Time Character Hallucinations

Results of point-in-time character hallucination on 600 sampled data instances. All responses are evaluated by GPT-4 Turbo (gpt-4-1106-preview) as judges, with the exception of measuring AlignScore.

The results reveal that even state-of-the-art LLMs like GPT-4 and GPT-4 Turbo struggle with point-in-time character hallucination. Notably, all baseline methods are confused by "future type" questions, achieving accuracies of 51% or below. Among them, the naive RAG model performs the worst, indicating that indiscriminately providing context can harm performance. This highlights a significant issue: role-playing LLM agents inadvertently disclose future events. For "past-absence" and "past-only" questions, the naive RAG and RAG-cutoff methods (i.e., limiting retrieval exclusively to events prior to the character's time point) reduce hallucinations to some extent via their retrieval modules. Even so, all baseline methods fall noticeably short of their performance on "past-presence" questions. On the other hand, most baseline methods perform well on "past-presence" questions, showcasing the LLMs' proficiency in memorizing extensive knowledge from novel series and precisely answering questions about narratives.
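The RAG-cutoff idea mentioned above can be sketched as a filter over retrieved passages. This is an illustrative toy, assuming each passage carries (book, chapter) metadata marking its narrative position; a real retriever and its metadata format are not specified here.

```python
def rag_cutoff(retrieved, character_position):
    """Keep only passages occurring strictly before the character's
    (book, chapter) position; Python's tuple comparison orders
    (book, chapter) pairs lexicographically, which matches narrative order."""
    return [p for p in retrieved
            if (p["book"], p["chapter"]) < character_position]

passages = [
    {"book": 1, "chapter": 7, "text": "The Sorting Hat..."},
    {"book": 4, "chapter": 31, "text": "The Triwizard maze..."},
]

# For a character early in book 2, only the book-1 passage survives,
# so later events can never leak into the prompt via retrieval.
allowed = rag_cutoff(passages, character_position=(2, 1))
```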


Narrative-Experts: Decomposed Reasoning via Narrative Experts

To overcome these hallucination problems, we propose a reasoning method named Narrative-Experts, which decomposes reasoning steps into specialized tasks, employing narrative experts on either temporal or spatial aspects while utilizing the same backbone LLM.

  • Temporal Expert: This expert pinpoints the scene’s book and chapter from a question, assigning a future or past label. If deemed future, it bypasses the Spatial Expert and advises the role-playing agent with a specific hint (i.e., "Note that the period of the question is in the future relative to {character}’s time point. Therefore, you should not answer the question or mention any facts that occurred after {character}’s time point.").
  • Spatial Expert: It assesses whether a character is involved in the scene, indicating a "past-absence" label if applicable. A tailored hint is then provided to the role-playing agent if the scene is past-absence (i.e., "Note that {character} had not participated in the scene described in the question. Therefore, you should not imply that {character} was present in the scene.").

Finally, the role-playing LLM incorporates hints from these experts into the prompt and generates a response. In addition, we also explore Narrative-Experts-RAG-cutoff, which integrates Narrative-Experts with the RAG-cutoff method.
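The control flow described above can be summarized in a short sketch. The two experts are stubbed as plain callables here; in the actual method the same backbone LLM plays each role via separate prompts, and the function name is ours.

```python
def narrative_experts_hint(question, character, temporal_expert, spatial_expert):
    """Return the hint to prepend to the role-playing agent's prompt.

    temporal_expert(question) -> "future" or "past"
    spatial_expert(question)  -> "past-absence" or "past-presence"
    """
    if temporal_expert(question) == "future":
        # Future questions bypass the Spatial Expert entirely.
        return (f"Note that the period of the question is in the future "
                f"relative to {character}'s time point. Therefore, you should "
                f"not answer the question or mention any facts that occurred "
                f"after {character}'s time point.")
    if spatial_expert(question) == "past-absence":
        return (f"Note that {character} had not participated in the scene "
                f"described in the question. Therefore, you should not imply "
                f"that {character} was present in the scene.")
    return ""  # past-presence: no hint needed, the agent answers directly

# Example with trivial stub experts labeling the question as future.
hint = narrative_experts_hint(
    "Who is your wife?", "Harry Potter",
    temporal_expert=lambda q: "future",
    spatial_expert=lambda q: "past-presence",
)
```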

Results of spatiotemporal consistency on 600 sampled data instances.

Our Narrative-Experts and Narrative-Experts-RAG-cutoff methods significantly enhance overall performance. Specifically, they improve performance in "future", "past-absence", and "past-only" types, thanks to the temporal and spatial experts. However, they slightly lag in the "past-presence" type due to occasional mispredictions by the narrative experts.


In summary, our findings highlight an important and counterintuitive issue: although LLMs are known to memorize extensive knowledge from books and can precisely answer questions about narratives, they struggle to maintain spatiotemporal consistency as point-in-time role-playing agents. These findings indicate that point-in-time character hallucination remains a challenge, emphasizing the need for ongoing improvements.

BibTeX

@inproceedings{ahn2024timechara,
  title={TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models},
  author={Jaewoo Ahn and Taehyun Lee and Junyoung Lim and Jin-Hwa Kim and Sangdoo Yun and Hwaran Lee and Gunhee Kim},
  booktitle={Findings of ACL},
  year={2024}
}