Pei Zhou
firstname.lastname at microsoft dot com
I am a Senior Applied Scientist at the Microsoft Office of Applied Research, where I drive research on improving Copilot's reasoning over complex intent through learning from interaction, synthetic data generation, and UX innovation. I received my Ph.D. in Computer Science from the University of Southern California, where I worked in the USC-NLP Group. My research interests are large language model (LLM) reasoning, communicating agents, and human-AI symbiosis.
Previously, I received my undergraduate degree in Mathematics of Computation from the University of California, Los Angeles (UCLA). I've also done research at Google DeepMind (Gemini), the Allen Institute for Artificial Intelligence (AI2), and Amazon Alexa AI. My work has been featured in The Register, Science Daily, VentureBeat, Tech Xplore, and elsewhere.
We are on the lookout for exceptional applied science talent! If you are passionate about improving LLM systems through interaction and excited about applied research impact, feel free to drop me a line :)
[Full CV] [Microsoft Profile] [Twitter] [LinkedIn] [Github] [Google Scholar] [Semantic Scholar]
Recent News
- [October 2024] Attending COLM at Penn, and we are hiring interns and FTEs, come say hi!
- [December 2023] Attending NeurIPS in New Orleans, come say hi!
- [September 2023] Started my collaboration with Google DeepMind working on LLM meta-task reasoning!
- [July 2023] Attending ACL in Toronto, come say hi! I'll be presenting our paper on a Dungeon Master-like dialogue agent in D&D with theory-of-mind and RL!
- [May 2023] Started my internship at Google Bard working on theory-of-mind capabilities in LLMs!
- [April 2023] Our Theory-of-Mind workshop has been accepted at ICML 2023! Hope to see you in Honolulu in July!
Education
Aug 2019 - Apr 2024
Ph.D. in Computer Science, University of Southern California
Sep 2015 - Jun 2019
B.S. in Mathematics of Computation with Minor in Statistics, University of California, Los Angeles (UCLA)
Selected Publications
(For the full list, see Google Scholar)
-
SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures
Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng.
NeurIPS, 2024.
[abstract] [media coverage 1] [media coverage 2] [media coverage 3]We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2’s performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x less inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.
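As a rough illustration of the self-discovery process, here is a minimal Python sketch of the select-adapt-implement pipeline described above; the `llm` callable, module wordings, and prompt texts are hypothetical stand-ins, not the paper's actual meta-prompts.

```python
# Minimal sketch of the SELF-DISCOVER pipeline described above. `llm` stands in
# for any text-completion call; module wordings and prompts are assumptions.

REASONING_MODULES = [
    "How could I devise an experiment to help solve the problem?",
    "Let's think step by step.",
    "Critical thinking: analyze the problem from different perspectives.",
    "How can I break this problem into smaller, more manageable parts?",
]

def self_discover(llm, task_examples: list[str]) -> str:
    """Stage 1 (once per task): select, adapt, and implement a reasoning structure."""
    selected = llm(
        "Select the reasoning modules most useful for these tasks:\n"
        + "\n".join(f"- {m}" for m in REASONING_MODULES)
        + "\n\nExample tasks:\n" + "\n".join(task_examples)
    )
    adapted = llm(f"Rephrase the selected modules so they fit these tasks:\n{selected}")
    return llm(
        "Operationalize the adapted modules into a step-by-step reasoning "
        f"structure in JSON, one key per step:\n{adapted}"
    )

def solve(llm, structure: str, task_instance: str) -> str:
    """Stage 2 (per instance): follow the discovered structure during decoding."""
    return llm(
        "Follow this reasoning structure step by step, filling in each value, "
        f"to solve the task:\n{structure}\n\nTask: {task_instance}"
    )
```

Since the structure is composed once per task and then reused across instances, most of the extra inference cost is paid up front, which is consistent with the compute savings over self-consistency-style sampling reported above.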
-
How FaR Are Large Language Models From Agents with Theory-of-Mind?
Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, and Manaal Faruqui.
Preprint, 2023.
[abstract]Thinking is for Doing. Humans can infer other people’s mental states from observations, an ability called Theory-of-Mind (ToM), and subsequently act pragmatically on those inferences. Existing question answering benchmarks such as ToMi ask models questions to make inferences about beliefs of characters in a story, but do not test whether models can then use these inferences to guide their actions. We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D), which requires models to connect inferences about others’ mental states to actions in social scenarios. Experiments on T4D demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking characters’ beliefs in stories, but they struggle to translate this capability into strategic action. Our analysis reveals that the core challenge for LLMs lies in identifying the implicit inferences about mental states that, unlike in ToMi, are not asked about explicitly, yet are needed to choose the correct action in T4D. To bridge this gap, we introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions. FaR boosts GPT-4’s performance from 50% to 71% on T4D, outperforming other prompting methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to diverse out-of-distribution story structures and scenarios that also require ToM inferences to choose an action, consistently outperforming other methods including few-shot in-context learning.
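For the curious, here is a sketch of what a FaR-style zero-shot prompt could look like, reconstructed only from the two components named above (foresee future challenges, then reflect on candidate actions); the exact template wording is an assumption, not the paper's prompt.

```python
# Hypothetical FaR-style zero-shot prompt, reconstructed from the "Foresee and
# Reflect" description above; the exact wording is an assumption.

FAR_TEMPLATE = """{story}

Question: {question}

Foresee: For each character, infer their likely mental states from the story,
and predict the challenges they may face in the near future.

Reflect: Given those anticipated challenges, reason about how each candidate
action would address them, then choose the best one.

Candidate actions:
{actions}

Answer:"""

def far_prompt(story: str, question: str, actions: list[str]) -> str:
    """Fill the FaR reasoning structure for one T4D-style scenario."""
    labeled = "\n".join(f"({chr(65 + i)}) {a}" for i, a in enumerate(actions))
    return FAR_TEMPLATE.format(story=story, question=question, actions=labeled)
```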
-
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, and Yejin Choi.
In EMNLP, 2023. Outstanding Paper Award
[abstract]Data scarcity has been a long-standing issue in the field of open-domain social dialogue. To quench this thirst, we present SODA: the first publicly available, million-scale high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets. Using SODA, we train COSMO: a generalizable conversation model that is significantly more natural and consistent on unseen datasets than the best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna). Experiments reveal COSMO is sometimes even preferred to the original human-written gold responses. Additionally, our results shed light on the distinction between knowledge-enriched conversations and natural social chitchats. We make our data, models, and code public.
-
I Cast Detect Thoughts: Learning to Converse and Guide with Intents and Theory-of-Mind in Dungeons and Dragons
Pei Zhou, Andrew Zhu, Jennifer Hu, Jay Pujara, Xiang Ren, Chris Callison-Burch, Yejin Choi, and Prithviraj Ammanabrolu.
In ACL, 2023.
[abstract] [media coverage]We propose a novel task, G4C, to study teacher-student natural language interactions in a goal-driven and grounded environment. Dungeons and Dragons (D&D), a role-playing game, provides an ideal setting to investigate such interactions. Here, the Dungeon Master (DM), i.e., the teacher, guides the actions of several players—students, each with their own personas and abilities—to achieve shared goals grounded in a fantasy world. Our approach is to decompose and model these interactions into (1) the DM’s intent to guide players towards a given goal; (2) the DM’s guidance utterance to the players expressing this intent; and (3) a theory-of-mind (ToM) model that anticipates the players’ reaction to the guidance one turn into the future. We develop a novel reinforcement learning (RL) method for training a DM that generates guidance for players by rewarding utterances where the intent matches the ToM-anticipated player actions. Human and automated evaluations show that a DM trained to explicitly model intents and incorporate ToM of the players using RL generates better-quality guidance that is 3x more likely to fulfill the DM’s intent than a vanilla natural language generation (NLG) approach.
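Below is a toy sketch of the reward signal described above: a guidance utterance is rewarded when a theory-of-mind model anticipates that the players' next action matches the DM's intent. All names here (`dm_model`, `tom_model`, etc.) are hypothetical stand-ins, not the paper's code.

```python
# Toy sketch of the RL reward described above: an utterance is rewarded when a
# theory-of-mind (ToM) model predicts the players will act as the DM intended.
# All names are hypothetical stand-ins, not the paper's implementation.

def guidance_reward(dm_model, tom_model, context: str, intended_action: str):
    """Generate one guidance utterance and score it against the DM's intent."""
    utterance = dm_model.generate(context, intended_action)  # DM speaks
    predicted = tom_model.predict(context, utterance)        # anticipate players
    reward = 1.0 if predicted == intended_action else 0.0    # intent fulfilled?
    return utterance, reward
```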
-
Reflect, Not Reflex: Inference-Based Common Ground Improves Dialogue Response Quality
Pei Zhou, Hyundong Cho, Pegah Jandaghi, Dong-Ho Lee, Bill Yuchen Lin, Jay Pujara, and Xiang Ren.
In EMNLP, 2022.
[abstract] [project page] [dataset] [media coverage]Human communication relies on common ground (CG), the mutual knowledge and beliefs shared by participants, to produce coherent and interesting conversations. In this paper, we demonstrate that current response generation (RG) models produce generic and dull responses in dialogues because they act reflexively, failing to explicitly model CG, due both to the lack of CG in training data and to the standard RG training procedure. We introduce Reflect, a dataset that annotates dialogues with explicit CG (materialized as inferences approximating shared knowledge and beliefs) and solicits 9k diverse human-generated responses, each following one common ground. Using Reflect, we showcase the limitations of current dialogue data and RG models: less than half of the responses in current data are rated as high quality (sensible, specific, and interesting), responses from models trained on this data are of even lower quality, while most Reflect responses are judged high quality. Next, we analyze whether CG can help models produce better quality responses by using Reflect CG to guide RG models. Surprisingly, we find that simply prompting GPT-3 to “think” about CG generates 30% more high-quality responses, showing the promise of integrating CG into the RG process.
-
Think Before You Speak: Explicitly Generating Implicit Commonsense Knowledge for Response Generation
Pei Zhou, Karthik Gopalakrishnan, Behnam Hedayatnia, Seokhwan Kim, Jay Pujara, Xiang Ren, Yang Liu, and Dilek Hakkani-Tur.
In ACL, 2022.
[abstract] [media coverage]Implicit knowledge, such as common sense, is key to fluid human conversations. Current neural response generation (RG) models are trained to generate responses directly, omitting unstated implicit knowledge. In this paper, we present Think-Before-Speaking (TBS), a generative approach to first externalize implicit commonsense knowledge (think) and use this knowledge to generate responses (speak). We expect that externalizing implicit knowledge allows more efficient learning, produces more informative responses, and enables more explainable models. We analyze different choices to collect knowledge-aligned dialogues, represent implicit knowledge, and transition between knowledge and dialogues. Empirical results show TBS models outperform end-to-end and knowledge-augmented RG baselines on most automatic metrics and generate more informative, specific, and commonsense-following responses, as evaluated by human annotators. TBS also generates knowledge that makes sense and is relevant to the dialogue around 85% of the time.
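A minimal illustration of the think-then-speak decoding order described above, assuming a single generative model and made-up delimiter tokens; the paper studies several ways to represent knowledge and transition between knowledge and dialogue, so treat this as one possible instantiation.

```python
# Illustrative sketch of TBS decoding: first verbalize implicit commonsense
# knowledge ("think"), then condition on it to respond ("speak"). The delimiter
# strings are assumptions; the paper compares several variants.

def tbs_respond(model, dialogue_history: str):
    """Generate knowledge first, then a knowledge-conditioned response."""
    knowledge = model.generate(dialogue_history + " [think]")
    response = model.generate(f"{dialogue_history} [knowledge] {knowledge} [speak]")
    return knowledge, response
```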
-
Probing Commonsense Explanation in Dialogue Response Generation
Pei Zhou, Pegah Jandaghi, Hyundong Cho, Bill Yuchen Lin, Jay Pujara, and Xiang Ren.
In EMNLP-Findings, 2021.
[abstract]Humans use commonsense reasoning (CSR) implicitly to produce natural and coherent responses in conversations. Aiming to close the gap between current response generation (RG) models and human communication abilities, we want to understand why RG models respond as they do by probing RG models’ understanding of the commonsense reasoning that elicits proper responses. We formalize the problem by framing commonsense as a latent variable in the RG task and using explanations for responses as a textual form of commonsense. We collect 6k annotated explanations justifying responses from four dialogue datasets, ask humans to verify them, and propose two probing settings to evaluate RG models’ CSR capabilities. Probing results show that models fail to capture the logical relations between commonsense explanations and responses, and that fine-tuning on in-domain data and increasing model size do not lead to an understanding of CSR for RG. We hope our study motivates more research in making RG models emulate the human reasoning process in pursuit of smooth human-AI communication.
-
Lawyers are Dishonest? Quantifying Representational Harms in Commonsense Knowledge Resources
Ninareh Mehrabi*, Pei Zhou* (equal contribution), Fred Morstatter, Jay Pujara, Xiang Ren, and Aram Galstyan.
In EMNLP, 2021.
[abstract]Numerous natural language processing models have tried injecting commonsense by using the ConceptNet knowledge base to improve performance on different tasks. ConceptNet, however, is mostly crowdsourced from humans and may reflect human biases such as "lawyers are dishonest." It is important that these biases are not conflated with the notion of commonsense. We study this missing yet important problem by first defining and quantifying biases in ConceptNet as two types of representational harms: overgeneralization of polarized perceptions and representation disparity. We find that ConceptNet contains severe biases and disparities across four demographic categories. In addition, we analyze two downstream models that use ConceptNet as a source for commonsense knowledge and find the existence of biases in those models as well. We further propose a filter-based bias-mitigation approach and examine its effectiveness. We show that our mitigation approach can reduce the issues in both the resource and the models but leads to a performance drop, leaving room for future work to build fairer and stronger commonsense models.
-
RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms
Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara, and Xiang Ren.
In EMNLP, 2021.
[abstract] [project page] [data]Pre-trained language models (PTLMs) have achieved impressive performance on commonsense inference benchmarks, but their ability to employ commonsense to make robust inferences, which is crucial for effective communication with humans, is debated. In the pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA: Robust Inference using Commonsense Axioms, that evaluates robust commonsense inference despite textual perturbations. To generate data for this challenge, we develop a systematic and scalable procedure using commonsense knowledge bases and probe PTLMs across two different evaluation settings. Extensive experiments on our generated probe sets with more than 10k statements show that PTLMs perform no better than random guessing in the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks. We also find that fine-tuning on similar statements offers limited gains, as PTLMs still fail to generalize to unseen inferences. Our new large-scale benchmark exposes a significant gap between PTLMs and human-level language understanding and offers a new challenge for PTLMs to demonstrate commonsense.
-
Commonsense-Focused Dialogues for Response Generation: An Empirical Study
Pei Zhou, Karthik Gopalakrishnan, Behnam Hedayatnia, Seokhwan Kim, Jay Pujara, Xiang Ren, Yang Liu, and Dilek Hakkani-Tur.
In SIGDIAL, 2021.
[abstract] [data] [Amazon blog]Smooth and effective communication requires the ability to perform latent or explicit commonsense inference. Prior commonsense reasoning benchmarks (such as SocialIQA and CommonsenseQA) mainly focus on the discriminative task of choosing the right answer from a set of candidates, and do not involve interactive language generation as in dialogue. Moreover, existing dialogue datasets do not explicitly focus on exhibiting commonsense as a facet. In this paper, we present an empirical study of commonsense in dialogue response generation. We first auto-extract commonsensical dialogues from existing dialogue datasets by leveraging ConceptNet, a commonsense knowledge graph. Furthermore, building on social contexts/situations in SocialIQA, we collect a new dialogue dataset with 25K dialogues aimed at exhibiting social commonsense in an interactive setting. We evaluate response generation models trained using these datasets and find that models trained on both extracted and our collected data produce responses that consistently exhibit more commonsense than baselines. Finally, we propose an approach for automatic evaluation of commonsense that relies on features derived from ConceptNet and pretrained language and dialog models, and show reasonable correlation with human evaluation of responses’ commonsense quality. We are releasing a subset of our collected data, CommonsenseDialogues, containing about 11K dialogs.
-
CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren.
In EMNLP-Findings, 2020.
[abstract] [project page] [data] [media coverage]Recently, large-scale pretrained language models have demonstrated impressive performance on several commonsense-reasoning benchmark datasets. However, building machines with commonsense to compose realistically plausible sentences remains challenging. In this paper, we present a constrained text generation task, COMMONGEN, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts (e.g., {dog, frisbee, catch, throw}), the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., “a man throws a frisbee and his dog catches it”). The COMMONGEN task is challenging because it inherently requires 1) relational reasoning with background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. Our dataset, constructed through a combination of crowdsourced and existing caption corpora, consists of 79k commonsense descriptions over 35k unique concept-sets. Experiments show that there is a large gap between state-of-the-art text generation models (e.g., T5) and human performance. Furthermore, we demonstrate that the learned generative commonsense reasoning capability can be transferred to improve downstream tasks by generating additional context.
-
Examining Gender Bias in Languages with Grammatical Gender
Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang.
In EMNLP-IJCNLP, 2019.
[abstract] [code]Recent studies have shown that word embeddings exhibit gender bias inherited from the training corpora. However, most studies to date have focused on quantifying and mitigating such bias only in English. These analyses cannot be directly extended to languages that exhibit morphological agreement on gender, such as Spanish and French. In this paper, we propose new metrics for evaluating gender bias in word embeddings of these languages and further demonstrate evidence of gender bias in bilingual embeddings which align these languages with English. Finally, we extend an existing approach to mitigate gender bias in word embeddings under both monolingual and bilingual settings. Experiments on a modified Word Embedding Association Test, word similarity, word translation, and word pair translation tasks show that the proposed approaches effectively reduce the gender bias while preserving the utility of the embeddings.
Internships
- Research Intern@Google Bard, Mountain View, CA, May-Aug 2023
- Research Intern@AI2-Mosaic, Seattle, WA, May-Dec 2022
- Applied Scientist Intern@Amazon Alexa AI, Remote, May-Aug 2021
- Applied Scientist Intern@Amazon Alexa AI, Remote, May-Aug 2020
Miscellany
- I play the piano and am a keyboardist in a UCLA acoustic guitar band called Parked in 4 East
- I was born in Chengdu, a great city for vacation and spicy food lovers :)
- Currently super into camping/glamping!
- I'm into all kinds of RPGs, from tabletop to FFXIV
- Thanks to Nelson Liu for sharing the source code for this website!