I am a fifth-year CS Ph.D. student at the University of Maryland’s Computational Linguistics and Information Processing (CLIP) lab, advised by Wei Ai. My research interests are Natural Language Processing (NLP) and Computational Social Science. I am interested in uncovering patterns from data and investigating how these patterns correlate with human behaviors, particularly in education and social media contexts. I also study how such patterns influence the behaviors of language models. At UMD, I also work closely with Jing Liu, Louiqa Raschid, Vanessa Frias-Martinez, and Furong Huang.
In the past, I have interned at Adobe Research. Previously, I graduated from Johns Hopkins University with an M.S.E. in Computer Science, where I worked with Mark Dredze at the Center for Language and Speech Processing. I obtained my B.E. in Computer Science from Southwest University, where I worked with Yong Deng on complex networks and with Tao Zhou on human mobility.
I started my internship at Adobe Research, working on personalization and style understanding for designers.
Mar 2025
Our survey on Large Language Models and Causal Inference has been accepted to the Findings of NAACL 2025.
May 2024
Our paper on concept-level spurious correlations for text classification, with Yuhang Zhou, has been accepted to ACL 2024.
Mar 2024
Our paper The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education has been accepted to NAACL 2024. See you in Mexico City!
Jan 2024
Our paper Twitter social mobility data reveal demographic variations in social distancing practices during the COVID-19 pandemic is now available in Scientific Reports!
Social media platforms like Twitter (now X) have been pivotal in information dissemination and public engagement, especially during COVID-19. A key goal for public health experts was to encourage prosocial behavior that could impact local outcomes such as masking and social distancing. Given the importance of local news and guidance during COVID-19, the objective of our research is to analyze the effect of localized engagement on social media conversations. This study examines the impact of geographic co-location, as a proxy for localized engagement between public health experts (PHEs) and the public, on social media. We analyze a Twitter conversation dataset from January 2020 to November 2021, comprising over 19K tweets from nearly five hundred PHEs, along with approximately 800K replies from 350K participants. Our findings reveal that geo-co-location is associated with higher engagement rates, especially in conversations on topics including masking, lockdowns, and education, and in conversations with academic and medical professionals. Lexical features associated with emotion and personal experiences were more common in geo-co-located contexts. This research provides insights into how geographic co-location influences social media engagement and can inform strategies to improve public health messaging.
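For readers curious about how such a comparison can be run, below is a minimal sketch in Python, not the paper’s actual pipeline: the DataFrame layout, the column names, and the state-level co-location criterion are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the PHE conversation dataset; all column names are hypothetical.
conversations = pd.DataFrame({
    "phe_state":         ["MD", "MD", "CA", "NY"],
    "participant_state": ["MD", "VA", "CA", "CA"],
    "n_replies":         [12, 3, 8, 2],
})

# Treat a conversation as geo-co-located when the PHE and the participant share
# a state; the paper's actual co-location criterion may differ.
conversations["co_located"] = (
    conversations["phe_state"] == conversations["participant_state"]
)

# Compare mean engagement (replies per conversation) across the two groups.
print(conversations.groupby("co_located")["n_replies"].mean())
```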
The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education
Paiheng Xu, Jing Liu, Nathan Jones, Julie Cohen, and Wei Ai
Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers’ expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis: noisy and long input data, and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performance comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers’ utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.
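As a rough illustration of the PLM-based rating setup, here is a minimal sketch using Hugging Face Transformers; the checkpoint, the maximum input length, the regression head, and the teacher-only transcript are assumptions rather than the paper’s exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # hypothetical choice of PLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"
)
model.eval()

# Teacher-only utterances, reflecting the finding that they alone can yield
# strong results for student-centered variables.
transcript = "Teacher: Who can explain their strategy? Take a moment to think."

# Long classroom transcripts exceed the PLM context window, so truncate here;
# the paper may handle long, noisy input differently.
inputs = tokenizer(transcript, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.item()  # untrained head: illustrative only
print(f"Predicted instruction-quality score: {score:.2f}")
```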
Explore Spurious Correlations at the Concept Level in Language Models for Text Classification
Yuhang Zhou, Paiheng Xu, Xiaoyu Liu, Bang An, Wei Ai, and Furong Huang
Language models (LMs) have achieved notable success in numerous NLP tasks, employing both fine-tuning and in-context learning (ICL) methods. While language models demonstrate exceptional performance, they face robustness challenges due to spurious correlations arising from imbalanced label distributions in training data or ICL exemplars. Previous research has primarily concentrated on word, phrase, and syntax features, neglecting the concept level, often due to the absence of concept labels and difficulty in identifying conceptual content in input texts. This paper introduces two main contributions. First, we employ ChatGPT to assign concept labels to texts, assessing concept bias in models during fine-tuning or ICL on test data. We find that LMs, when encountering spurious correlations between a concept and a label in training or prompts, resort to shortcuts for predictions. Second, we introduce a data rebalancing technique that incorporates ChatGPT-generated counterfactual data, thereby balancing label distribution and mitigating spurious correlations. Our method’s efficacy is validated through extensive testing and surpasses traditional token-removal approaches.
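To make the rebalancing idea concrete, here is a minimal sketch under stated assumptions: the `label_concept` and `generate_counterfactual` functions are hypothetical stand-ins for ChatGPT calls, and the toy data and balancing loop are illustrative, not the paper’s implementation.

```python
from collections import Counter

def label_concept(text: str) -> str:
    """Stand-in for a ChatGPT call that returns a concept label for `text`."""
    return "food" if "pizza" in text else "service"  # toy heuristic

def generate_counterfactual(text: str, label: int) -> tuple[str, int]:
    """Stand-in for a ChatGPT-generated counterfactual with a flipped label."""
    return f"[counterfactual] {text}", 1 - label

dataset = [("The pizza was amazing", 1), ("The pizza was cold", 0),
           ("Staff were rude", 0)]

# Count how often each (concept, label) pair occurs; a skewed count signals a
# spurious concept-label correlation the model could exploit as a shortcut.
pairs = Counter((label_concept(t), y) for t, y in dataset)
target = max(pairs.values())

# Add counterfactuals for under-represented (concept, label) pairs until the
# joint distribution is balanced.
augmented = list(dataset)
for t, y in dataset:
    c = label_concept(t)
    while pairs[(c, 1 - y)] < target:
        augmented.append(generate_counterfactual(t, y))
        pairs[(c, 1 - y)] += 1

print(pairs)  # balanced (concept, label) counts after augmentation
```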