I study sociotechnical systems (especially LLMs) and how they shape social interactions. My work weaves together three threads:
Understanding models for safety using mechanistic interpretability and machine unlearning to reveal what they encode, diagnose failure modes, and remove or modify undesirable behavior;
Modeling social interaction in human–human and human–AI settings using NLP and causal inference to explain linguistic phenomena and estimate causal effects;
Improving outcomes by designing systems that translate these insights into practice and evaluating them through human-centric studies.
My goal with this “explain → model → intervene” loop is to build principled, deployable methods that help us understand and improve real-world social interactions.
If you’re an undergraduate interested in research experience, feel free to reach out: agamg2@illinois.edu. A strong background in ML/NLP and experience with PyTorch are highly recommended.
SLM-Mod: Small Language Models Surpass LLMs at Content Moderation
Xianyang Zhan*, Agam Goyal*, Yilun Chen, Eshwar Chandrasekharan, and Koustuv Saha
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025
Large language models (LLMs) have shown promise in many natural language understanding tasks, including content moderation. However, these models can be expensive to query in real time and do not allow for a community-specific approach to content moderation. To address these challenges, we explore the use of open-source small language models (SLMs) for community-specific content moderation tasks. We fine-tune and evaluate SLMs (fewer than 15B parameters) by comparing their performance against much larger open- and closed-source models in both zero-shot and few-shot settings. Using 150K comments from 15 popular Reddit communities, we find that SLMs outperform zero-shot LLMs at content moderation, with 11.5% higher accuracy and 25.7% higher recall on average across all communities. Moreover, few-shot in-context learning yields only a marginal improvement for LLMs, which still fall short of SLMs. We further show the promise of cross-community content moderation, which has implications for new communities and the development of cross-platform moderation techniques. Finally, we outline directions for future work on language-model-based content moderation.
arXiv
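As a concrete illustration of the setup sketched in this abstract, the snippet below fine-tunes a small off-the-shelf classifier to predict whether a comment should be removed. It is a minimal sketch under assumptions: distilbert-base-uncased stands in for the sub-15B-parameter SLMs studied in the paper, and the toy comments and labels stand in for the 150K Reddit comments.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Minimal sketch (not the paper's setup): a tiny encoder stands in for an SLM,
# fine-tuned to predict a community-specific "remove" label for comments.
model_name = "distilbert-base-uncased"  # assumption: placeholder for a <15B SLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

comments = ["This is a thoughtful, on-topic reply.", "Get lost, nobody wants you here."]
labels = torch.tensor([0, 1])  # 0 = keep, 1 = remove (toy labels)

batch = tok(comments, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few toy gradient steps on the same batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: estimated probability that a new comment violates community norms.
model.eval()
with torch.no_grad():
    probe = tok(["You people are all idiots."], return_tensors="pt")
    p_remove = torch.softmax(model(**probe).logits, dim=-1)[0, 1].item()
print(f"P(remove) = {p_remove:.2f}")
```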
The Language of Approval: Identifying the Drivers of Positive Feedback Online
Agam Goyal, Charlotte Lambert, and Eshwar Chandrasekharan
Positive feedback via likes and awards is central to online governance, yet it remains unclear which attributes of users’ posts elicit rewards and how these vary across authors and communities. To examine this, we combine quasi-experimental causal inference with predictive modeling on 11M posts from 100 subreddits. We identify linguistic patterns and stylistic attributes causally linked to rewards, controlling for author reputation, timing, and community context; for example, overtly complicated language, tentative style, and toxicity reduce rewards. We use our set of curated features to train models that detect highly upvoted posts with high AUC. Our audit of community guidelines highlights a “policy-practice gap”: most rules focus primarily on civility and formatting requirements, with little emphasis on the attributes we identify as driving positive feedback. These results inform the design of community guidelines, of support interfaces that teach users how to craft desirable contributions, and of moderation workflows that emphasize positive reinforcement over purely punitive enforcement.
EMNLP, ICML’W
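To make “causally linked to rewards, controlling for author reputation, timing, and community context” concrete, here is a regression-adjustment sketch on synthetic data. It is not the paper’s quasi-experimental design; the column names, simulated outcome, and model specification are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000

# Hypothetical synthetic data standing in for Reddit posts; all columns are illustrative.
posts = pd.DataFrame({
    "tentative":    rng.uniform(0, 1, n),          # share of hedging words
    "toxicity":     rng.uniform(0, 1, n),          # toxicity classifier score
    "author_karma": rng.lognormal(6, 1.5, n),      # reputation proxy
    "hour":         rng.integers(0, 24, n),        # posting hour
    "subreddit":    rng.choice(["science", "askreddit", "gaming"], n),
})
# Simulated outcome: tentative style and toxicity depress rewards, reputation helps.
posts["score"] = np.round(np.exp(
    2.0 - 0.8 * posts["tentative"] - 1.2 * posts["toxicity"]
    + 0.2 * np.log1p(posts["author_karma"]) + rng.normal(0, 0.5, n)
)).astype(int)

# Regression adjustment: association of tentative style with (log) score,
# controlling for toxicity, author reputation, timing, and community.
fit = smf.ols(
    "np.log1p(score) ~ tentative + toxicity + np.log1p(author_karma) + C(hour) + C(subreddit)",
    data=posts,
).fit()
print(fit.params["tentative"], fit.conf_int().loc["tentative"].tolist())
```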
Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
Agam Goyal, Vedant Rathi, William Yeh, Yian Wang, Yuen Chen, and Hari Sundaram
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025
Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper, we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness. Crucially, standard NLP benchmark scores remain stable under steering, indicating that the model’s knowledge and general abilities are preserved. We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning. Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, and suggest practical guidelines for safer language-model deployment.
EMNLP Oral
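The core intervention described in this abstract, steering the residual stream along an SAE decoder direction, can be sketched with a PyTorch forward hook. Assumptions: the direction below is random rather than a learned SAE feature, and the layer index and strength are arbitrary, so this shows only the mechanics of the intervention, not the paper’s actual vectors or aggressiveness tiers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Assumption: a unit-norm "toxicity" direction. In the paper this would be the
# decoder vector of a toxicity-related SAE feature; here it is random for illustration.
d_model = model.config.n_embd
toxicity_dir = torch.randn(d_model)
toxicity_dir = toxicity_dir / toxicity_dir.norm()

ALPHA = 8.0   # steering strength (arbitrary; the paper studies tiers of aggressiveness)
LAYER = 6     # which block's residual stream to steer (arbitrary choice)

def steer_hook(module, inputs, output):
    # The block's first output is the residual-stream hidden state [batch, seq, d_model];
    # subtract the scaled toxicity direction to steer generations away from it.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden - ALPHA * toxicity_dir.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)

ids = tok("The internet argument escalated and he said", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unsteered model
```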
MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches to moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators (Allocate, Predict, Aggregate, Explain) and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.
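A toy sketch of the four operators named above (Allocate, Predict, Aggregate, Explain) follows. The stub experts, routing rule, and averaging step are placeholder assumptions standing in for MoMoE’s fine-tuned language-model experts, not the framework’s actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Expert:
    """Stub expert: in MoMoE this would be a fine-tuned language model."""
    name: str
    score: Callable[[str], float]  # probability that a comment violates norms

def allocate(comment: str, experts: List[Expert], k: int = 3) -> List[Expert]:
    # Allocate: choose which experts handle this comment (stubbed as the first k).
    return experts[:k]

def predict(comment: str, chosen: List[Expert]) -> List[float]:
    # Predict: each chosen expert scores the comment independently.
    return [e.score(comment) for e in chosen]

def aggregate(scores: List[float], threshold: float = 0.5) -> bool:
    # Aggregate: combine expert scores (here, a simple mean) into a flag decision.
    return sum(scores) / len(scores) >= threshold

def explain(chosen: List[Expert], scores: List[float]) -> str:
    # Explain: produce a post-hoc rationale (here, just a per-expert score summary).
    return "Flagged; expert scores: " + ", ".join(
        f"{e.name}={s:.2f}" for e, s in zip(chosen, scores)
    )

# Toy usage with two stub experts.
experts = [Expert("r/science", lambda c: 0.8), Expert("r/AskReddit", lambda c: 0.6)]
comment = "an example comment"
chosen = allocate(comment, experts, k=2)
scores = predict(comment, chosen)
if aggregate(scores):
    print(explain(chosen, scores))
```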
If you’d like to get in touch, please reach out via email.