Alexander Miserlis Hoyle
I'm currently a postdoctoral fellow at the ETH Zürich AI Center, where I sit in both natural language processing/machine learning and social science groups. I am on the academic and industry job market for the 2025-2026 cycle! If you think a position would be a good fit, please do reach out.
I received my PhD in Computer Science from the University of Maryland, where I was based in the Computational Linguistics and Information Processing lab. My wonderful advisor was Philip Resnik.
My research is oriented around the development and evaluation of methods for computational social science. My basic contention is that the field of NLP is well served by grounding it in the needs of social science—giving direction to concepts like interpretability and generalization. On the methods side, I'm interested in identifying latent constructs (perhaps call it "very abstractive summarization"): topic models, ideal point models, NLG for annotation. In terms of evaluation, I like to think about validity: are we really measuring what we want to measure? Topically, I'm drawn to work on bias & fairness and political science; more recently, I've started work on constructs in mental health, specifically those relating to suicidality. If this sounds at all interesting to you and you'd like to work together: please reach out!
During the PhD, I interned with both the FATE group at Microsoft Research and AllenNLP at AI2. Before the PhD, I completed my master's in computational statistics and machine learning at University College London, where my thesis advisors were Sebastian Riedel and Jeff Mitchell at the UCL NLP group.
After undergrad, I was a Research Analyst at The Brattle Group in Cambridge, Massachusetts, where I built econometric models, developed a document retrieval platform, and did additional work that could be classified as data science. In my most interesting project, I helped conduct research on New York City public housing for the U.S. Department of Justice; these efforts eventually led to a $2.2 billion settlement to improve conditions.
Since I'm often asked: I prefer to be called Alexander, not Alex (although I won't bite your head off if you assume the latter or forget). You can reach me at [firstname].[lastname]@ai.ethz.ch.
Publications
2025
-
Measuring scalar constructs in social science with LLMs
Hauke Licht*, Rupak Sarkar*, Patrick Y. Wu, Pranav Goel, Niklas Stoehr, Elliott Ash and Alexander Hoyle*. In EMNLP. 2025
-
The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure
Yu Fan, Yang Tian, Shauli Ravfogel, Mrinmaya Sachan, Elliott Ash and Alexander Hoyle. In EMNLP. 2025
-
How Persuasive Is My Context?
Tu Nguyen, Kevin Du, Alexander Hoyle and Ryan Cotterell. In EMNLP. 2025
-
ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
Alexander Hoyle*, Lorena Calvo-Bartolomé*, Jordan Lee Boyd-Graber and Philip Resnik. In ACL. 2025
[Link]
Abstract
Topic models and document-clustering evaluations either use automated metrics that align poorly with human preferences, or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators (or an LLM-based proxy) review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxy is statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations.
-
Large Language Models Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models
Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Daniel Kofi Stephens, Juan Francisco Fung, Alden Dima and Jordan Lee Boyd-Graber. In ACL. 2025
[Link]
Abstract
A common use of NLP is to facilitate the understanding of large document collections, with models based on Large Language Models (LLMs) replacing probabilistic topic models. Yet the effectiveness of LLM-based approaches in real-world applications remains underexplored. This study measures the knowledge users acquire with topic models (including traditional, unsupervised, and supervised LLM-based approaches) on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to LLM-based topic models improves data exploration by addressing hallucination and genericity but requires more human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. This paper provides best practices (there is no one right model; the choice of models is situation-specific) and suggests potential improvements for scalable LLM-based topic models.
-
Express Yourself (Ideologically): Legislators’ Ideal Points Across Audiences
SoRelle Gaynor, Kristina Miler, Pranav Goel, Alexander M. Hoyle and Philip Resnik. The Journal of Politics. 2025
-
PairScale: Analyzing Attitude Change with Pairwise Comparisons
Rupak Sarkar, Patrick Y. Wu, Kristina Miler, Alexander Hoyle and Philip Resnik. In Findings of NAACL. 2025
[Link]
Abstract
We introduce a text-based framework for measuring attitudes in communities toward issues of interest, going beyond the pro/con/neutral of conventional stance detection to characterize attitudes on a continuous scale using both implicit and explicit evidence in language. The framework exploits LLMs both to extract attitude-related evidence and to perform pairwise comparisons that yield unidimensional attitude scores via the classic Bradley-Terry model. We validate the LLM-based steps using human judgments, and illustrate the utility of the approach for social science by examining the evolution of attitudes on two high-profile issues in U.S. politics in two political communities on Reddit over the period spanning from the 2016 presidential campaign to the 2022 mid-term elections. WARNING: Potentially sensitive political content.
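A note for the curious reader: the Bradley-Terry step mentioned above is a classical model in which each item i has a latent strength s_i and P(i preferred over j) = s_i / (s_i + s_j). Below is a minimal, illustrative Python sketch (not the PairScale implementation; the function name and toy data are invented) of turning pairwise judgments into unidimensional scores with the standard minorization-maximization updates.

from collections import defaultdict

def bradley_terry(comparisons, n_iter=200):
    # comparisons: list of (winner, loser) pairs, e.g. from an LLM judge.
    items = {x for pair in comparisons for x in pair}
    wins = defaultdict(int)                     # total wins per item
    counts = defaultdict(int)                   # comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        counts[frozenset((winner, loser))] += 1
    scores = {i: 1.0 for i in items}            # start with equal strengths
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(counts[frozenset((i, j))] / (scores[i] + scores[j])
                        for j in items if j != i)
            updated[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(updated.values())
        scores = {i: v / total for i, v in updated.items()}  # normalize each pass
    return scores

# Toy usage: three text snippets compared pairwise for attitude strength.
scores = bradley_terry([("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")])
print(sorted(scores, key=scores.get, reverse=True))  # strongest first: a, b, c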
2024
-
Developing and Measuring Latent Constructs in Text
Alexander Hoyle. University of Maryland, PhD Thesis. 2024
-
TopicGPT: A Prompt-based Topic Modeling Framework
Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik and Mohit Iyyer. In NAACL. 2024
[Link]
Abstract
Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require "reading the tea leaves" to interpret; additionally, they offer users minimal control over the formatting and specificity of resulting topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline. Its topics are also more interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions. Moreover, the framework is highly adaptable, allowing users to specify constraints and modify topics without the need for model retraining. By streamlining access to high-quality and interpretable topics, TopicGPT represents a compelling, human-centered approach to topic modeling.
-
A SMART Mnemonic Sounds like 'Glue Tonic': Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant and Jordan Lee Boyd-Graber. In EMNLP. 2024
[Link]
Abstract
Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but it does not train models on mnemonics that students prefer and that aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: expressed (inferred from ratings) and observed (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students think is helpful does not always capture what is truly helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
-
A First Step towards Measuring Interdisciplinary Engagement in Scientific Publications: A Case Study on NLP + CSS Research
Alexandria Leto, Shamik Roy, Alexander Hoyle, Daniel Acuna and Maria Leonor Pacheco. In NLP+CSS Workshop. 2024
[Link]
Abstract
With the rise in the prevalence of cross-disciplinary research, there is a need to develop methods to characterize its practices. Current computational methods to evaluate interdisciplinary engagement (such as affiliation diversity, keywords, and citation patterns) are insufficient to model the degree of engagement between disciplines, as well as the way in which the complementary expertise of co-authors is harnessed. In this paper, we propose an automated framework to address some of these issues on a large scale. Our framework tracks interdisciplinary citations in scientific articles and models: 1) the section and position in which they appear, and 2) the argumentative role that they play in the writing. To showcase our framework, we perform a preliminary analysis of interdisciplinary engagement in published work at the intersection of natural language processing and computational social science in the last decade.
2023
-
Natural Language Decompositions of Implicit Content Enable Better Text Representations
Alexander Hoyle, Rupak Sarkar, Pranav Goel and Philip Resnik. In EMNLP. 2023
[Link]
Abstract
When people interpret text, they rely on inferences that go beyond the observed language itself. Inspired by this observation, we introduce a method for the analysis of text that takes implicitly communicated content explicitly into account. We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed, then validate the plausibility of the generated content via human judgments. Incorporating these explicit representations of implicit content proves useful in multiple problem settings that involve the human interpretation of utterances: assessing the similarity of arguments, making sense of a body of opinion data, and modeling legislative behavior. Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP and particularly its applications to social science.
-
Revisiting Automated Topic Model Evaluation with Large Language Models
Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, Mrinmaya Sachan and Elliott Ash. In EMNLP. 2023
[Link]
Abstract
Topic models help us make sense of large text collections. Automatically evaluating their output and determining the optimal number of topics are both longstanding challenges, with no effective automated solutions to date. This paper proposes using large language models (LLMs) for these tasks. We find that LLMs appropriately assess the resulting topics, correlating more strongly with human judgments than existing automated metrics. However, the setup of the evaluation task is crucial: LLMs perform better on coherence ratings of word sets than on intrusion detection. We find that LLMs can also guide us towards a reasonable number of topics. In actual applications, topic models are typically used to answer a research question related to a collection of texts. We can incorporate this research question in the prompt to the LLM, which helps estimate the optimal number of topics.
2022
-
Are Neural Topic Models Broken?
Alexander Hoyle, Rupak Sarkar, Pranav Goel and Philip Resnik. In Findings of EMNLP. 2022
[Link]
Abstract
Recently, the relationship between automated and human evaluation of topic models has been called into question. Method developers have staked the efficacy of new topic model variants on automated measures, and their failure to approximate human preferences places these models on uncertain ground. Moreover, existing evaluation paradigms are often divorced from real-world use. Motivated by content analysis as a dominant real-world use case for topic modeling, we analyze two related aspects of topic models that affect their effectiveness and trustworthiness in practice for that purpose: the stability of their estimates and the extent to which the model's discovered categories align with human-determined categories in the data. We find that neural topic models fare worse in both respects compared to an established classical method. We take a step toward addressing both issues in tandem by demonstrating that a straightforward ensembling method can reliably outperform the members of the ensemble.
2021
-
Is Automated Topic Evaluation Broken? The Incoherence of Coherence
Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber and Philip Resnik. In NeurIPS (Spotlight Presentation). 2021
[Link]
Abstract
Topic model evaluation, like evaluation of other unsupervised methods, can be contentious. However, the field has coalesced around automated estimates of topic coherence, which rely on the frequency of word co-occurrences in a reference corpus. Contemporary neural topic models surpass classical ones according to these metrics. At the same time, topic model evaluation suffers from a validation gap: automated coherence, developed for classical models, has not been validated using human experimentation for neural models. In addition, a meta-analysis of topic modeling literature reveals a substantial standardization gap in automated topic modeling benchmarks. To address the validation gap, we compare automated coherence with the two most widely accepted human judgment tasks: topic rating and word intrusion. To address the standardization gap, we systematically evaluate a dominant classical model and two state-of-the-art neural models on two commonly used datasets. Automated evaluations declare a winning model when corresponding human evaluations do not, calling into question the validity of fully automatic evaluations independent of human judgments.
-
Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?
Pedro Rodriguez, Joe Barrow, Alexander Hoyle, John P. Lalor, Robin Jia and Jordan Boyd-Graber. In ACL. 2021
[Link]
Abstract
Leaderboards are widely used in NLP and push the field forward. While leaderboards are a straightforward ranking of NLP models, this simplicity can mask nuances in evaluation items (examples) and subjects (NLP models). Rather than replace leaderboards, we advocate a re-imagining so that they better highlight if and where progress is made. Building on educational testing, we create a Bayesian leaderboard model where latent subject skill and latent item difficulty predict correct responses. Using this model, we analyze the ranking reliability of leaderboards. Afterwards, we show the model can guide what to annotate, identify annotation errors, detect overfitting, and identify informative examples. We conclude with recommendations for future benchmark tasks.
-
Promoting Graph Awareness in Linearized Graph-to-Text Generation
Alexander Hoyle, Ana Marasović and Noah A. Smith. In Findings of NAACL. 2021
[Link]
2020
-
Improving Neural Topic Models using Knowledge Distillation
Alexander Hoyle, Pranav Goel and Philip Resnik. In EMNLP. 2020
[Link] [Video]
Abstract
Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two models having disparate architectures, obtaining state-of-the-art topic coherence. We show that our adaptable framework not only improves performance in the aggregate over all estimated topics, as is commonly reported, but also in head-to-head comparisons of aligned topics.
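As background on the distillation idea mentioned above, here is a generic knowledge-distillation loss in Python: a sketch of the general technique only, not the paper's specific formulation, with illustrative names and toy values. A student model is trained to match a teacher's temperature-softened output distribution.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Toy usage: logits over a vocabulary of 10 "words" for a batch of 4 documents.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher).item())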
2019
-
Combining Sentiment Lexica with a Multi-View Variational Autoencoder
Alexander Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Ryan Cotterell and Isabelle Augenstein. In NAACL. 2019
[Link] [Slides]
Abstract
When assigning quantitative labels to a dataset, different methodologies may rely on different scales. In particular, when assigning polarities to words in a sentiment lexicon, annotators may use binary, categorical, or continuous labels. Naturally, it is of interest to unify these labels from disparate scales to both achieve maximal coverage over words and to create a single, more robust sentiment lexicon while retaining scale coherence. We introduce a generative model of sentiment lexica to combine disparate scales into a common latent representation. We realize this model with a novel multi-view variational autoencoder (VAE), called SentiVAE. We evaluate our approach via a downstream text classification task involving nine English-language sentiment analysis datasets; our representation outperforms six individual sentiment lexica, as well as a straightforward combination thereof.
-
Unsupervised Discovery of Gendered Language through Latent-Variable Modeling
Alexander Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Augenstein and Ryan Cotterell. In ACL. 2019
[Link] [Slides]
Abstract
Studying the ways in which language is gendered has long been an area of interest in sociolinguistics. Studies have explored, for example, the speech of male and female characters in film and the language used to describe male and female politicians. In this paper, we aim not to merely study this phenomenon qualitatively, but instead to quantify the degree to which the language used to describe men and women is different and, moreover, different in a positive or negative way. To that end, we introduce a generative latent-variable model that jointly represents adjective (or verb) choice, with its sentiment, given the natural gender of a head (or dependent) noun. We find that there are significant differences between descriptions of male and female nouns and that these differences align with common gender stereotypes: Positive adjectives used to describe women are more often related to their bodies than adjectives used to describe men.
2018
-
Citation Detected: Automated Claim Detection through Natural Language Processing
Alexander Hoyle. University College London, Master's Thesis. 2018
Service, Interests
- At UMD, I was involved in graduate labor advocacy and organizing. I chaired the Data and Research group of the Graduate Assistant Advisory Committee, and I testified on behalf of graduate assistants before the Maryland state legislature.
- I participate in Científico Latino's Graduate Student Mentorship Initiative. Mentees have been accepted to the Meta AI residency program and to UCSD. If you are applying to schools, please contact me and I will be happy to read your materials!
- In my spare time, I like to read, run, cycle, cook, and talk.