Bio
Hi! I am Nina, a 4th-year PhD student at the Max Planck Institute for Software Systems, where I am advised by Dr. Manuel Gomez Rodriguez, and at ETH Zurich, where I am co-advised by Prof. Dr. Niko Beerenwinkel under the ELLIS PhD Program. My research interests lie in the design of collaborative decision-support systems that improve human decision-making, leveraging tools from machine learning, causality, and uncertainty estimation. I am particularly motivated by the application of such systems in healthcare to improve patient care.
Publications
Most recent publications on Google Scholar.
‡ indicates equal contribution.
Towards Human-AI Complementarity in Matching Tasks
Adrian Arnaiz-Rodriguez, Nina Corvelo Benz, Suhas Thejaswi, Nuria Oliver, and Manuel Gomez-Rodriguez
arXiv preprint, 2025.
Data-driven algorithmic matching systems promise to help human decision makers make better matching decisions in a wide variety of high-stakes application domains, such as healthcare and social service provision. However, existing systems are not designed to achieve human-AI complementarity: decisions made by a human using an algorithmic matching system are not necessarily better than those made by the human or by the algorithm alone. Our work aims to address this gap. To this end, we propose collaborative matching (comatch), a data-driven algorithmic matching system that takes a collaborative approach: rather than making all the matching decisions for a matching task like existing systems, it selects only the decisions that it is the most confident in, deferring the rest to the human decision maker. In the process, comatch optimizes how many decisions it makes and how many it defers to the human decision maker to provably maximize performance. We conduct a large-scale human subject study with 800 participants to validate the proposed approach. The results demonstrate that the matching outcomes produced by comatch outperform those generated by either human participants or by algorithmic matching on their own. The data gathered in our human subject study and an implementation of our system are available as open source at https://github.com/Networks-Learning/human-AI-complementarity-matching.
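To make the collaborative mechanism concrete, here is a minimal sketch in Python, assuming a square matrix of predicted match qualities and using the raw score of each assignment as the confidence signal; the deferral budget k is a plain input here, whereas comatch provably optimizes it. This is an illustration, not the paper's implementation (which is linked above).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def confident_matching(scores: np.ndarray, k: int):
    """Solve a maximum-weight bipartite matching on predicted match
    qualities, keep the k assignments with the highest scores, and
    defer the remaining rows to the human decision maker."""
    rows, cols = linear_sum_assignment(scores, maximize=True)
    order = np.argsort(scores[rows, cols])[::-1]        # most confident first
    algorithmic = [(int(rows[i]), int(cols[i])) for i in order[:k]]
    deferred = sorted(int(rows[i]) for i in order[k:])  # left to the human
    return algorithmic, deferred

rng = np.random.default_rng(0)
scores = rng.random((5, 5))                             # synthetic match qualities
decided, deferred = confident_matching(scores, k=3)
print("algorithm decides:", decided, "| human decides rows:", deferred)
```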
Evaluation of Large Language Models via Coupled Token Generation
Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, and Manuel Gomez-Rodriguez
arXiv preprint, 2025.
State of the art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite amount of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama family. We find that, across multiple knowledge areas from the popular MMLU benchmark dataset, coupled autoregressive generation requires up to 40% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, using data from the LMSYS Chatbot Arena platform, we find that the win-rates derived from pairwise comparisons by a strong large language model differ under coupled and vanilla autoregressive generation.
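The coupling itself is simple to illustrate. Below is a toy sketch in which two "models" are given directly as next-token logits over a six-token vocabulary (an assumption for brevity; the paper works with actual Llama-family models): each step draws one shared Gumbel noise vector, and every model selects its token via the Gumbel-Max trick on its own logits plus that common noise.

```python
import numpy as np

def coupled_next_token(logits_per_model, rng):
    """Gumbel-Max sampling with a *shared* noise vector: each model adds
    the same Gumbel draw to its own logits and takes the argmax, so all
    models respond to a common source of randomness."""
    g = rng.gumbel(size=logits_per_model[0].shape[0])
    return [int(np.argmax(l + g)) for l in logits_per_model]

rng = np.random.default_rng(42)
model_a = np.log([0.30, 0.25, 0.20, 0.10, 0.10, 0.05])  # toy next-token distributions
model_b = np.log([0.28, 0.27, 0.20, 0.10, 0.10, 0.05])
draws = [coupled_next_token([model_a, model_b], rng) for _ in range(1000)]
print("agreement rate:", np.mean([a == b for a, b in draws]))
# Nearly identical models now agree far more often than under independent sampling.
```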
Human-Alignment Influences the Utility of AI-assisted Decision Making
Nina Corvelo Benz and Manuel Gomez-Rodriguez
Scientific Reports, 2025.
Whenever an AI model is used to predict a relevant (binary) outcome in AI-assisted decision making, it is widely agreed that, together with each prediction, the model should provide an AI confidence value. However, it has been unclear why decision makers often have difficulty developing a good sense of when to trust a prediction using AI confidence values. Very recently, Corvelo Benz and Gomez Rodriguez have argued that, for rational decision makers, the utility of AI-assisted decision making is inherently bounded by the degree of alignment between the AI confidence values and the decision maker’s confidence on their own predictions. In this work, we empirically investigate to what extent the degree of alignment actually influences the utility of AI-assisted decision making. To this end, we design and run a large-scale human subject study (n = 703) where participants solve a simple decision making task—an online card game—assisted by an AI model with a steerable degree of alignment. Our results show a positive association between the degree of alignment and the utility of AI-assisted decision making. In addition, our results also show that post-processing the AI confidence values to achieve multicalibration with respect to the participants’ confidence on their own predictions increases both the degree of alignment and the utility of AI-assisted decision making.
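As a rough sketch of this kind of post-processing, consider the following binning approximation on synthetic data (the study's actual construction and data differ): calibration examples are grouped jointly by human and AI confidence, and each AI confidence value is replaced by the empirical positive rate of its cell, so the recalibrated scores are calibrated within every human-confidence group.

```python
import numpy as np

def multicalibrate(ai_conf, human_conf, y, n_bins=5):
    """Bin calibration data jointly by human and AI confidence and replace
    each AI confidence value with its cell's empirical positive rate, so
    the new scores are calibrated within every human-confidence bin."""
    edges = np.linspace(0, 1, n_bins + 1)
    bin_of = lambda x: np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    h, a = bin_of(human_conf), bin_of(ai_conf)
    table = np.zeros((n_bins, n_bins))
    for i in range(n_bins):
        for j in range(n_bins):
            cell = (h == i) & (a == j)
            table[i, j] = y[cell].mean() if cell.any() else 0.5
    return lambda hc, ac: table[bin_of(hc), bin_of(ac)]

rng = np.random.default_rng(1)
hc, ac = rng.random(5000), rng.random(5000)                 # synthetic confidences
y = (rng.random(5000) < 0.5 * hc + 0.5 * ac).astype(float)  # synthetic outcomes
recal = multicalibrate(ac, hc, y)
print(recal(np.array([0.9]), np.array([0.3])))              # recalibrated AI confidence
```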
Detecting Antimicrobial Resistance Through MALDI-TOF Mass Spectrometry with Statistical Guarantees Using Conformal Prediction
Nina Corvelo Benz‡, Lucas Miranda‡, Dexiong Chen, Karsten Borgwardt, Janko Sattler
29th Conference on Research in Computational Molecular Biology, 2025.
Antimicrobial resistance (AMR) is a global health challenge, complicating the treatment of bacterial infections and leading to higher patient morbidity and mortality. Rapid and reliable identification of resistant pathogens is crucial to guide early and effective therapeutic interventions. However, traditional culture-based methods are time-consuming, highlighting the need for faster predictive approaches. Machine learning models trained on MALDI-TOF mass spectrometry data, readily collected in most clinics for fast species identification, offer promise but face limitations in clinical applicability, particularly due to their lack of comprehensive, statistically valid uncertainty estimates. Here, we introduce an AMR prediction framework that addresses this gap with a novel knowledge-graph-enhanced conformal predictor. Conformal prediction (CP) constructs prediction sets with statistical coverage guarantees, ensuring that bacterial resistance to a certain antibiotic is flagged with a specified error rate. Our proposed conformal predictor constructs improved prediction sets over standard CP approaches by using a knowledge graph capturing the interdependencies in antibiotic resistance patterns. In addition, we introduce a novel classifier framework that improves upon previous multimodal models by incorporating multigraph-based antibiotic representations using state-of-the-art self-supervised methods. Besides increasing resistance detection for most tested species-drug combinations, the presented architecture, termed ResMLP-GNN, overcomes the limitations of previous efforts and supports multi-drug antibiotics that are highly relevant in clinical practice. We successfully evaluated our approach on a set of highly relevant antibiotics, commonly used in clinics to treat infections with Klebsiella pneumoniae and Escherichia coli.
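For readers new to conformal prediction, the following is a sketch of the standard split CP baseline on synthetic class probabilities (the paper's knowledge-graph-enhanced predictor refines this construction): calibrate a threshold on nonconformity scores so that the prediction sets attain marginal coverage at level 1 - alpha.

```python
import numpy as np

def split_conformal(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification: calibrate a threshold
    on nonconformity scores so that each prediction set contains the true
    label with probability >= 1 - alpha (marginal coverage guarantee)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return [np.flatnonzero(1.0 - p <= q).tolist() for p in test_probs]

rng = np.random.default_rng(7)
cal = rng.dirichlet(np.ones(3), size=200)                # synthetic probabilities
labels = np.array([rng.choice(3, p=p) for p in cal])
print(split_conformal(cal, labels, rng.dirichlet(np.ones(3), size=3)))
```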
Counterfactual Token Generation in Large Language Models
Ivi Chatzi‡, Nina Corvelo Benz‡, Eleni Straitouri‡, Stratis Tsirtsis‡, Manuel Gomez-Rodriguez
4th Conference on Causal Learning and Reasoning, 2025.
“Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom’s Fury, gazing out at the endless sea. [...] Lyra’s eyes welled up with tears as she realized the bitter truth – she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself.” Although this story, generated by a large language model, is captivating, one may wonder—how would the story have unfolded if the model had chosen “Captain Maeve” as the protagonist instead? We cannot know. State-of-the-art large language models are stateless—they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning or prompt engineering. We implement our model on Llama 3 8B-Instruct and Ministral-8B-Instruct, and conduct a qualitative and a quantitative analysis of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
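The mechanism is compact enough to show in miniature. The sketch below assumes a made-up six-token vocabulary and a deterministic toy stand-in for the language model (the paper implements this on Llama 3 8B-Instruct and Ministral-8B-Instruct): record the Gumbel noise used at every step, then replay generation under the same noise after intervening on a token.

```python
import numpy as np

VOCAB = ["Lyra", "Maeve", "stood", "sailed", "wept", "smiled"]

def toy_logits(prefix):
    """Deterministic stand-in for a language model: next-token logits
    that depend only on the tokens generated so far."""
    seed = abs(hash(tuple(prefix))) % (2**32)
    return np.random.default_rng(seed).normal(size=len(VOCAB))

def generate(n_steps, noise=None, interventions=None):
    """Gumbel-Max token sampling. Replaying with the same recorded
    `noise` after an intervention yields the counterfactual text."""
    if noise is None:
        noise = np.random.default_rng(0).gumbel(size=(n_steps, len(VOCAB)))
    seq = []
    for t in range(n_steps):
        if interventions and t in interventions:
            tok = interventions[t]                      # do(token_t = tok)
        else:
            tok = int(np.argmax(toy_logits(seq) + noise[t]))
        seq.append(tok)
    return [VOCAB[t] for t in seq], noise

factual, noise = generate(4)
counterfactual, _ = generate(4, noise=noise,
                             interventions={0: VOCAB.index("Maeve")})
print("factual:       ", factual)
print("counterfactual:", counterfactual)
```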
Matchings, Predictions and Counterfactual Harm in Refugee Resettlement Processes
Seungeon Lee, Nina Corvelo Benz, Suhas Thejaswi, Manuel Gomez-Rodriguez
4th Conference on Causal Learning and Reasoning, 2025.
Resettlement agencies have started to adopt data-driven algorithmic matching to match refugees to locations using employment rate as a measure of utility. Given a pool of refugees, data-driven algorithmic matching utilizes a classifier to predict the probability that each refugee would find employment at any given location. Then, it uses the predicted probabilities to estimate the expected utility of all possible placement decisions. Finally, it finds the placement decisions that maximize the predicted utility by solving a maximum weight bipartite matching problem. In this work, we argue that, using existing solutions, there may be pools of refugees for which data-driven algorithmic matching is (counterfactually) harmful—it would have achieved lower utility than a given default policy used in the past, had it been used. Then, we develop a post-processing algorithm that, given placement decisions made by a default policy on a pool of refugees and their employment outcomes, solves an inverse matching problem to minimally modify the predictions made by a given classifier. Under these modified predictions, the optimal matching policy that maximizes predicted utility on the pool is guaranteed to be not harmful. Further, we introduce a Transformer model that, given placement decisions made by a default policy on multiple pools of refugees and their employment outcomes, learns to modify the predictions made by a classifier so that the optimal matching policy that maximizes predicted utility under the modified predictions on an unseen pool of refugees is less likely to be harmful than under the original predictions. Experiments on simulated resettlement processes using synthetic refugee data created from publicly available sources suggest that our methodology may be effective in making algorithmic placement decisions that are less likely to be harmful than existing solutions.
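The underlying matching step (predicted employment probabilities fed into a maximum-weight bipartite matching) can be sketched generically, as below with synthetic probabilities and per-location capacities; the paper's contribution, the harm-aware post-processing and the Transformer model, operates on top of this step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def place(prob: np.ndarray, capacity: np.ndarray):
    """Maximum-weight placement of refugees to locations. Duplicating each
    location's column once per available slot reduces the capacitated
    problem to a standard assignment problem.

    prob: (n_refugees, n_locations) predicted employment probabilities."""
    slots = np.repeat(np.arange(len(capacity)), capacity)  # one column per slot
    r, c = linear_sum_assignment(prob[:, slots], maximize=True)
    return sorted(zip(r.tolist(), slots[c].tolist()))      # (refugee, location)

rng = np.random.default_rng(3)
prob = rng.random((4, 2))                                  # synthetic predictions
print(place(prob, capacity=np.array([2, 2])))
```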
Human-Aligned Calibration for AI-Assisted Decision Making
Nina Corvelo Benz and Manuel Gomez-Rodriguez
37th Conference on Neural Information Processing Systems, 2023.
Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulty developing a good sense of when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values—an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker’s confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone in the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker’s confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.
Counterfactual Inference of Second Opinions
Nina Corvelo Benz and Manuel Gomez-Rodriguez
38th Conference on Uncertainty in Artificial Intelligence, 2022.
Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources—they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support system from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.
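A toy rendering of the common-noise construction, with hand-picked expert probability tables rather than the paper's similarity structure estimated from data: every expert's prediction is generated by a Gumbel-Max mechanism driven by one shared noise vector, so experts with similar class probabilities produce strongly correlated opinions, which is what makes second opinions inferable.

```python
import numpy as np

def opinions(expert_probs, rng):
    """Draw every expert's prediction via Gumbel-Max with one common noise
    vector: argmax over (log class probabilities + shared Gumbel noise)."""
    g = rng.gumbel(size=expert_probs.shape[1])
    return np.argmax(np.log(expert_probs) + g, axis=1)

rng = np.random.default_rng(5)
P = np.array([[0.60, 0.20, 0.10, 0.10],    # expert 0
              [0.55, 0.25, 0.10, 0.10],    # expert 1: similar to expert 0
              [0.10, 0.10, 0.20, 0.60]])   # expert 2: dissimilar
draws = [opinions(P, rng) for _ in range(1000)]
print("0-1 agree:", np.mean([d[0] == d[1] for d in draws]))
print("0-2 agree:", np.mean([d[0] == d[2] for d in draws]))
# Similar experts agree almost always; dissimilar ones rarely do.
```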
Call admission problems on trees
Hans-Joachim Böckenhauer, Nina Corvelo Benz, and Dennis Komm
Theoretical Computer Science, Volume 922, 2022.
Resume
- MPI for Software Systems, Germany (2021 - present): Ph.D. Student in Computer Science, Human-Centric Machine Learning
- Oracle Labs, Switzerland (June - Nov 2020): Research Scientist (Intern)
- ETH Zurich, Switzerland (2018 - 2020): M.Sc. Student, Computing Science
- POSTECH, South Korea (Feb - June 2017): Exchange Program, Computer Science
- ETH Zurich, Switzerland (2014 - 2018): B.Sc. Student, Computing Science
