My interests span the areas of Natural Language Understanding and Information Retrieval, with a focus on Question Answering (QA), Interpretability, and Collaborative Agents. Currently, my research is concentrated in two primary areas:
Analyzing Large Language Model (LLM)-based agents in terms of their skills and knowledge, alongside question complexity, to curate a minimal yet comprehensive question set for profiling agents within a Mixture of Experts (MoE) framework.
Developing (Retrieval-)Augmented Language Models that are aware of both their environment and their parametric knowledge, so that they produce reliable and helpful responses.
In the past, I spent some time at Google Research, where I collaborated with the Brain team and several Language Research teams, focusing on Model Interpretation and Analysis for Question Answering, Semi-Structured Text Understanding, and Retrieval-Augmented Language Models for long-context understanding.
I have also worked on Computer Vision and Machine Learning problems such as Human Motion Sequence Modeling, Generative and Representation Learning, and Adversarial Machine Learning.
In my free time, I enjoy playing social deduction board games, and I also very much like to play Magic: The Gathering :)
Adversarial datasets should ensure AI robustness that matches human performance. However, as models evolve, datasets can become obsolete. Thus, adversarial datasets should be periodically updated based on their degradation in adversarialness. Given the lack of a standardized metric for measuring adversarialness, we propose AdvScore, a human-grounded evaluation metric. AdvScore assesses a dataset’s true adversarialness by capturing models’ and humans’ varying abilities, while also identifying poor examples. AdvScore then motivates a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and ten language model predictions to track the models’ improvement over five years (from 2020 to 2024). AdvScore assesses whether adversarial datasets remain suitable for model evaluation, measures model improvements, and provides guidance for better alignment with human capabilities.
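For intuition, here is a minimal sketch (not the paper's AdvScore implementation) of a human-grounded check on a question set: score each question by how much more often humans answer it correctly than models do, and flag items that neither group can answer as likely low-quality. The response matrices, thresholds, and the `adversarialness` helper are illustrative assumptions.

```python
import numpy as np

def adversarialness(human_correct: np.ndarray, model_correct: np.ndarray):
    """Hypothetical per-question score: questions that humans tend to answer
    correctly but models tend to miss score highest.

    human_correct: (n_humans, n_questions) binary matrix
    model_correct: (n_models, n_questions) binary matrix
    """
    human_acc = human_correct.mean(axis=0)   # per-question human accuracy
    model_acc = model_correct.mean(axis=0)   # per-question model accuracy
    score = human_acc - model_acc            # > 0: harder for models than humans
    # Flag low-quality items that neither humans nor models can answer.
    poor = (human_acc < 0.25) & (model_acc < 0.25)
    return score, poor

# Example with random responses (9 humans, 4 models, 6 questions).
rng = np.random.default_rng(0)
scores, poor = adversarialness(rng.integers(0, 2, (9, 6)),
                               rng.integers(0, 2, (4, 6)))
print(scores, poor)
```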
@inproceedings{sung2024advscore,title={Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness},author={Sung, Yoo Yeon and Gor, Maharshi and Fleisig, Eve and Mondal, Ishani and Boyd-Graber, Jordan},publisher={Association for Computational Linguistics},booktitle={Nations of the Americas Chapter of the Association for Computational Linguistics},year={2025},month=apr,location={Albuquerque, New Mexico},url={https://arxiv.org/abs/2406.16342},}
Recent advancements in large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of the problem-solving abilities of question-answering (QA) agents: humans and AI systems. Through analysis of over 300,000 responses from 70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA-3-70B show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that not only challenge higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI developments that better emulate or complement human cognitive abilities in real-world problem-solving.
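To give a flavor of the item-response-theory machinery CAIMIRA builds on, the snippet below shows a generic multidimensional IRT response function: an agent's latent skill vector is matched against a question's skill relevance and difficulty to produce a probability of answering correctly. The parameterization, dimensions, and names here are illustrative assumptions rather than CAIMIRA's exact model.

```python
import numpy as np

def p_correct(skill: np.ndarray, relevance: np.ndarray, difficulty: float) -> float:
    """Illustrative multidimensional IRT-style response model.

    skill:      latent skill vector of an agent (human or AI system)
    relevance:  how much each latent skill matters for this question
    difficulty: scalar difficulty of the question
    """
    logit = skill @ relevance - difficulty
    return 1.0 / (1.0 + np.exp(-logit))

# A hypothetical agent strong on abductive reasoning but weak on recall,
# answering a recall-heavy question.
agent_skill = np.array([1.2, -0.3])      # [abduction, recall]
question_rel = np.array([0.2, 0.9])      # question mostly needs recall
print(p_correct(agent_skill, question_rel, difficulty=0.5))
```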
@inproceedings{gor2024caimira,title={Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA},author={Gor, Maharshi and {Daum\'e III}, Hal and Zhou, Tianyi and Boyd-Graber, Jordan},booktitle={Empirical Methods in Natural Language Processing},publisher={Association for Computational Linguistics},year={2024},month=nov,location={Miami, FL},}
The overwhelming vulnerability of deep neural networks to carefully crafted perturbations known as adversarial attacks has led to the development of various training techniques to produce robust models.
While the primary focus of existing approaches has been directed toward addressing the worst-case performance achieved under a single threat model, it is imperative that safety-critical systems are robust with respect to multiple threat models simultaneously.
Existing approaches that address worst-case performance under the union of such threat models (e.g., \(\ell_\infty\), \(\ell_2\), \(\ell_1\)) either utilize adversarial training methods that require multi-step attacks, which are computationally expensive in practice, or rely upon fine-tuning of pre-trained models that are robust with respect to a single threat model.
In this work, we show that by carefully choosing the objective function used for robust training, it is possible to achieve similar, or even improved, worst-case performance over a union of threat models while utilizing only single-step attacks during training, thereby achieving a significant reduction in the computational resources necessary for training.
Furthermore, prior work showed that adversarial training against the \(\ell_1\) threat model is relatively difficult, to the extent that even multi-step adversarially trained models were shown to be prone to gradient masking and catastrophic overfitting.
However, our proposed method, when applied to the \(\ell_1\) threat model specifically, enables us to obtain the first \(\ell_1\)-robust model trained solely with single-step adversarial attacks.
Finally, to demonstrate the merits of our approach, we utilize a modern set of attack evaluations to better estimate the worst-case performance under the considered union of threat models.
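As a rough illustration of training against a union of threat models with only single-step attacks, here is a PyTorch-style sketch: it crafts one single-step perturbation per norm (an FGSM step for \(\ell_\infty\), a normalized gradient step for \(\ell_2\), and a crude one-coordinate step for \(\ell_1\)) and backpropagates through whichever attack is currently worst. The model, the \(\epsilon\) budgets, the input range \([0, 1]\), and the exact \(\ell_1\) step are assumptions for the sketch, not the objective proposed in the paper.

```python
import torch
import torch.nn.functional as F

def single_step_attacks(model, x, y, eps_inf=8/255, eps_2=0.5, eps_1=10.0):
    """One single-step adversarial example per threat model (illustrative)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    flat = grad.flatten(1)

    # l_inf: FGSM sign step.
    x_inf = x + eps_inf * grad.sign()
    # l_2: step along the l2-normalized gradient.
    g2 = flat / (flat.norm(p=2, dim=1, keepdim=True) + 1e-12)
    x_l2 = x + eps_2 * g2.reshape(x.shape)
    # l_1 (crude): spend the whole budget on the largest-gradient coordinate.
    idx = flat.abs().argmax(dim=1, keepdim=True)
    step = torch.zeros_like(flat).scatter_(1, idx, eps_1) * flat.gather(1, idx).sign()
    x_l1 = x + step.reshape(x.shape)

    # Assumes image-like inputs in [0, 1].
    return [adv.clamp(0, 1).detach() for adv in (x_inf, x_l2, x_l1)]

def union_robust_loss(model, x, y):
    """Cross-entropy on whichever single-step attack is currently worst."""
    losses = [F.cross_entropy(model(adv), y)
              for adv in single_step_attacks(model, x, y)]
    return torch.stack(losses).max()
```

In a training loop, `union_robust_loss(model, xb, yb)` would simply replace the standard clean loss before calling `backward()`.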
@inproceedings{sriramanan:gor:feizi-neurips2022,title={Toward Efficient Robust Training against Union of $\ell_p$ Threat Models},author={Sriramanan, Gaurang and Gor, Maharshi and Feizi, Soheil},booktitle={Advances in Neural Information Processing Systems},year={2022},month=dec,location={New Orleans, LA},}
This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables.
Tables are ubiquitous on the web, and are rich in information.
However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens.
Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table.
This architecture scales linearly with respect to speed and memory, and can handle documents containing more than 8000 tokens with current accelerators.
MATE also has a more appropriate inductive bias for tabular data, and sets a new state-of-the-art for three table reasoning datasets.
For HybridQA (Chen et al., 2020b), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
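The core idea of row/column sparse attention can be illustrated with a small mask-building sketch, assuming a flattened table where each token carries a row and column index (and free text such as the question is marked with -1); the function and indexing scheme are hypothetical, not MATE's actual implementation.

```python
import numpy as np

def table_attention_mask(row_ids, col_ids, head_type: str) -> np.ndarray:
    """Boolean (seq_len x seq_len) mask: True = attention allowed.

    row_ids / col_ids: per-token table coordinates (-1 for non-table text).
    head_type: "row" heads attend within the same row, "column" heads within
    the same column; free-text tokens attend and are attended globally here.
    """
    row_ids, col_ids = np.asarray(row_ids), np.asarray(col_ids)
    ids = row_ids if head_type == "row" else col_ids
    same = ids[:, None] == ids[None, :]
    is_text = (row_ids == -1)
    return same | is_text[:, None] | is_text[None, :]

# Question token followed by a 2x2 table flattened row by row.
rows = [-1, 0, 0, 1, 1]
cols = [-1, 0, 1, 0, 1]
print(table_attention_mask(rows, cols, "row").astype(int))
```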
@inproceedings{eisenschlos2021mate,title={MATE: Multi-view Attention for Table Transformer Efficiency},author={Eisenschlos, Julian Martin and Gor, Maharshi and M{\"u}ller, Thomas and Cohen, William Weston},booktitle={Empirical Methods in Natural Language Processing},publisher={Association for Computational Linguistics},month=nov,year={2021},location={Punta Cana},}
The goal of question answering (QA) is to answer any question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, model accuracy analysis reveals little evidence that accuracy is lower for people based on gender or nationality; instead, there is more variation across professions (question topic). But QA’s lack of representation could itself hide evidence of bias, necessitating QA datasets that better represent global diversity.
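Purely as an illustration of the kind of per-attribute accuracy breakdown discussed above, here is a toy pandas sketch over a hypothetical table of questions annotated with the referenced entity's gender, nationality, and profession plus a model's correctness; the analysis in the paper deconfounds these factors more carefully than a raw group mean.

```python
import pandas as pd

# Hypothetical per-question records: the referenced entity's attributes and
# whether a QA model answered the question correctly.
df = pd.DataFrame({
    "gender":      ["female", "male", "male", "female", "male"],
    "nationality": ["IN", "US", "FR", "US", "IN"],
    "profession":  ["scientist", "athlete", "politician", "scientist", "athlete"],
    "correct":     [1, 0, 1, 1, 0],
})

# Raw accuracy broken down by each demographic attribute; a real analysis
# would additionally control for confounds such as question topic.
for attr in ["gender", "nationality", "profession"]:
    print(df.groupby(attr)["correct"].mean().rename(f"accuracy_by_{attr}"), "\n")
```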
@inproceedings{Gor:Webster:Boyd-Graber-2021,title={Toward Deconfounding the Influence of Entity Demographics for Question Answering Accuracy},author={Gor, Maharshi and Webster, Kellie and Boyd-Graber, Jordan},booktitle={Empirical Methods in Natural Language Processing},publisher={Association for Computational Linguistics},month=nov,year={2021},location={Punta Cana},}