Nitish Joshi
I recently joined Google DeepMind as a Research Scientist to work on Gemini post-training.
I completed my PhD at New York University, where I was advised by Prof. He He in the ML2 research group. My research was supported by the NSF Graduate Research Fellowship and the NYU Dean's Dissertation Fellowship. During my PhD, I spent fun summers interning at Google Gemini/Bard and Amazon AWS. Previously, I completed my undergraduate degree in Computer Science at IIT Bombay, where I did research with Preethi Jyothi and with Mohit Bansal (at UNC Chapel Hill).
Email: joshinh@gmail.com / nitish@nyu.edu
Links: [CV] [Twitter] [Github] [Google Scholar]
Publications
-
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang*, Nitish Joshi*, Barbara Plank, Rico Angell, He He
Preprint, 2025
[bib] [abstract]@article{wang2025hacking, title={Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort}, author={Xinpeng Wang and Nitish Joshi and Barbara Plank and Rico Angell and He He}, year={2025}, eprint={2510.01367}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2510.01367}, }Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less 'effort' than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
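The core measurement behind TRACE can be sketched in a few lines. The snippet below is an illustrative outline only; force_answer and reward_fn are hypothetical stand-ins for the model call and the reward computation, not the paper's code.
# Illustrative sketch of TRACE: truncate the chain-of-thought at increasing
# fractions, force an answer from each prefix, and compute the area under the
# expected-reward-vs-truncation curve. `force_answer` and `reward_fn` are
# hypothetical stand-ins for the model call and the reward computation.
def trace_auc(cot_tokens, force_answer, reward_fn, num_cutoffs=11):
    fractions = [i / (num_cutoffs - 1) for i in range(num_cutoffs)]
    rewards = []
    for f in fractions:
        prefix = cot_tokens[: int(f * len(cot_tokens))]
        answer = force_answer(prefix)      # model must answer from the truncated CoT
        rewards.append(reward_fn(answer))  # expected reward at this cutoff
    # Trapezoidal area: a hacking model reaches high reward early, so its AUC is large.
    return sum((rewards[i] + rewards[i + 1]) / 2 * (fractions[i + 1] - fractions[i])
               for i in range(num_cutoffs - 1))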
-
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors
Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He
Preprint, 2025
[bib] [abstract]@article{YuehHan2025MonitoringDA, title={Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors}, author={Chen Yueh-Han and Nitish Joshi and Yulin Chen and Maksym Andriushchenko and Rico Angell and He He}, journal={ArXiv}, year={2025}, volume={abs/2506.10949}, }Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attack is broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intents. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3 mini as a monitor. Moreover, it remains robust against random task injection and cuts cost by 90% and latency by 50%. Our findings suggest that lightweight sequential monitors are highly effective in mitigating decomposition attacks and are viable in deployment.
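For intuition, a cumulative monitoring loop of the kind described above might look like the sketch below; judge_llm is a hypothetical call to a small prompted monitor, and the paper's actual prompts and thresholds differ.
# Illustrative sketch of a lightweight sequential monitor: each incoming subtask
# is judged against the accumulated request history, not in isolation.
# `judge_llm` is a hypothetical callable returning a risk score in [0, 1].
def sequential_monitor(subtasks, judge_llm, threshold=0.5):
    history = []
    for subtask in subtasks:
        history.append(subtask)
        prompt = ("Rate from 0 to 1 how likely it is that the following sequence of "
                  "requests jointly serves a harmful goal:\n" + "\n".join(history))
        if judge_llm(prompt) >= threshold:
            return "refuse"  # block before the decomposed attack completes
    return "allow"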
-
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar
Preprint, 2025
[bib] [abstract]@article{Bharadwaj2025FlatteryFA, title={Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models}, author={Anirudh Bharadwaj and Chaitanya Malaviya and Nitish Joshi and Mark Yatskar}, journal={ArXiv}, year={2025}, volume={abs/2506.05339}, }Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.
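As a rough illustration of the counterfactual-data-augmentation recipe, one can pair each response with a copy that magnifies a single bias feature (sycophancy in this toy example). The paper synthesizes such contrastive pairs with a model rather than a fixed template, so the code below is only a sketch of the data format.
# Toy illustration of counterfactual data augmentation for preference models:
# pair a response with a copy whose only change is a magnified bias feature
# (here, sycophantic padding), and label the unbiased response as preferred.
def make_cda_pair(prompt, response):
    sycophantic = ("What a fantastic question! You clearly have great instincts. "
                   + response + " I hope this wonderfully thorough answer helps!")
    return {"prompt": prompt, "chosen": response, "rejected": sycophantic}

example = make_cda_pair("What causes tides?",
                        "Tides are driven mainly by the Moon's gravitational pull.")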
-
Transformers Struggle to Learn to Search
Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He
ICLR 2025
[bib] [abstract]@inproceedings{ saparov2025transformers, title={Transformers Struggle to Learn to Search}, author={Abulhair Saparov and Srushti Ajay Pawar and Shreyas Pimpalgaonkar and Nitish Joshi and Richard Yuanzhe Pang and Vishakh Padmakumar and Mehran Kazemi and Najoung Kim and He He}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025}, url={https://openreview.net/forum?id=9cQB1Hwrtw} }Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that for each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in the number of layers. However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that increasing model scale will not lead to robust search abilities. We also find that performing search in-context (i.e., chain-of-thought) does not resolve this inability to learn to search on larger graphs.
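The reachable-set computation attributed to the trained transformer has a simple algorithmic analogue. The plain-Python sketch below (not the paper's mechanistic-interpretability code) unions each vertex's reachable set with the sets of its reachable vertices once per layer, which is why the search horizon can grow exponentially with depth.
# Plain-Python analogue of the layered reachability computation described above:
# each vertex keeps the set of vertices it can reach, and each "layer" unions in
# the sets of the vertices already reached, so the horizon roughly doubles per layer.
def layered_reachability(edges, num_vertices, num_layers):
    reach = {v: {u for (s, u) in edges if s == v} for v in range(num_vertices)}
    for _ in range(num_layers):
        reach = {v: reach[v].union(*(reach[u] for u in reach[v]))
                 for v in range(num_vertices)}
    return reach

# On the path 0 -> 1 -> 2 -> 3, two layers already connect vertex 0 to vertex 3:
print(layered_reachability({(0, 1), (1, 2), (2, 3)}, 4, 2)[0])  # {1, 2, 3}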
-
LLMs Are Prone to Fallacies in Causal Inference
Nitish Joshi, Abulhair Saparov, Yixin Wang, He He
EMNLP 2024
[bib] [abstract]@article{joshi2024llmspronefallaciescausal, title={LLMs Are Prone to Fallacies in Causal Inference}, author={Nitish Joshi and Abulhair Saparov and Yixin Wang and He He}, journal={arXiv preprint arXiv:2406.12158}, year={2024}, }Recent work shows that causal facts can be effectively extracted from LLMs through prompting, facilitating the creation of causal graphs for causal inference tasks. However, it is unclear if this success is limited to explicitly-mentioned causal facts in the pretraining data which the model can memorize. Thus, this work investigates: Can LLMs infer causal relations from other relational data in text? To disentangle the role of memorized causal facts vs inferred causal relations, we finetune LLMs on synthetic data containing temporal, spatial and counterfactual relations, and measure whether the LLM can then infer causal relations. We find that: (a) LLMs are susceptible to inferring causal relations from the order of two entity mentions in text (e.g. X mentioned before Y implies X causes Y); (b) if the order is randomized, LLMs still suffer from the post hoc fallacy, i.e. X occurs before Y (temporal relation) implies X causes Y. We also find that while LLMs can correctly deduce the absence of causal relations from temporal and spatial relations, they have difficulty inferring causal relations from counterfactuals, questioning their understanding of causality.
-
Personas as a Way to Model Truthfulness in Language Models
Nitish Joshi*, Javier Rando*, Abulhair Saparov, Najoung Kim, He He
EMNLP 2024
[bib] [abstract]@article{joshi2023personas, title={Personas as a Way to Model Truthfulness in Language Models}, author={Nitish Joshi and Javier Rando and Abulhair Saparov and Najoung Kim and He He}, journal={arXiv preprint arXiv:2310.18168}, year={2023} }Large Language Models are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradicting data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetics as a synthetic environment, we show that language models can separate true and false statements, and generalize truthfulness across agents; but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
-
Nuisances via Negativa: Adjusting for Spurious Correlations via Data Augmentation
Aahlad Puli, Nitish Joshi, Yoav Wald, He He, and Rajesh Ranganath
TMLR, 2024
[bib] [abstract]@article{Puli2022NuisancesVN, title={Nuisances via Negativa: Adjusting for Spurious Correlations via Data Augmentation}, author={Aahlad Manas Puli and Nitish Joshi and He He and Rajesh Ranganath}, journal={ArXiv}, year={2022}, volume={abs/2210.01302} }There exist features that are related to the label in the same way across different settings for that task; these are semantic features or semantics. Features with varying relationships to the label are nuisances. For example, in detecting cows from natural images, the shape of the head is a semantic and because images of cows often have grass backgrounds but only in certain settings, the background is a nuisance. Relationships between a nuisance and the label are unstable across settings and, consequently, models that exploit nuisance-label relationships face performance degradation when these relationships change. Direct knowledge of a nuisance helps build models that are robust to such changes, but knowledge of a nuisance requires extra annotations beyond the label and the covariates. In this paper, we develop an alternative way to produce robust models by data augmentation. These data augmentations corrupt semantic information to produce models that identify and adjust for where nuisances drive predictions. We study semantic corruptions in powering different robust-modeling methods for multiple out-of distribution (OOD) tasks like classifying waterbirds, natural language inference, and detecting Cardiomegaly in chest X-rays.
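A minimal example of a semantic corruption for text, assuming word-order shuffling as the corruption (the paper studies several corruptions across vision and language): the sentence's meaning is destroyed while nuisance signals such as lexical overlap survive, so the corrupted copy can serve as a nuisance proxy for robust-modeling methods.
# Toy semantic corruption for text, assuming word-order shuffling as the corruption
# (the paper studies several corruptions): shuffling destroys the sentence's meaning
# while preserving nuisance signals such as lexical overlap.
import random

def shuffle_corruption(sentence, seed=0):
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

premise = "A man is not playing a guitar on the stage"
print(shuffle_corruption(premise))  # same words, no recoverable semantics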
-
Improving Multi-Hop Reasoning in LLMs by Learning from Rich Human Feedback
Nitish Joshi, Koushik Kalyanaraman, Zhiting Hu, Kumar Chellapilla, He He, Li Erran Li
NucLeaR Workshop, AAAI 2024
[bib] [abstract]@inproceedings{JoshiImprovingMR, title={Improving Multi-Hop Reasoning in LLMs by Learning from Rich Human Feedback}, author={Nitish Joshi and Koushik Kalyanaraman and Zhiting Hu and Kumar Chellapilla and He He and Li Erran Li}, url={https://api.semanticscholar.org/CorpusID:267761309} }Recent large language models (LLMs) have enabled tremendous progress in natural language understanding. However, they are prone to generating confident but nonsensical explanations, which poses a significant obstacle to establishing trust with users. In this work, we show how to incorporate human feedback on incorrect reasoning chains for multi-hop reasoning to improve performance on these tasks. Instead of collecting the reasoning chains from scratch by asking humans, we learn from rich human feedback on model-generated reasoning chains using the prompting abilities of the LLMs. We collect two such datasets of human feedback in the form of (correction, explanation, error type) for the StrategyQA and Sports Understanding datasets, and evaluate several common algorithms for learning from such feedback. Our proposed methods perform competitively with chain-of-thought prompting using the base Flan-T5, and are better at judging the correctness of their own answers.
-
Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim*, He He*
NeurIPS 2023
[bib] [abstract]@article{saparov2023testing, title={Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples}, author={Saparov, Abulhair and Pang, Richard Yuanzhe and Padmakumar, Vishakh and Joshi, Nitish and Kazemi, Seyed Mehran and Kim, Najoung and He, He}, journal={arXiv preprint arXiv:2305.15269}, year={2023} }Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to longer and compositional proofs. However, they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
-
Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations
Chenglei Si*, Dan Friedman*, Nitish Joshi, Shi Feng, Danqi Chen, He He
ACL 2023
[bib] [abstract]@inproceedings{chenglei2023inductive, title={Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations}, author={Si, Chenglei and Friedman, Dan and Joshi, Nitish and Feng, Shi and Chen, Danqi and He, He}, booktitle={Association for Computational Linguistics (ACL)}, year={2023} }In-context learning (ICL) is an emergent paradigm for adapting large language models (LLMs) to new tasks, but the generalization behavior of ICL remains poorly understood. We investigate the inductive biases of ICL from the perspective of feature bias: which features ICL is more likely to use given a set of underspecified demonstrations in which two features are equally predictive of the labels. First, we characterize the feature biases of GPT-3 models by constructing underspecified demonstrations from a range of NLP datasets and feature combinations. We find that LLMs exhibit clear feature biases---for example, demonstrating a strong bias to predict labels according to sentiment rather than shallow lexical features, like punctuation. Second, we evaluate the effect of different interventions that are designed to impose an inductive bias in favor of a particular feature, such as adding a natural-language instruction or using semantically relevant label words. We find that, while many interventions can influence the learner to prefer a particular feature, it can be difficult to overcome strong feature biases. Overall, our results provide a broader picture of the kinds of features ICL may be more likely to exploit, and how to impose inductive biases that are better aligned with the intended task.
-
Are All Spurious Features in Natural Language Alike? An Analysis through a Causal Lens
Nitish Joshi*, Xiang Pan* and He He
EMNLP 2022
[bib] [abstract]@inproceedings{joshi2022spurious, author={Nitish Joshi and Xiang Pan and He He}, title={Are All Spurious Features in Natural Language Alike? An Analysis through a Causal Lens}, booktitle={EMNLP}, year={2022} }The term ‘spurious correlations’ has been used in NLP to informally denote any undesirable feature-label correlations. However, a correlation can be undesirable because (i) the feature is irrelevant to the label (e.g. punctuation in a review), or (ii) the feature’s effect on the label depends on the context (e.g. negation words in a review), which is ubiquitous in language tasks. While we want the model prediction to be independent of the feature in (i) since it is neither necessary nor sufficient, an ideal model (e.g. humans) must rely on the feature in (ii) since it is necessary but not sufficient. Therefore, a more fine-grained treatment of spurious correlations is needed to address the problem. We formalize this distinction by using a causal model and probabilities of necessity and sufficiency to delineate the causal relations between a feature and a label. We then show that this distinction helps explain results of existing debiasing methods on different spurious features, and demystifies surprising results such as the encoding of spurious features in model representations after debiasing.
-
QuALITY: Question Answering with Long Input Texts, Yes!
Richard Yuanzhe Pang*, Alicia Parrish*, Nitish Joshi*, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman
NAACL 2022
[bib] [abstract]@article{pang2021quality, title={{QuALITY}: Question Answering with Long Input Texts, Yes!}, author={Pang, Richard Yuanzhe and Parrish, Alicia and Joshi, Nitish and Nangia, Nikita and Phang, Jason and Chen, Angelica and Padmakumar, Vishakh and Ma, Johnny and Thompson, Jana and He, He and Bowman, Samuel R.}, journal={arXiv preprint arXiv:2112.08608}, year={2021} }To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).
-
An Investigation of the (In)effectiveness of Counterfactually Augmented Data
Nitish Joshi and He He
ACL 2022
[bib] [abstract]@article{joshi2021investigation, author={Nitish Joshi and He He}, title={An Investigation of the (In)effectiveness of Counterfactually Augmented Data}, journal={arXiv:2107.00753}, year={2021} }While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD) -- data generated by minimally perturbing examples to flip the ground-truth label -- to identify robust features that are invariant under distribution shift. However, empirical results using CAD for OOD generalization have been mixed. To explain this discrepancy, we draw insights from a linear Gaussian model and demonstrate the pitfalls of CAD. Specifically, we show that (a) while CAD is effective at identifying robust features, it may prevent the model from learning unperturbed robust features, and (b) CAD may exacerbate existing spurious correlations in the data. Our results show that the lack of perturbation diversity in current CAD datasets limits its effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbation of examples.
-
Coupled Training of Sequence-to-Sequence Models for Accented Speech Recognition
Vinit Unni*, Nitish Joshi*, and Preethi Jyothi
ICASSP 2020
[bib] [abstract]@inproceedings{UnniJoshi2020coupled, author={Vinit Unni and Nitish Joshi and Preethi Jyothi}, booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, title={Coupled Training of Sequence-to-Sequence Models for Accented Speech Recognition}, year={2020} }Accented speech poses significant challenges for state-of-the-art automatic speech recognition (ASR) systems. Accent is a property of speech that lasts throughout an utterance in varying degrees of strength. This makes it hard to isolate the influence of accent on individual speech sounds. We propose coupled training for encoder-decoder ASR models that acts on pairs of utterances corresponding to the same text spoken by speakers with different accents. This training regime introduces an L2 loss between the attention-weighted representations corresponding to pairs of utterances with the same text, thus acting as a regularizer and encouraging representations from the encoder to be more accent-invariant. We focus on recognizing accented English samples from the Mozilla Common Voice corpus. We obtain significant error rate reductions on accented samples from a large set of diverse accents using coupled training. We also show consistent improvements in performance on heavily accented samples (as determined by a standalone accent classifier).
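The coupled-training regularizer can be written as a small PyTorch sketch, under the assumption that each paired representation is an attention-weighted sum of encoder states; the exact pooling and loss weighting follow the paper.
# Sketch of the coupled-training regularizer (assumption: the paired representation
# is an attention-weighted sum of encoder states). The L2 penalty between the two
# pooled vectors is added to the usual sequence-to-sequence losses for both utterances.
import torch

def coupled_l2_loss(enc_a, enc_b, attn_a, attn_b):
    """enc_*: (T, D) encoder states; attn_*: (T,) attention weights."""
    pooled_a = (attn_a.unsqueeze(-1) * enc_a).sum(dim=0)
    pooled_b = (attn_b.unsqueeze(-1) * enc_b).sum(dim=0)
    return torch.sum((pooled_a - pooled_b) ** 2)

# total_loss = ce_loss_a + ce_loss_b + lambda_couple * coupled_l2_loss(ea, eb, aa, ab)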
-
Explore, Propose and Assemble: An Interpretable Model for Multi-hop Reading Comprehension
Yichen Jiang*, Nitish Joshi*, Yen-Chun Chen and Mohit Bansal
ACL 2019
[bib] [abstract] [code]@inproceedings{JiangJoshi2019epar, author={Yichen Jiang and Nitish Joshi and Yen-Chun Chen and Mohit Bansal}, booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, title={Explore, Propose, and Assemble: An Interpretable Model for Multi-Hop Reading Comprehension}, year={2019} }Multi-hop reading comprehension requires the model to explore and connect relevant information from multiple sentences/documents in order to answer the question about the context. To achieve this, we propose an interpretable 3-module system called Explore-Propose-Assemble reader (EPAr). First, the Document Explorer iteratively selects relevant documents and represents divergent reasoning chains in a tree structure so as to allow assimilating information from all chains. The Answer Proposer then proposes an answer from every root-to-leaf path in the reasoning tree. Finally, the Evidence Assembler extracts a key sentence containing the proposed answer from every path and combines them to predict the final answer. Intuitively, EPAr approximates the coarse-to-fine-grained comprehension behavior of human readers when facing multiple long documents. We jointly optimize our 3 modules by minimizing the sum of losses from each stage conditioned on the previous stage’s output. On two multi-hop reading comprehension datasets WikiHop and MedHop, our EPAr model achieves significant improvements over the baseline and competitive results compared to the state-of-the-art model. We also present multiple reasoning-chain-recovery tests and ablation studies to demonstrate our system’s ability to perform interpretable and accurate reasoning.
-
Cross-lingual Training for Automatic Question Generation
Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee, Ganesh Ramakrishnan, and Preethi Jyothi
ACL 2019
[bib] [abstract] [dataset]@inproceedings{kumar-etal-2019-cross, author={Vishwajeet Kumar and Nitish Joshi and Arijit Mukherjee and Ganesh Ramakrishnan and Preethi Jyothi}, booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, title={Cross-Lingual Training for Automatic Question Generation}, year={2019} }Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances where each instance is a question and its corresponding answer. For a new language, such training instances are hard to obtain, making the QG problem even more challenging. Using this as our motivation, we study the reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG model for a primary language (e.g. Hindi) of interest. For the primary language, we assume access to a large amount of monolingual text but only a small QG dataset. We propose a cross-lingual QG model which uses the following training regime: (i) unsupervised pretraining of language models in both primary and secondary languages and (ii) joint supervised training for QG in both languages. We demonstrate the efficacy of our proposed approach using two different primary languages, Hindi and Chinese. Our proposed framework clearly outperforms a number of baseline models, including a fully-supervised transformer-based model trained on the QG datasets in the primary language. We also create and release a new question answering dataset for Hindi consisting of 6555 sentences.
Miscellany
- In my free time, I enjoy birding, running, and reading books.
- The source code for this website was borrowed from Nelson Liu (https://nelsonliu.me).