Nishant Balepur
Ph.D. Candidate in Computer Science at the University of Maryland, College Park
Email: nbalepur[at]umd[dot]edu
Hi! My name is Nishant and I’m a third-year Ph.D. candidate at the University of Maryland, advised by Professors Jordan Boyd-Graber and Rachel Rudinger. I’m also currently interning at Ai2, where I work on personalizing ScholarQA, and visiting NYU as a researcher working with Eunsol Choi.
Language models are rewarded for being correct and for generating responses humans prefer, but truly helpful systems must do more than that. As a result, I design evaluation, feedback collection, and training protocols that center user needs, grounded in the following research questions:
- How can we build systems that actually help users? [flashcards, study aids, plans for problem-solving]
- How can we rigorously evaluate models? [reasoning (1, 2), artifacts/shortcuts (3, 4, 5), benchmark errors (6, 7, 8), agents (9, 10)]
- How can we personalize models to individual user needs? [personalized post-training, personalized deep research]
The old Nishant worked on making NLP systems more factual [🤓☝️ 11, 12, 13], but I’m now more interested in research that is helpful for humans and fun to read. If you’re interested in similar problems, don’t hesitate to reach out!
And if you’ve seen another “Balepur, N” during your literature search, you may be looking for my sister 😛
📝 Selected Publications
2025
- ACL 2025: *Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above*. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025. Oral at ACL 2025; Best Paper Award and Oral (1.5%) at MASC-SLL 2025.
🥳 Research Highlights
| Date | Update |
|---|---|
| Aug 26, 2025 | We release AstaBench at Ai2, a more rigorous evaluation suite for AI agents. Check out our technical report and leaderboard! |
| Aug 20, 2025 | One paper accepted to EMNLP! We build an interface that helps users solve complex questions with plans, and show that predicting which plans help humans is difficult for humans, reward models, and agents! |
| May 15, 2025 | Two papers accepted to ACL 2025! We design a simple technique to improve DPO’s personalization and make our case for why MCQA is a terrible evaluation format (oral!) |
| May 6, 2025 | I passed my thesis proposal, so I’m Ph-Done! (with being a regular student, as I am now a candidate 🤓☝️). Fun fact: my sister and I proposed our theses on the same day 😁 |
| Apr 5, 2025 | Excited to give an oral presentation on why MCQA sucks at MASC-SLL 2025. Also humbled to win a best paper award! |
| Mar 24, 2025 | Humbled to be invited to speak at Imperial College London on building Helpful QA systems (slides) and at Google Translate’s Reading Group on improving MCQA evals (slides) |
😔 Negative Results
| Date | Setback |
|---|---|
| Aug 8, 2025 | One paper got bad reviews at EMNLP 2025, then was desk rejected from AAAI 2025 (never adding an Appendix again smh) |
| Jul 7, 2025 | One paper rejected from COLM 2025 💪 |
| Jun 11, 2025 | Our Schmidt Science Expression of Interest for AI Safety in the Inference-Time Compute Paradigm was rejected |
| Feb 13, 2025 | One paper got bad reviews in December ARR |
| Dec 19, 2024 | Didn’t get intern/fellow offers after interviewing at Meta, Cohere, and Anthropic |
| Jun 15, 2024 | KAR³L is on its fourth resubmission 🫡 |
| Apr 15, 2024 | One paper not committed to ACL 2024 |
| Feb 15, 2024 | Two papers not committed to NAACL 2024 |
| Feb 10, 2024 | Banned on r/ACT for trying to advertise our KAR³L user study 😭 |
| Oct 6, 2023 | One paper rejected from EMNLP 2023 |
| Mar 20, 2023 | Received my first-ever review score of 1 on an ARR submission |