About
Hello! I am a Ph.D. student at the Carnegie Mellon University Language Technologies Institute, co-advised by Graham Neubig and Sean Welleck. I was a research intern at FAIR (Meta) in 2025 and will join again in Summer 2026. I obtained my M.S. in AI at KAIST AI, where I was fortunate to be advised by Minjoon Seo. Prior to that, I was a research intern at NAVER AI Lab and LG AI Research, and did my B.S. in CS at Yonsei University. My primary research focus is LLM Evaluation and AI for Science. In particular, I aim to develop better evaluation frameworks and benchmarks that systematically identify weaknesses in LLMs and AI scientist agents, and to develop synthetic data generation, post-training, and inference methods to solve the most challenging problems in science and engineering domains.
I am hosting weekly office hours to discuss research projects or talk about Ph.D. applications. Please sign up at this Calendly Link!
News
2026 Mentorship Hosting
I am looking for two mentees to collaborate with throughout 2026. This initiative is intended for newcomers in the very early stages of AI research (e.g., undergraduate and master's students who recently published their first paper), rather than for already experienced researchers. Ideally, mentees are expected to commit around 20 hours per week, and we will work together for approximately 4-6 months with the goal of publishing a paper at a top ML conference (NeurIPS, ICLR, or ICML). If you are interested in one of the two topics below, please send me an email and we can discuss the details more thoroughly!
(1) Developing Korean Agentic Benchmarks: I aim to address the question: "How can agentic benchmarks such as WebArena, OSWorld, and TauBench be adapted to evaluate agents in ways that align with Korean culture and Korea-specific contexts?" In particular, the fact that an agent can successfully handle instructions such as "Tell me the full address of all US international airports that are within a driving distance of 60 km of Niagara Falls" does not guarantee that it can competently execute instructions (originally given in Korean) like "Check whether the bus or a taxi is currently faster from Gyodae Station to Sinchon Station, and if the taxi is more than 10 minutes faster, call a Kakao Taxi to my home." I am therefore interested in developing agentic benchmarks tailored to the Korean context. I am specifically looking for undergraduate or master's students currently enrolled at universities in South Korea.
(2) AI Agents for Fusion Engineering: I recently became interested in a blog post released on Hugging Face and am looking for a student with a background in physics, or a dual major including physics, who would like to study this space together and aim to publish a research paper. Olympiad/competition-style problems such as AIME and IMO have already been conquered by frontier models, and I believe the next step for the AI community is to make meaningful contributions in science and engineering. Nuclear fusion is not only an exceptionally challenging domain but also one with the potential to benefit humanity, making it a particularly compelling research direction.
Education
Language Technologies Institute, Carnegie Mellon University (Sep. 2024 - Present)
Ph.D. in Computer Science (Advisors: Graham Neubig and Sean Welleck)
KAIST AI (Mar. 2023 - Aug. 2024)
M.S. in Artificial Intelligence (Advisor: Minjoon Seo)
Yonsei University (Mar. 2018 - Feb. 2023)
B.S. in Computer Science
Publications
Preprints
RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, Ho-Jin Choi
Preprint Under Review
SPICE: Self-Play in Corpus Environments Improves Reasoning
Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
Preprint Under Review
VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Abdul Waheed, Zhen Wu, Dareen Alharthi, Seungone Kim, Bhiksha Raj
Preprint Under Review
OptimalThinkingBench: Evaluating over and underthinking in LLMs
Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha
Preprint Under Review
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
Preprint Under Review
Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
Preprint Under Review
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
Seongyun Lee*, Seungone Kim*, Minju Seo, Yongrae Jo, Dongyoung Go, Hyeonbin Hwang, Jinho Park, Xiang Yue, Sean Welleck, Graham Neubig, Moontae Lee, Minjoon Seo
Preprint Under Review
Let's Predict Sentence by Sentence
Hyeonbin Hwang, Byeongguk Jeon, Seungone Kim, Jiyeon Kim, Hoyeon Chang, Sohee Yang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
Preprint Under Review
FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS
Chaeeun Kim, Seungone Kim
Preprint Under Review
2025
Reasoning Models Better Express Their Confidence
Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, Minjoon Seo
NeurIPS 2025
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo
NeurIPS 2025
Measuring Sycophancy of Language Models in Multi-turn Dialogues
Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi
EMNLP 2025
M-Prometheus: A Suite of Open Multilingual LLM Judges
Jose Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, Andre F.T. Martins
COLM 2025
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
NAACL 2025 (Best Paper Award)
Bridging the Data Provenance Gap Across Text, Speech, and Video
Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi LI, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester James Validad Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
ICLR 2025
Better Instruction-Following Through Minimum Bayes Risk
Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig
ICLR 2025 (Spotlight)
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester James Validad Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
NeurIPS 2024
(* indicates equal contribution)
Vitæ
Full CV in PDF.
- FAIR @ Meta (May 2026 - Dec. 2026), Research Intern (Mentor: Jason Weston)
  TBD
- FAIR @ Meta (May 2025 - Dec. 2025), Research Intern (Mentors: Ilia Kulikov, Jason Weston)
  Worked on developing a synthetic dataset that improves the reasoning capabilities of LMs.
- CMU LTI (Aug. 2024 - Present), Ph.D. in Computer Science (Advisors: Graham Neubig and Sean Welleck)
  Working on (V)LM Evaluation and Weak-to-Strong Generalization.
- AML Lab @ LG AI Research (Jan. 2024 - Jun. 2024), Research Intern (Mentor: Kyungjae Lee)
  Worked on building a comprehensive NLG benchmark that could mimic the fineness of human evaluation.
- Language Lab @ NAVER AI Lab (Mar. 2023 - Dec. 2023), Research Intern (Mentor: Jamin Shin)
  Worked on building open-source evaluator LMs & VLMs that could potentially replace GPT-4 and GPT-4V evaluation.
- KAIST AI (Mar. 2023 - Jul. 2024), M.S. in Artificial Intelligence (Advisor: Minjoon Seo)
  Worked on developing evaluator LMs & VLMs and Chain-of-Thought fine-tuning. Early graduation (3 semesters).
- LK Lab @ KAIST AI (Jul. 2022 - Feb. 2023), Research Intern (Mentor: Joel Jang)
  Worked on developing expert LMs that can generalize to novel tasks.
- Yonsei University (Mar. 2018 - Feb. 2023), B.S. in Computer Science
  Early graduation (7 semesters).
