RESEARCH
At MIT, I am fortunate to be advised by David Sontag, and I work closely with Yoon Kim. During my PhD, I have interned at Meta FAIR, Ai2, and Microsoft Research. Previously, I was a predoctoral researcher at Ai2 and Harvard University, and I obtained my M.S. from Brown University.
My long-term research goal is to build AI that enables long-term collaboration with people to solve challenging, knowledge-intensive problems. To that end, I focus on three directions:
- Understanding human-LLM collaboration: How can we evaluate human-LLM collaboration scalably and quantitatively? What metrics and objectives should we optimize for?
- Improving the underlying LLM: How can we train LLMs for effective collaboration? What algorithmic innovations are needed?
- Deploying collaborative AI systems in practice: What interactions make collaboration effective? How can we efficiently collect user feedback and improve models?
LaText: Interleave Latent and Text Chain-of-Thought
We train an LM to interleave latent and text reasoning while keeping critical tokens, such as math, in context. It achieves performance close to text-only CoT with 50% of the inference compute.
LATEST
Awards at NeurIPS 2025 Workshops
Our recent papers are recognized at several NeurIPS 2025 workshops
Talk at the Scale ML Seminar Series @ MIT
LaText: Interleave Latent and Text Chain-of-Thought for efficient reasoning
Workshop Organizing
Co-organizing the LM4Sci Workshop at COLM 2025
Talk at Stanford HCI Group Lunch Seminar
Rethinking the Design and Evaluation of Human and LLM Collaboration
News
Co-LLM and SymGen are covered by MIT News
Organizing a New Seminar Series at MIT
MIT NLP Meetings Seminar Series
Talk at University of Washington
Co-LLM: Training LLMs to Decode Collaboratively
News
Student Spotlight interview by CSAIL Alliances
Talk at Google Research
Developing User-Friendly Language Model Systems
RSAP panel at the American Literature Association conference
LayoutParser and Historical Document Image Processing
Talk at Ranjay Krishna’s Group @ UW
Developing User-Friendly Language Model Systems
Talk at MIT Sloan AI/ML Conference
Towards Verifiable Text Generation for Developing Trustworthy LLMs
Discussion on Image Extraction, hosted by Thomas Smits at University of Amsterdam
LayoutParser and Historical Document Image Processing
Instructor for an MIT IAP Class
Visual Design in Scholarly Communication
Blog Post
Introducing Chapyter
Talk at Nigam Shah’s Group Meeting @ Stanford
Redesigning Clinical Documentation
Talk at Natural Legal Language Processing workshop @ EMNLP 2022
Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities
Guest Lecture in CSE 599D @ UW, hosted by Prof. Jeff Heer
Visual Content Extraction for Scientific Documents
Our recent papers were recognized at NeurIPS 2025 workshops:
- The Collaborative Effort Scaling framework was recognized as the best paper at the NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models.
- The Hybrid CoT (LaText) paper was recognized as a spotlight paper at the NeurIPS 2025 Workshop on Efficient Reasoning.
I gave a talk on our recent work on LaText, a novel approach to interleave latent and text chain-of-thought for efficient reasoning.
I’m co-organizing the Workshop on Large Language Modeling for Scientific Discovery (LM4Sci) at COLM 2025 in Montreal.
I shared an initial version of our collaborative effort scaling paper, and discussed the HCI aspects of our previous work on Symbolic Generation.
Check the MIT News articles covering our recent projects:
Pratyusha Sharma and I started a new NLP seminar series at MIT. It features NLP researchers working on a diverse set of topics, including LLMs, interpretability, human-AI collaboration, and more.
This talk is hosted by Luke Zettlemoyer’s group. We go through the details of our ACL paper Co-LLM. You can find the slides here.
In a recent interview with CSAIL Alliances, I shared our recent work on Co-LLM and SymGen and described my vision for building better language models and AI systems with a human-centered perspective.
This talk is hosted by Chiyuan Zhang and Yangsibo Huang. We focused on the Co-LLM project and took a deep dive into the methodology and experiments. Slides available upon request.
We reviewed the LayoutParser design and functionality, as well as approaches to tackle historical image processing and extraction in 2024. Slides available upon request.
We start with an analogy between web interface development and LLM development: an LLM produces raw text (like the HTML of a web page), so what are the CSS and JavaScript in the context of LLMs? We then talk about two recent projects, Co-LLM and SymGen, drawing connections between our methods and web technologies like CSS and API calls. Slides available upon request.
In this short talk, we cover our latest research on SymGen, a novel approach to generating verifiable text for developing trustworthy LLMs. Slides available upon request.
We reviewed the LayoutParser design and functionality, as well as approaches to tackle historical image processing and extraction in 2024. Slides available upon request.
A series of lectures over the MIT IAP period, co-taught with Lucas Torroba Hennigen, focused on visual design in scholarly communication. Visual design is a crucial element in many forms of scientific communication, from papers and slides to videos. While researchers increasingly need to produce high-quality visuals, doing so remains a time-consuming and sometimes very challenging task. Despite the significant role visuals play, there is a noticeable lack of formal education dedicated to this aspect. This subject aims to cover several key topics in visual design for scholarly communication.
Chapyter is a JupyterLab extension that seamlessly connects GPT-4 to your coding environment. It features a code interpreter that can translate your natural language description into Python code and automatically execute it.
We took inspiration from our position paper on AI-supported expository writing and discussed how to apply these ideas to clinical documentation. This is a joint presentation with Monica Agrawal and Hunter Lang.
A presentation of our work on the Multi-LexSum dataset, containing real-world summaries of civil rights lawsuits at multiple granularities.
We reviewed the general problem of visual content extraction in scientific documents, as well as the current state-of-the-art methods and challenges. Slides available upon request.
PUBLICATIONS
2025
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research (Featured)
Rulin Shao†, Akari Asai†, Shannon Zejiang Shen†, Hamish Ivison†, Varsha Kishore, Jingming Zhuo†, Xinran Zhao, Molly Park, Samuel Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh
LaText: Interleave Latent and Text Chain-of-Thought for efficient reasoning (Featured)
Shannon Zejiang Shen, Rulin Shao, Chenyu Wang, Songlin Yang, Vincent-Pierre Berges, Gargi Ghosh, Pang Wei Koh, Luke Zettlemoyer, Yoon Kim, Jason E Weston, David Sontag, Wen-tau Yih
Completion ≠ Collaboration: Scaling Collaborative Effort with Agents (Featured)
Shannon Zejiang Shen†, Valerie Chen†, Ken Gu, Alexis Ross, Zixian Ma, Alex Gu, Chenglei Si, Jillian Ross, Jocelyn J Shen, Wayne Chi, Andi Peng, Ameet Talwalkar, Tongshuang Wu†, David Sontag†
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Chenyu Wang†, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu†
OLMo 3
With the Ai2 OLMo Team
Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shane Arora, Shashank Gupta, Taira Anderson, Teng Xiao, Tyler Murray, Tyler Romero, Victoria Graf, Akari Asai, Akshita Bhagia, Alex Wettig, Alisa Liu, Aman Rangapur, Chloe Anastasiades, Costa Huang, Dustin Schwenk, Harsh Trivedi, Ian Magnusson, Jaron Lochner, Jiacheng Liu, Lj Miranda, Maarten Sap, Malia Morgan, Michael Schmitz, Michal Guerquin, Michael Wilson, Regan Huff, Ronan Le Bras, Rui Xin, Rulin Shao, Sam Skjonsberg, Shannon Zejiang Shen, Shuyue Stella Li, Tucker Wilde, Valentina Pyatkin, Will Merrill, Yapei Chang, Yuling Gu, Zhiyuan Zeng, Ashish Sabharwal, Luke Zettlemoyer, Pang Wei Koh, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
Retrieval-augmented systems can be dangerous medical communicators
Lionel Wong, Ayman Ali, Raymond Xiong, Shannon Zejiang Shen, Yoon Kim, Monica Agrawal
When One LLM Drools, Multi-LLM Collaboration Rules
Shangbin Feng, Wenxuan Ding, Alisa Liu, Zifeng Wang, Weijia Shi, Yike Wang, Shannon Zejiang Shen, Xiaochuang Han, Hunter Lang, Chen-Yu Lee, Tomas Pfister, Yejin Choi, Yulia Tsvetkov
2024
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, and Arman Cohan
EMNLP 2025
Machine learning to predict notes for chart review in the oncology setting: a proof of concept strategy for improving clinician note-writing
Sharon Jiang, Barbara Lam, Monica Agrawal, Shannon Zejiang Shen, Nicholas Kurtzman, Steven Horng, David Karger, and David Sontag
A Design Space for Intelligent and Interactive Writing Assistants
Mina Lee, Katy Ilonka Gero, John Joon Young Chung, Simon Buckingham Shum, Vipul Raheja, Hua Shen, Subhashini Venugopalan, Thiemo Wambsganss, David Zhou, Emad A. Alghamdi, Tal August, Avinash Bhat, Madiha Zahrah Choksi, Senjuti Dutta, Jin L.C. Guo, Md Naimul Hoque, Yewon Kim, Simon Knight, Seyed Parsa Neshaei, Antonette Shibani, Disha Shrivastava, Lila Shroff, Agnia Sergeyuk, Jessi Stark, Sarah Sterman, Sitong Wang, Antoine Bosselut, Daniel Buschek, Joseph Chee Chang, Sherol Chen, Max Kreminski, Joonsuk Park, Roy Pea, Eugenia Ha Rim Rho, Shannon Zejiang Shen, and Pao Siangliulue
A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
Stefan Hegselmann, Shannon Zejiang Shen, Florian Gierse, Monica Agrawal, David Sontag, and Xiaoyi Jiang
Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E Peters, Abhilasha Ravichander, Kyle Richardson, Shannon Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo
2023
PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents
Kyle Lo, Shannon Zejiang Shen, Benjamin Newman, Joseph Chang, Russell Authur, Erin Bransom, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Bailey Kuehl, Amanpreet Singh, Chris Wilhelm, Angele Zamarron, Marti A. Hearst, Daniel Weld, Doug Downey, and Luca Soldaini
American Stories: A Large-Scale Structured Text Dataset of Historical US Newspapers
Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Shannon Zejiang Shen, Luca D’Amico-Wong, Quan Le, Pablo Querubin, and Leander Heldring
Conceptualizing Machine Learning for Dynamic Information Retrieval of Electronic Health Record Notes
Sharon Jiang, Shannon Zejiang Shen, Monica Agrawal, Barbara Lam, Nicholas Kurtzman, Steven Horng, David Karger, and David Sontag
Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents
Catherine Chen, Shannon Zejiang Shen, Dan Klein, Gabriel Stanovsky, Doug Downey, and Kyle Lo
Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks
Shannon Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, and David Sontag
The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces
With the Semantic Scholar Team
Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim, Rodney Kinney, Aniket Kittur, Hyeonsu Kang, Egor Klevak, Bailey Kuehl, Michael Langan, Matt Latzke, Jaron Lochner, Kelsey MacMillan, Eric Marsh, Tyler Murray, Aakanksha Naik, Ngoc-Uyen Nguyen, Srishti Palani, Soya Park, Caroline Paulic, Napol Rachatasumrit, Smita Rao, Paul Sayre, Shannon Zejiang Shen, Pao Siangliulue, Luca Soldaini, Huy Tran, Madeleine van Zuylen, Lucy Lu Wang, Christopher Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Marti A Hearst, and Daniel S Weld
Communications of the ACM
The semantic scholar open data platform
With the Semantic Scholar Team
Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler Murray, Chris Newell, Smita Rao, Shaurya Rohatgi, Paul Sayre, Shannon Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, Amber Tanaka, Alex D Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, and Daniel S Weld
2022
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search
Daniel King†, Shannon Zejiang Shen†, Nishant Subramani, Daniel S. Weld, Iz Beltagy, and Doug Downey
OLALA: Object-Level Active Learning for Efficient Document Layout Annotation
Shannon Zejiang Shen, Jian Zhao, Melissa Dell, Yaoliang Yu, and Weining Li
2021
VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups
Shannon Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld, and Doug Downey
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
Shannon Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li
2020
A Large Dataset of Historical Japanese Documents with Complex Layouts
Shannon Zejiang Shen, Kaixuan Zhang, and Melissa Dell
Generating Object Stamps
Youssef Alami Mejjati, Shannon Zejiang Shen, Michael Snower, Aaron Gokaslan, Oliver Wang, James Tompkin, and Kwang In Kim
2019
Information Extraction from Text Regions with Complex Tabular Structure
Kaixuan Zhang, Shannon Zejiang Shen, Jie Zhou, and Melissa Dell
Deep Learning based Framework for Automatic Damage Detection in Aircraft Engine Borescope Inspection
Shannon Zejiang Shen, Xili Wan, Feng Ye, Xinjie Guan, and Shuwen Liu
2019 International Conference on Computing, Networking and Communications (ICNC)
PROJECTS
Besides research, I've worked on various open source projects and here are a few of them:
Productivity & Utils
A JupyterLab extension that seamlessly connects GPT-4 to your coding environment. It features a code interpreter that can translate your natural language description into Python code and automatically execute it.
A Python package that seamlessly connects Notion databases and pandas DataFrames. It makes it easy to upload a pandas DataFrame to a Notion database and to download a Notion database into a DataFrame.
An Obsidian plugin that streamlines bibliography management.
Websites & Design
A platform for current and past grad students to share the statements of purpose from their applications to help future applicants. It is a full-fledged website built on Notion, and we developed an automated submission system that connects the Notion database with a Google Form (code available here).
The layout-parser project website is built with Jekyll and Bulma. Most interestingly, the layout-parser platform subpage is rendered by fetching, live, the model metadata stored in GitHub issues.
Avalanche: a personal website theme for academics
Also built with Jekyll and Bulma, the Avalanche theme can be used out of the box to create an academic site that beautifully displays a personal research description, publications, and recent news.
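
For readers curious about the idea of storing model metadata in GitHub issues, mentioned above for the layout-parser platform page, here is a minimal sketch in Python. It is not the site's actual code. It assumes each model is described by one open issue carrying a hypothetical `model` label, and it uses only the public GitHub REST API endpoint for listing repository issues. The helper names `fetch_issues` and `issues_to_models`, and the fields of the output record, are illustrative choices.

```python
import json
from urllib.request import Request, urlopen

# Public GitHub REST API endpoint for listing a repository's issues.
ISSUES_URL = "https://api.github.com/repos/{owner}/{repo}/issues?state=open&per_page=100"


def fetch_issues(owner, repo):
    """Fetch the first page of open issues for a repository (needs network access)."""
    request = Request(
        ISSUES_URL.format(owner=owner, repo=repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request) as response:
        return json.load(response)


def issues_to_models(issues, label="model"):
    """Turn issue payloads into metadata records (hypothetical record layout).

    Only issues carrying `label` are kept. Pull requests are skipped because
    the issues endpoint returns them too, marked with a `pull_request` key.
    """
    models = []
    for issue in issues:
        if "pull_request" in issue:
            continue
        label_names = {item["name"] for item in issue.get("labels", [])}
        if label not in label_names:
            continue
        models.append(
            {
                "name": issue["title"],
                "description": (issue.get("body") or "").strip(),
                "url": issue["html_url"],
            }
        )
    return models
```

Calling `issues_to_models(fetch_issues(owner, repo))` with a repository of your choice would give a list of records that a page template could render. A site that fetches live would more likely do the equivalent in the browser, so treat this as an illustration of the data flow only.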