Grounding has been a long-standing concept in natural language processing (NLP) and computational linguistics (CL). This tutorial provides a historical overview and introduces recent advances in learning language through grounding, with a particular emphasis on the latter. We will begin by tracing the history of grounding and presenting a unified perspective on the term. In Parts II to IV, we will delve into recent progress in learning lexical semantics, syntax, and complex meanings through various forms of grounding. We will conclude by discussing future directions and open challenges, particularly those related to the growing trend of large language models and scaling.
Materials (180 minutes)
Part I (15 minutes): Introduction to grounding. [Slides] [Video]
Presenter: Freda Shi
We will review the history of grounding and introduce a unified definition of the term. In this tutorial, grounding refers to processing primary data with supervision from another source, where the two data sources have positive mutual information. We will illustrate the definition through connections to existing work such as visual grounding, acoustic grounding, factual grounding, and cross-lingual grounding.
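For concreteness, the positive-mutual-information condition can be written out explicitly (our formalization; X denotes the primary data and Y the supervising source, notation that is ours rather than the tutorial's):

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} > 0

For instance, with X as captions and Y as the images they describe, I(X; Y) > 0 because captions are predictive of image content; two statistically independent sources would give I(X; Y) = 0 and hence provide no grounding signal.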
We refer the audience to NAACL 2024 Tutorial 6 on spatial and temporal grounding, ACL 2020 Tutorial 5 on building common ground through communication, and the AAAI 2013 keynote for early work on grounded language learning.
Part II (25 minutes): Learning lexicons through grounding. [Slides] [Video]
Presenter: Martin Ziqiao Ma
Word acquisition is a core challenge in both cognitive science and robotics. Recent advances in neural networks and multimodal machine learning have enabled efforts to ground the meanings of written and spoken words in visual signals. In this part, we will explore research on grounding noun and verb meanings through changes in the physical world. We will also briefly discuss extensions of lexicon grounding beyond the visual modality, as well as approaches to bootstrapping grounded word acquisition through meta-learning.
We will introduce the background in the first 10 minutes and focus on recent advances in the remaining time. Work on vision-language models, on learning lexical semantics through interaction, and on learning lexicons to compose sentence-level meanings will be deferred to Part IV.
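As a toy illustration of grounding word meanings in visual signals (a minimal contrastive word-image sketch in Python; the vocabulary, features, and dimensions are invented, and no system covered in the tutorial is implied):

import numpy as np

rng = np.random.default_rng(0)

# Toy setup: random feature vectors stand in for a visual encoder's output,
# with one image paired with each word.
vocab = ["dog", "ball", "run"]
dim = 8
word_emb = {w: rng.normal(size=dim) for w in vocab}  # learnable in practice
images = rng.normal(size=(len(vocab), dim))          # images[i] pairs with vocab[i]

def score(word_vec, img_vec):
    """Cosine similarity between a word embedding and an image feature."""
    return word_vec @ img_vec / (np.linalg.norm(word_vec) * np.linalg.norm(img_vec))

def contrastive_loss(word, pos_idx):
    """InfoNCE-style loss: the paired image should outscore the distractors."""
    sims = np.array([score(word_emb[word], img) for img in images])
    exp_sims = np.exp(sims)
    return -np.log(exp_sims[pos_idx] / exp_sims.sum())

for i, w in enumerate(vocab):
    print(w, contrastive_loss(w, i))

Minimizing such a loss over many (word, image) pairs pulls each word's embedding toward the visual features it co-occurs with, which is the basic mechanism behind contrastive grounding objectives.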
Part III (25 minutes): Learning syntax through grounding. [Slides] [Video]
Presenter: Freda Shi
Constituency parses of sentences can be learned by grounding text to visual signals. Follow-up work has demonstrated the effectiveness of such visually grounded systems in learning variants of constituency and dependency grammars. In another line of work, word-alignment-based cross-lingual transfer can also be viewed as an instantiation of learning syntax through cross-lingual grounding, where text in the target language(s) is grounded to existing knowledge in the source language(s).
A brief introduction to the relevant syntactic formalisms, including constituency, dependency, and combinatory categorial grammars, will be presented in the first 10 minutes of this part to help the audience follow the content. In the remaining time, we will focus on recent approaches to learning syntax through visual grounding and cross-lingual grounding. Efforts on jointly learning syntax and semantics will be covered in Part IV.
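To make the cross-lingual direction concrete, the sketch below shows bare-bones annotation projection in Python: dependency heads in a parsed source sentence are transferred to an unparsed target sentence through word alignments (the sentence pair, alignment, and tree are invented for illustration):

# Toy annotation projection: transfer dependency heads from a parsed source
# sentence to an unparsed target sentence via word alignments.

# Source sentence "she reads books"; each word maps to its head (None = root).
src_heads = {"she": "reads", "reads": None, "books": "reads"}

# Word alignment from source to target (e.g., produced by a bitext aligner).
align = {"she": "elle", "reads": "lit", "books": "livres"}

def project(src_heads, align):
    """Project each dependency edge (child -> head) through the alignment.
    Edges with unaligned endpoints are dropped, which is why projected
    treebanks are typically partial and noisy."""
    tgt_heads = {}
    for child, head in src_heads.items():
        if child not in align:
            continue
        if head is None:
            tgt_heads[align[child]] = None  # the root stays the root
        elif head in align:
            tgt_heads[align[child]] = align[head]
    return tgt_heads

print(project(src_heads, align))
# {'elle': 'lit', 'lit': None, 'livres': 'lit'}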
Part IV (100 minutes): Learning complex meanings (semantics and pragmatics) through grounding.
Part IV-1 (25 minutes): Learning concepts through grounding. [Slides] [Video]
Presenter: Jiayuan Mao
Grounded lexicon learning and grounded syntax learning come together to enable the formation of complex, compositional grounded concepts. Lexicon learning maps individual words to grounded perceptual or executable representations, while syntax learning governs how these word-level representations are composed into structured meanings. By integrating both, models can not only learn visual or perceptual concepts from language but also generalize to novel compositions, facilitating systematic and interpretable understanding of grounded semantics across diverse domains.
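As a schematic example of this integration (in the spirit of neuro-symbolic concept learners, though not the actual implementation of any cited system; the scene and lexicon below are invented):

# Toy compositional executor: word-level concepts become filters, and the
# parse determines how the filters compose over a symbolic scene.
scene = [
    {"shape": "cube",   "color": "red"},
    {"shape": "sphere", "color": "red"},
    {"shape": "cube",   "color": "blue"},
]

# Lexicon: each word is grounded as an executable predicate over objects.
lexicon = {
    "red":    lambda obj: obj["color"] == "red",
    "blue":   lambda obj: obj["color"] == "blue",
    "cube":   lambda obj: obj["shape"] == "cube",
    "sphere": lambda obj: obj["shape"] == "sphere",
}

def execute(objs, words):
    """Compose word-level filters in the order given by the parse."""
    for w in words:
        objs = [o for o in objs if lexicon[w](o)]
    return objs

# "red cube" parses to filter(filter(scene, cube), red): one object survives.
# The same lexicon generalizes to novel compositions like "blue sphere".
print(execute(scene, ["cube", "red"]))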
Part IV-2 (25 minutes): Grounding language to world representations: The case of space. [Slides] [Video]
Presenter: Parisa Kordjamshidi
We cover how spatial semantics are represented, the available datasets and annotations, and the connection between information extraction models, qualitative spatial reasoning, and end-to-end deep learning approaches. We review recent large language models for spatial language comprehension, their evaluation, and the key limitations and challenges in this area. We clarify the role of spatial language in downstream applications, highlighting tasks such as grounding language in the visual world for navigation, wayfinding agents, human-machine interaction, and situated dialogue systems.
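As a small illustration of how spatial semantics can be represented, the minimal data structure below encodes a trajector/indicator/landmark triple in the style of spatial role labeling (the field names are ours, not the annotation scheme's):

from dataclasses import dataclass

@dataclass
class SpatialRelation:
    """A minimal spatial-role triple: the trajector is the entity being
    located, the landmark is the reference entity, and the spatial
    indicator is the expression encoding the relation between them."""
    trajector: str
    indicator: str
    landmark: str

# "The book is on the table." yields one spatial relation triple.
rel = SpatialRelation(trajector="book", indicator="on", landmark="table")
print(rel)

Annotation schemes in this area layer further information on top of such triples, e.g., qualitative relation types that support downstream spatial reasoning.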
Part IV-3 (25 minutes): Scaling vision-language models with grounding. [Slides] [Video]
Presenter: Martin Ziqiao Ma
While modern vision-language models (VLMs) have made remarkable progress, achieving fine-grained grounding of linguistic units to perceptual referents remains an open challenge. We will review recent advances in mechanistically grounded VLMs, spanning both encoder-based and generative models. We highlight how these models offer more detailed perceptual understanding and greater interpretability, providing new insights into the mechanisms underlying grounded language acquisition.
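Mechanically, fine-grained grounding can be pictured as aligning token embeddings with image-region features. The sketch below computes a token-by-region similarity matrix over random stand-in features (purely illustrative; no particular VLM architecture is implied):

import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for encoder outputs: 4 text tokens and 3 image regions, dim 6.
tokens = rng.normal(size=(4, 6))    # token embeddings
regions = rng.normal(size=(3, 6))   # region features

# A token-by-region similarity matrix, normalized over regions, from which
# each token's most likely perceptual referent can be read off.
sim = tokens @ regions.T                                      # shape (4, 3)
probs = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # softmax rows
print(probs.argmax(axis=1))  # index of the region each token grounds to

Inspecting such alignment matrices is also one route to the interpretability benefits mentioned above: they expose which perceptual evidence each linguistic unit relies on.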
Part IV-4 (25 minutes): Learning pragmatics through grounding. [Slides] [Video]
Presenter: Joyce Chai
Grounded interaction provides a powerful source of supervision for language learning, connecting linguistic expressions directly to perception and action. Beyond mapping words to perceptual referents, successful communication requires models to interpret language in context — leveraging shared goals, conventions, and the visual and embodied environment. We discuss research on grounded settings and pragmatic modeling, analyzing how grounding in physical and social contexts shapes linguistic meaning, and how task goals, environmental structure, and communicative affordances enrich the process of language grounding.
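One widely used formalization of pragmatic modeling in grounded reference games is the Rational Speech Acts (RSA) framework. Below is a minimal RSA sketch over an invented two-referent, two-utterance game (uniform priors; the lexicon is made up):

import numpy as np

# Literal semantics: truth[r, u] says whether utterance u is true of referent r.
#                "glasses"  "hat"
truth = np.array([[1.0,      1.0],   # referent 0: wears glasses and a hat
                  [0.0,      1.0]])  # referent 1: wears a hat only

# Literal listener L0: P(referent | utterance), proportional to truth values.
L0 = truth.T / truth.T.sum(axis=1, keepdims=True)

# Pragmatic speaker S1: prefers utterances that make L0 pick the target.
alpha = 1.0  # rationality parameter
S1 = np.exp(alpha * np.log(L0.T + 1e-9))
S1 = S1 / S1.sum(axis=1, keepdims=True)

# Pragmatic listener L1: Bayesian inversion of S1. Hearing "hat" now favors
# referent 1, since a speaker meaning referent 0 would have said "glasses".
L1 = S1.T / S1.T.sum(axis=1, keepdims=True)
print(L1)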
Part V (15 minutes): Future directions and open problems. [Slides] [Video]
Presenter: Freda Shi
A key discussion for future directions centers on whether grounding should emerge naturally from scaling models or whether we should enforce grounded supervision to achieve more efficient learning. Additionally, the scope of grounding can be broadened beyond traditional modalities to incorporate touch, olfaction, non-human sensors, video and temporal data, 3D environments, proprioception, episodic experiences, and even forms of meta-cognition.
References (by topic)
Overview / Lexicon Learning / Syntax Learning / Semantics Learning / Pragmatics Learning / Crossmodal Grounding / Crosslingual Grounding / Epistemic Grounding / Interactive Grounding
Learning Language Structures through Grounding
Haoyue Freda Shi
PhD Thesis, Toyota Technological Institute at Chicago 2024
Paper
Thesis of Distinction
The Vector Grounding Problem
Dimitri Coelho Mollo, Raphaël Millière
arXiv preprint arXiv:2304.01481 2023
Paper
Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches
Daniel Fried, Nicholas Tomlin, Jennifer Hu, Roma Patel, Aida Nematzadeh
Findings of EMNLP 2023
Paper
Grounding 'Grounding' in NLP
Khyathi Raghavi Chandu, Yonatan Bisk, Alan W. Black
Findings of ACL 2021
Paper
Experience Grounds Language
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian
EMNLP 2020
Paper
Language to Action: Towards Interactive Task Learning with Physical Agents
Joyce Y. Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, Guangyue Xu
IJCAI 2018
Paper
Invited Paper
Grounding in Communication
Herbert H. Clark, Susan E. Brennan
Perspectives on Socially Shared Cognition 1991
Paper
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
Ziqiao Ma, Jiayi Pan, Joyce Chai
ACL 2023
Paper
Outstanding Paper Award
Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
Interspeech 2022
Paper
Oral Presentation
Cross-lingual Entity Alignment with Incidental Supervision
Muhao Chen, Weijia Shi, Ben Zhou, Dan Roth
EACL 2021
Paper
Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment
Haoyue Shi, Luke Zettlemoyer, Sida I. Wang
ACL 2021
Paper
Learning Morphosyntactic Analyzers from the Bible via Iterative Annotation Projection across 26 Languages
Garrett Nicolai, David Yarowsky
ACL 2019
Paper
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu
ICLR 2019
Paper
Oral Presentation
Bilingual Lexicon Induction through Unsupervised Machine Translation
Mikel Artetxe, Gorka Labaka, Eneko Agirre
ACL 2019
Paper
Acoustically Grounded Word Embeddings for Improved Acoustics-to-Word Speech Recognition
Shane Settle, Kartik Audhkhasi, Karen Livescu, Michael Picheny
ICASSP 2019
Paper
Verb Physics: Relative Physical Knowledge of Actions and Objects
Maxwell Forbes, Yejin Choi
ACL 2017
Paper
Interactive Learning of Grounded Verb Semantics Towards Human-Robot Communication
Lanbo She, Joyce Chai
ACL 2017
Paper
Physical Causality of Action Verbs in Grounded Language Understanding
Qiaozi Gao, Malcolm Doering, Shaohua Yang, Joyce Chai
ACL 2016
Paper
Incremental Acquisition of Verb Hypothesis Space Towards Physical World Interaction
Lanbo She, Joyce Chai
ACL 2016
Paper
Reframing Linguistic Bootstrapping as Joint Inference Using Visually-Grounded Grammar Induction Models
Eva Portelance, Siva Reddy, Timothy J O'Donnell
arXiv preprint arXiv:2406.11977 2024
Paper
Audio-Visual Neural Syntax Acquisition
Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
ASRU 2023
Paper
Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing
Freda Shi, Kevin Gimpel, Karen Livescu
ACL 2022
Paper
PPT: Parsimonious Parser Transfer for Unsupervised Cross-Lingual Adaptation
Kemal Kurniawan, Lea Frermann, Philip Schulz, Trevor Cohn
EACL 2021
Paper
"Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks
Mohammad Sadegh Rasooli, Chris Callison-Burch, Derry Tanti Wijaya
EMNLP 2021
Paper
Dependency Induction Through the Lens of Visual Perception
Ruisi Su, Shruti Rijhwani, Hao Zhu, Junxian He, Xinyu Wang, Yonatan Bisk, Graham Neubig
CoNLL 2021
Paper
Video-Aided Unsupervised Grammar Induction
Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu, Jiebo Luo
NAACL-HLT 2021
Paper
Best Paper Award
What is Learned in Visually Grounded Neural Syntax Acquisition
Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush, Yoav Artzi
ACL 2020
Paper
Visually Grounded Neural Syntax Acquisition
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, Karen Livescu
ACL 2019
Paper
Best Paper Nominee
Unsupervised Dependency Parsing with Transferring Distribution via Parallel Guidance and Entropy Regularization
Xuezhe Ma, Fei Xia
ACL 2014
Paper
Structured Machine Learning for Mapping Natural Language to Spatial Ontologies
Parisa Kordjamshidi
PhD Thesis, KU Leuven 2013
Paper
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma
ICLR 2025
Paper
Oral Presentation
Grammar-Based Grounded Lexicon Learning
Jiayuan Mao, Freda Shi, Jiajun Wu, Roger P. Levy, Joshua B. Tenenbaum
NeurIPS 2021
Paper
SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning
Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, Parisa Kordjamshidi
NAACL-HLT 2021
Paper
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Hao Tan, Mohit Bansal
EMNLP 2020
Paper
Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks
Brenden Lake, Marco Baroni
ICML 2018
Paper
Spatial Role Labeling Annotation Scheme
Parisa Kordjamshidi, Martijn van Otterlo, Marie-Francine Moens
Handbook of linguistic annotation 2017
Paper
A Corpus of Natural Language for Visual Reasoning
Alane Suhr, Mike Lewis, James Yeh, Yoav Artzi
ACL 2017
Paper
Best Resource Paper Award
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel
arXiv preprint arXiv:1411.2539 2014
Paper
Pragmatic Inference with a CLIP Listener for Contrastive Captioning
Jiefu Ou, Benno Krojer, Daniel Fried
Findings of ACL 2023
Paper
Computational Language Acquisition with Theory of Mind
Andy Liu, Hao Zhu, Emmy Liu, Yonatan Bisk, Graham Neubig
ICLR 2023
Paper
Language Learning from Communicative Goals and Linguistic Input
Hao Zhu, Yonatan Bisk, Graham Neubig
CogSci 2022
Paper
Interactive Classification by Asking Informative Questions
Lili Yu, Howard Chen, Sida I. Wang, Tao Lei, Yoav Artzi
ACL 2020
Paper
A Knowledge-Grounded Neural Conversation Model
Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, Michel Galley
AAAI 2018
Paper
Learning Language Games through Interaction
Sida I. Wang, Percy Liang, Christopher D. Manning
ACL 2016
Paper
BibTeX
@inproceedings{naacl2025grounding,
  author    = {Shi, Freda and Ma, Ziqiao and Mao, Jiayuan and Kordjamshidi, Parisa and Chai, Joyce},
  title     = {Learning Language through Grounding},
  booktitle = {Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts)},
  year      = {2025},
}