Daniel Hershcovich - Home Page
Tenure-Track Assistant Professor at CoAStaL, Natural Language Processing section, Department of Computer Science, University of Copenhagen, Denmark.
Research interests:
- Adapting and generalizing language models and data across cultures and languages.
- Incorporating explicit representations of human values and knowledge into language models and their analysis.
- Developing and evaluating language models with real-world impact in multicultural domains, such as literature, law and food.
News:
- I was awarded a Villum Experiment grant for my project Aligning Multi-Agent Interactions for Sustainable Food Behaviour (AMAI).
- My project Cultural Reasoning for Responsible Language Model Development (CuRe) with Jens Bjerring-Hansen received funding from Independent Research Fund Denmark Thematic Research on Artificial Intelligence! I will hire a PhD student for it in early 2027.
Projects:
- Detecting Noise in Legal RAGs (DETECT). Independent Research Fund Denmark. Co-PI with Henrik Palmer Olsen (2026-2028).
- Cultural Reasoning for Responsible Language Model Development (CuRe). Independent Research Fund Denmark Thematic Research on Artificial Intelligence. PI with Jens Bjerring-Hansen (2026-2030).
- Aligning Multi-Agent Interactions for Sustainable Food Behaviour (AMAI). Villum Experiment. PI (2026-2028).
- Automated Legal Information and Knowledge Extraction (ALIKE). Independent Research Fund Denmark. Co-PI with Henrik Palmer Olsen (2025-2027).
- Explainable Hybrid-AI for Computational Law and Accurate Legal Chatbots (XHAILe). Innovation Fund Denmark Grand Solutions. Co-PI with Thomas Hildebrandt (2025-2028).
Publications (Google Scholar, Semantic Scholar)
Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning.
Haijiang Liu, Qiyuan Li, Chao Gao, Yong Cao, Xiangyu Xu, Xun Wu, Daniel Hershcovich and Jinguang Gu. EMNLP 2025. Introducing MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation through life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals.
Guimin Hu, Daniel Hershcovich and Hasti Seifi. Findings of EMNLP 2025.
Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptions of Wine Reviews.
Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang and Daniel Hershcovich. Findings of EMNLP 2025. Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria -- Cultural Proximity, Cultural Neutrality, and Cultural Genuineness -- to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users.
Antonia Karamolegkou, Malvina Nikandrou, Georgios Pantazopoulos, Danae Sanchez Villegas, Phillip Rust, Ruchira Dhar, Daniel Hershcovich and Anders Søgaard. ACL 2025. SAC Highlight. This paper explores the effectiveness of Multimodal Large Language models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.
Towards realistic evaluation of cultural value alignment in large language models: Diversity enhancement for survey response simulation.
Haijiang Liu, Yong Cao, Xun Wu, Chen Qiu, Jinguang Gu, Maofu Liu and Daniel Hershcovich. Information Processing & Management, Volume 62, Issue 4, July 2025. Assessing the alignment of Large Language Models (LLMs) with human values has been a high priority in natural language processing. These models, praised as reservoirs of collective human knowledge, provoke an important question: Do they genuinely reflect the value preferences embraced by different cultures? We measure value alignment by simulating sociological surveys and comparing the distribution of preferences from model responses to human references. We introduce a diversity-enhancement framework featuring a novel memory simulation mechanism, which enables the generation of model preference distributions and captures the diversity and uncertainty inherent in LLM behaviors through realistic survey experiments. To better understand the causes of misalignment, we have developed comprehensive evaluation metrics. Our analysis of multilingual survey data illustrates that our framework improves the reliability of cultural value alignment assessments and captures the complexity of model responses across cultural contexts. Among the eleven models evaluated, the Mistral and Llama-3 series show superior alignment with cultural values, with Mistral-series models notably excelling in comprehending these values in both U.S. and Chinese contexts.
Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations.
Yong Cao, Haijiang Liu, Arnav Arora, Isabelle Augenstein, Paul Röttger and Daniel Hershcovich. NAACL 2025. Large-scale surveys are essential tools for informing social science research and policy, but running surveys is costly and time-intensive. If we could accurately simulate group-level survey results, this would therefore be very valuable to social science research. Prior work has explored the use of large language models (LLMs) for simulating human behaviors, mostly through prompting. In this paper, we are the first to specialize LLMs for the task of simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions for a given question. Then, we show that this method substantially outperforms other methods and zero-shot classifiers, even on unseen questions, countries, and a completely unseen survey. While even our best models struggle with the task, especially on unseen questions, our results demonstrate the benefits of specialization for simulation, which may accelerate progress towards sufficiently accurate simulation in the future.
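A minimal sketch of the first-token idea described above, assuming a HuggingFace-style causal LM; the prompt, answer options, and target distribution below are invented for illustration, and the paper's exact objective and training setup may differ:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fine-tune so the probability mass the LM puts on each answer option's first
# token matches a target (human) response distribution for a survey question.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "How important is family in your life? Answer (A-D):"  # hypothetical
options = [" A", " B", " C", " D"]
target = torch.tensor([0.62, 0.25, 0.09, 0.04])  # invented human answer shares

option_ids = [tokenizer.encode(o)[0] for o in options]  # first token per option

logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]
log_probs = F.log_softmax(logits, dim=-1)[option_ids]
log_probs = log_probs - torch.logsumexp(log_probs, dim=0)  # renormalize over options

loss = F.kl_div(log_probs, target, reduction="sum")  # KL(target || model)
loss.backward()
optimizer.step()
```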
Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge.
Li Zhou, Taelin Karidi, Wanlong Liu, Nicolas Garneau, Yong Cao, Wenyu Chen, Haizhou Li and Daniel Hershcovich. NAACL 2025. Recent studies have highlighted the presence of cultural biases in Large Language Models (LLMs), yet often lack a robust methodology to dissect these phenomena comprehensively. Our work aims to bridge this gap by delving into the Food domain, a universally relevant yet culturally diverse aspect of human life. We introduce FmLAMA, a multilingual dataset centered on food-related cultural facts and variations in food practices. We analyze LLMs across various architectures and configurations, evaluating their performance in both monolingual and multilingual settings. By leveraging templates in six different languages, we investigate how LLMs interact with language-specific and cultural knowledge. Our findings reveal that (1) LLMs demonstrate a pronounced bias towards food knowledge prevalent in the United States; (2) Incorporating relevant cultural context significantly improves LLMs' ability to access cultural knowledge; (3) The efficacy of LLMs in capturing cultural nuances is highly dependent on the interplay between the probing language, the specific model architecture, and the cultural context in question. This research underscores the complexity of integrating cultural understanding into LLMs and emphasizes the importance of culturally diverse datasets to mitigate biases and enhance model performance across different cultural domains.
Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models.
Srishti Yadav, Zhi Zhang, Daniel Hershcovich and Ekaterina Shutova. Findings of NAACL 2025. Investigating value alignment in Large Language Models (LLMs) based on cultural context has become a critical area of research. However, similar biases have not been extensively explored in large vision-language models (VLMs). As the scale of multimodal models continues to grow, it becomes increasingly important to assess whether images can serve as reliable proxies for culture and how these values are embedded through the integration of both visual and textual data. In this paper, we conduct a thorough evaluation of multimodal models at different scales, focusing on their alignment with cultural values. Our findings reveal that, much like LLMs, VLMs exhibit sensitivity to cultural values, but their performance in aligning with these values is highly context-dependent. While VLMs show potential in improving value understanding through the use of images, this alignment varies significantly across contexts, highlighting the complexities and underexplored challenges in the alignment of multimodal models.
Unhappy Texts? A Gendered and Computational Rereading of The Modern Breakthrough.
Kirstine Nielsen Degn, Jens Bjerring-Hansen, Ali Al-Laith and Daniel Hershcovich. Scandinavian Studies, 97(2), 2025. Our article discusses the hypothesis that the texts of women writers of the Modern Breakthrough in Scandinavia were particularly unhappy. We examine this common claim, along with some of the quantitative and qualitative issues it raises. Does this correlation of gender and affect hold true for the entire spectrum of women’s literary production from the era? What about male authorship and its affectivity? And what does ‘unhappy’ even mean? We confront this hypothesis and the associated questions through two interventions. The first is a quantification made possible by new digital archives and methodologies, which allow for a radical upscaling of the investigation’s empirical foundation. The second is to approach the nineteenth-century texts with a framework from the fields of gender studies and affect theory. Our findings are the following: (1) The thesis of the unhappy text appears partially true, but importantly, women are even more overrepresented among the positive texts. (2) The affect category of neutrality is more significant. Neutrality turns out to be a male, canonical enterprise, while low neutrality is primarily associated with forgotten or neglected women authors. The most crucial gender bias in the affective economy of the texts is the lack of neutrality in literature by women. (3) This and other biases point to clear intersectional dynamics between the author’s gender, the affective qualities and quantities of the texts and their social status.
Annotating and Classifying Direct Speech in Historical Danish and Norwegian Literary Texts.
Ali Al-Laith, Alexander Conroy, Kirstine Nielsen Degn, Jens Bjerring-Hansen and Daniel Hershcovich. NoDaLiDa/Baltic-HLT 2025. Analyzing direct speech in historical literary texts provides insights into character dynamics, narrative style, and discourse patterns. In late 19th-century Danish and Norwegian fiction, direct speech reflects characters' social and geographical backgrounds. However, inconsistent typographic conventions in Scandinavian literature complicate computational methods for distinguishing direct speech from other narrative elements. To address this, we introduce an annotated dataset from the MeMo corpus, capturing speech markers and tags in Danish and Norwegian novels. We evaluate pre-trained language models for classifying direct speech, with results showing that a Danish Foundation Model (DFM), trained on extensive Danish data, has the highest performance. Finally, we conduct a classifier-assisted quantitative corpus analysis and find a downward trend in the prevalence of speech over time.
Dying or Departing? Euphemism Detection for Death Discourse in Historical Texts.
Ali Al-Laith, Alexander Conroy, Jens Bjerring-Hansen, Bolette Pedersen, Carsten Levisen and Daniel Hershcovich. COLING 2025. Euphemisms are a linguistic device used to soften discussions of sensitive or uncomfortable topics, with death being a prominent example. In this paper, we present a study on the detection of death-related euphemisms in historical literary texts from a corpus containing Danish and Norwegian novels from the late 19th century. We introduce an annotated dataset of euphemistic and literal references to death, including both common and rare euphemisms, ranging from well-established terms to more culturally nuanced expressions. We evaluate the performances of state-of-the-art pre-trained language models fine-tuned for euphemism detection. Our findings show that fixed, literal expressions of death became less frequent over time, while metaphorical euphemisms grew in prevalence. Additionally, euphemistic language was more common in historical novels, whereas contemporary novels tended to refer to death more literally, reflecting the rise of secularism. These results shed light on the shifting discourse on death during a period when the concept of death as final became prominent.
Literary Time Travel: Distinguishing Past and Contemporary Worlds in Danish and Norwegian Fiction.
Jens Bjerring-Hansen, Ali Al-Laith, Daniel Hershcovich, Alexander Conroy and Sebastian Ørtoft Rasmussen. CHR 2024. The classification of historical and contemporary novels is a nuanced task that has traditionally relied on expert literary analysis. This paper introduces a novel dataset comprising Danish and Norwegian novels from the last 30 years of the 19th century, annotated by literary scholars to distinguish between historical and contemporary works. While this manual classification is time-consuming and subjective, our approach leverages pre-trained language models to streamline and potentially standardize this process. We evaluate their effectiveness in automating this classification by examining their performance on titles and the first few sentences of each novel. After fine-tuning, the models show good performance but fail to fully capture the nuanced understanding exhibited by literary scholars. This research underscores the potential and limitations of NLP in literary genre classification and suggests avenues for further improvement, such as incorporating more sophisticated model architectures or hybrid methods that blend machine learning with expert knowledge. Our findings contribute to the broader field of computational humanities by highlighting the challenges and opportunities in automating literary analysis.
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture.
Wenyan Li, Crystina Zhang, Jiaang Li, Qiwei Peng, Raphael Tang, Li Zhou, Weijia Zhang, Guimin Hu, Yifei Yuan, Anders Søgaard, Daniel Hershcovich and Desmond Elliott. EMNLP 2024. Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks where models need to answer questions based on multiple images, a single image and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.
Noise, Novels, Numbers. A Framework for Detecting and Categorizing Noise in Danish and Norwegian Literature.
Ali Al-Laith, Daniel Hershcovich, Jens Bjerring-Hansen, Jakob Ingemann Parby, Alexander Conroy and Timothy R Tangherlini. EMNLP 2024. We present a framework for detecting and categorizing noise in literary texts, demonstrated through its application to Danish and Norwegian literature from the late 19th century. Noise, understood as “aberrant sonic behaviour,” is not only an auditory phenomenon but also a cultural construct tied to the processes of civilization and urbanization. We begin by utilizing topic modeling techniques to identify noise-related documents, followed by fine-tuning BERT-based language models trained on Danish and Norwegian texts to analyze a corpus of over 800 novels. We identify and track the prevalence of noise in these texts, offering insights into the literary perceptions of noise during the Scandinavian “Modern Breakthrough” period (1870-1899). Our contributions include the development of a comprehensive dataset annotated for noise-related segments and their categorization into human-made, non-human-made and musical noises. This study illustrates the framework’s potential for enhancing the understanding of the relationship between noise and its literary representations, providing a deeper appreciation of the auditory elements in literary works, including as sources for cultural history.
UniMEEC: Towards Unified Multimodal Emotion Recognition and Emotion Cause.
Guimin Hu, Zhihong Zhu, Daniel Hershcovich, Lijie Hu, Hasti Seifi and Jiayuan Xie. Findings of EMNLP 2024. Multimodal emotion recognition in conversation (MERC) and multimodal emotion-cause pair extraction (MECPE) have recently garnered significant attention. Emotions are the expression of affect or feelings; responses to specific events, or situations, are known as emotion causes. Both collectively explain the causality between human emotion and intents. However, existing works treat emotion recognition and emotion cause extraction as two individual problems, ignoring their natural causality. In this paper, we propose a Unified Multimodal Emotion recognition and Emotion-Cause analysis framework (UniMEEC) to explore the causality between emotion and emotion cause. Concretely, UniMEEC reformulates the MERC and MECPE tasks as mask prediction problems and unifies them with a causal prompt template. To differentiate the modal effects, UniMEEC proposes a multimodal causal prompt to probe the pre-trained knowledge specified to modality and implements cross-task and cross-modality interactions under task-oriented settings. Experiment results on four public benchmark datasets verify the model performance on MERC and MECPE tasks and achieve consistent improvements compared with the previous state-of-the-art methods.
Bridging Cultures in the Kitchen: A Framework and Benchmark for Cross-Cultural Recipe Retrieval.
Tianyi Hu, Maria Maistro and Daniel Hershcovich. EMNLP 2024. The cross-cultural adaptation of recipes is an important application of identifying and bridging cultural differences in language. The challenge lies in retaining the essence of the original recipe while also aligning with the writing and dietary habits of the target culture. Information Retrieval (IR) offers a way to address the challenge because it retrieves results from the culinary practices of the target culture while maintaining relevance to the original recipe. We introduce a novel task of cross-cultural recipe retrieval and present a unique Chinese-English cross-cultural recipe retrieval benchmark. Our benchmark is manually annotated with limited resources, utilizing various retrieval models to generate a pool of candidate results for manual annotation. The dataset provides retrieval samples that are culturally adapted but textually diverse, presenting greater challenges. We propose CARROT, a plug-and-play culture-aware recipe information retrieval framework that incorporates culture-aware query rewriting and re-ranking methods, and evaluate it both on our benchmark and intuitive human judgments. The results show that our framework significantly enhances the preservation of the original recipe and its cultural appropriateness for the target culture. We believe these insights will significantly contribute to future research on cultural adaptation.
Vision-Language models under Cultural and Inclusive Considerations.
Antonia Karamolegkou, Phillip Rust, Ruixiang Cui, Yong Cao, Anders Søgaard and Daniel Hershcovich. Human-Centered Large Language Modeling Workshop 2024. Large Vision Language Models can be used to assist visually impaired individuals by describing images they capture in their daily lives. Current evaluation datasets may not reflect the diverse cultural user backgrounds nor the situational context of this use case. To address this problem, we create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind. We then evaluate different models and prompts, investigating their reliability as visual assistants. While the evaluation results for state-of-the-art models seem promising, we identified some weak spots such as hallucinations and problems with conventional evaluation metrics. Our survey, data, code and model outputs will be publicly available.
Can Abstract Meaning Representation Facilitate Fair Legal Judgement Predictions?
Supriti Vijay and Daniel Hershcovich. Workshop on Insights from Negative Results in NLP 2024. Legal judgment prediction encompasses the automated prediction of case outcomes by leveraging historical facts and opinions. While this approach holds the potential to enhance the efficiency of the legal system, it also raises critical concerns regarding the perpetuation of biases. Abstract Meaning Representation (AMR) has shown promise as an intermediate text representation in various downstream NLP tasks due to its ability to capture semantically meaningful information in a graph-like structure. In this paper, we employ this ability of AMR in the legal judgement prediction task and assess to what extent it encodes biases, or conversely, abstracts away from them. Our study reveals that while AMR-based models exhibit worse overall performance than transformer-based models, they are less biased for attributes like age and defendant state compared to gender. By shedding light on these findings, this paper contributes to a more nuanced understanding of AMR’s potential benefits and limitations in legal NLP.
Automated Sentence Generation for a Spaced Repetition Software.
Benjamin Paddags, Daniel Hershcovich and Valkyrie Arline Savage. Workshop on Innovative Use of NLP for Building Educational Applications 2024. Outstanding Paper Award. This paper presents and tests AllAI, an app that utilizes state-of-the-art NLP technology to assist second language acquisition through a novel method of sentence-based spaced repetition. Diverging from current single-word or fixed-sentence repetition, AllAI dynamically combines words due for repetition into sentences, enabling learning words in context while scheduling them independently. This research explores various suitable NLP paradigms and finds a few-shot prompting approach and retrieval of existing sentences from a corpus to yield the best correctness and scheduling accuracy. Subsequently, it evaluates these methods on 26 learners of Danish, finding a four-fold increase in the speed at which new words are learned, compared to conventional spaced repetition. Users of the retrieval method also reported significantly higher enjoyment, hinting at higher user engagement.
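A hypothetical sketch of the retrieval variant mentioned above: greedily pick corpus sentences that cover as many due words as possible while using only words the learner already knows. The function, data layout, and heuristic are my illustration, not the app's actual code:

```python
# corpus: list of (sentence, set of its words); due/known: sets of words.
def pick_sentences(corpus, due_words, known_words, max_sentences=10):
    due, chosen = set(due_words), []
    for _ in range(max_sentences):
        best, best_hits = None, set()
        for sentence, words in corpus:
            hits = words & due
            # Prefer sentences covering more due words, with no unknown words.
            if len(hits) > len(best_hits) and words <= known_words | due:
                best, best_hits = sentence, hits
        if best is None:  # no sentence covers any remaining due word
            break
        chosen.append(best)
        due -= best_hits  # covered words get scheduled by this sentence
    return chosen
```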
CreoleVal: Multilingual Multitask Benchmarks for Creoles.
Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Hans Erik Heje, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard and Johannes Bjerva. TACL, 2024. Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and other highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of brand new development datasets for machine comprehension, relation classification and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, the goal of CreoleVal is to empower research on Creoles in NLP and computational linguistics. We hope this resource will contribute to technological inclusion for Creole language users around the globe.
Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts.
Ali Al-Laith, Alexander Conroy, Jens Bjerring-Hansen and Daniel Hershcovich. LREC-COLING 2024. We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.
Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking.
Yong Cao, Ruixue Ding, Boli Chen, Xianzhi Li, Min Chen, Daniel Hershcovich, Pengjun Xie and Fei Huang. EACL 2024. The Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates, which is crucial for location-related services such as navigation maps. Unlike general sentences, geographic contexts are closely intertwined with geographical concepts, from general spans (e.g., province) to specific spans (e.g., road). Given this feature, we propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines. Our methodology begins by employing off-the-shelf tools to associate text with geographical spans, treating them as chunking units. Then, we present a multi-task learning module to simultaneously acquire an effective attention matrix that determines chunk contributions to extra semantic representations. Furthermore, we put forth an asynchronous update mechanism for the proposed addition task, aiming to guide the model to effectively focus on specific chunks. Experiments on two distinct Chinese geographic re-ranking datasets show that the Geo-Encoder achieves significant improvements when compared to state-of-the-art baselines. Notably, it leads to a substantial improvement in the Hit@1 score of MGEO-BERT, increasing it by 6.22% from 62.76 to 68.98 on the GeoTES dataset.
Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys.
Yong Cao, Min Chen and Daniel Hershcovich. Findings of EACL 2024. The cultural landscape of interactions with dialogue agents is a compelling yet relatively unexplored territory. It's clear that various sociocultural aspects -- from communication styles and beliefs to shared metaphors and knowledge -- profoundly impact these interactions. To delve deeper into this dynamic, we introduce cuDialog, a first-of-its-kind benchmark for dialogue generation with a cultural lens. We also develop baseline models capable of extracting cultural attributes from dialogue exchanges, with the goal of enhancing the predictive accuracy and quality of dialogue agents. To effectively co-learn cultural understanding and multi-turn dialogue predictions, we propose to incorporate cultural dimensions with dialogue encoding features. Our experimental findings highlight that incorporating cultural value surveys boosts alignment with references and cultural markers, demonstrating its considerable influence on personalization and dialogue quality. To facilitate further exploration in this exciting domain, we make our benchmark publicly available at https://github.com/yongcaoplus/cuDialog.
Cultural Adaptation of Recipes.
Yong Cao, Yova Kementchedjhieva, Ruixiang Cui, Antonia Karamolegkou, Li Zhou, Megan Dare, Lucia Donatelli and Daniel Hershcovich. TACL, 2023. Building upon the considerable advances in Large Language Models (LLMs), we are now equipped to address more sophisticated tasks demanding a nuanced understanding of cross-cultural contexts. A key example is recipe adaptation, which goes beyond simple translation to include a grasp of ingredients, culinary techniques and dietary preferences specific to a given culture. We introduce a new task involving the translation and cultural adaptation of recipes between Chinese and English-speaking cuisines. To support this investigation, we present CulturalRecipes, a unique dataset comprised of automatically paired recipes written in Mandarin Chinese and English. This dataset is further enriched with a human-written and curated test set. In this intricate task of cross-cultural recipe adaptation, we evaluate the performance of various methods, including GPT-4 and other LLMs, traditional machine translation and information retrieval techniques. Our comprehensive analysis includes both automatic and human evaluation metrics. While GPT-4 exhibits impressive abilities in adapting Chinese recipes into English, it still lags behind human expertise when translating English recipes into Chinese. This underscores the multifaceted nature of cultural adaptations. We anticipate that these insights will significantly contribute to future research on culturally-aware language models and their practical application in culturally diverse contexts.
Cultural Compass: Predicting Transfer Learning Success in Offensive Language Detection with Cultural Features.
Li Zhou, Antonia Karamolegkou, Wenyu Chen and Daniel Hershcovich. Findings of EMNLP 2023. The increasing ubiquity of language technology necessitates a shift towards considering cultural diversity in the machine learning realm, particularly for subjective tasks that rely heavily on cultural nuances, such as Offensive Language Detection (OLD). Current understanding underscores that these tasks are substantially influenced by cultural values; however, a notable gap exists in determining whether cultural features can accurately predict the success of cross-cultural transfer learning for such subjective tasks. Addressing this, our study delves into the intersection of cultural features and transfer learning effectiveness. The findings reveal that cultural value surveys indeed possess a predictive power for cross-cultural transfer learning success in OLD tasks and that it can be further improved using offensive word distance. Based on these results, we advocate for the integration of cultural information into datasets. Additionally, we recommend leveraging data sources rich in cultural information, such as surveys, to enhance cultural adaptability. Our research signifies a step forward in the quest for more inclusive, culturally sensitive language technologies.
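The predictive signal here can be pictured as a simple correlation test between cultural distance and transfer performance; a sketch under invented data, where the Hofstede-style dimension vectors and transfer scores below are purely illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical cultural value vectors (e.g., six survey dimensions) for the
# countries associated with each source language, plus a target language.
sources = {
    "en": np.array([40, 91, 62, 46, 26, 68]),
    "ko": np.array([60, 18, 39, 85, 100, 29]),
    "da": np.array([18, 74, 16, 23, 35, 70]),
}
target = np.array([80, 20, 66, 30, 87, 24])  # invented target-culture vector

distances = [np.linalg.norm(v - target) for v in sources.values()]
transfer_f1 = [0.58, 0.71, 0.55]  # invented zero-shot transfer scores

r, p = pearsonr(distances, transfer_f1)
print(f"cultural distance vs. transfer F1: r={r:.2f}, p={p:.2f}")
```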
Probing for Hyperbole in Pre-Trained Language Models.
Nina Skovgaard Schneidermann, Daniel Hershcovich and Bolette Sandford Pedersen. ACL Student Research Workshop (SRW) 2023. Hyperbole is a common figure of speech, which is under-explored in NLP research. In this study, we conduct edge and minimal description length (MDL) probing experiments for three pre-trained language models (PLMs) in an attempt to explore the extent to which hyperbolic information is encoded in these models. We use both word-in-context and sentence-level representations as model inputs as a basis for comparison. We also annotate 63 hyperbole sentences from the HYPO dataset according to an operational taxonomy to conduct an error analysis to explore the encoding of different hyperbole categories. Our results show that hyperbole is to a limited extent encoded in PLMs and mostly in the final layers. They also indicate that hyperbolic information may be better encoded by the sentence-level representations, which, due to the pragmatic nature of hyperbole, may therefore provide a more accurate and informative representation in PLMs. Finally, the inter-annotator agreement for our annotations, a Cohen’s Kappa of 0.339, suggests that the taxonomy categories may not be intuitive and need revision or simplification.
On Evaluating Multilingual Compositional Generalization with Translated Datasets.
Zi Wang and Daniel Hershcovich. ACL 2023. Compositional generalization allows efficient learning and human-like inductive biases. Since most research investigating compositional generalization in NLP is done on English, important questions remain underexplored. Do the necessary compositional generalization abilities differ across languages? Can models compositionally generalize cross-lingually? As a first step to answering these questions, recent work used neural machine translation to translate datasets for evaluating compositional generalization in semantic parsing. However, we show that this entails critical semantic distortion. To address this limitation, we craft a faithful rule-based translation of the MCWQ dataset from English to Chinese and Japanese. Even with the resulting robust benchmark, which we call MCWQ-R, we show that the distribution of compositions still suffers due to linguistic divergences and that multilingual models still struggle with cross-lingual compositional generalization. Our dataset and methodology will be useful resources for the study of cross-lingual compositional generalization in other tasks.
What does the Failure to Reason with "Respectively" in Zero/Few-Shot Settings Tell Us about Language Models?
Ruixiang Cui, Seolhwa Lee, Daniel Hershcovich and Anders Søgaard. ACL 2023. Humans can effortlessly understand the coordinate structure of sentences such as "Niels Bohr and Kurt Cobain were born in Copenhagen and Seattle, respectively". In the context of natural language inference (NLI), we examine how language models (LMs) reason with respective readings (Gawron and Kehler, 2004) from two perspectives: syntactic-semantic and commonsense-world knowledge. We propose a controlled synthetic dataset WikiResNLI and a naturally occurring dataset NatResNLI to encompass various explicit and implicit realizations of "respectively". We show that fine-tuned NLI models struggle with understanding such readings without explicit supervision. While few-shot learning is easy in the presence of explicit cues, longer training is required when the reading is evoked implicitly, leaving models to rely on common sense inferences. Furthermore, our fine-grained analysis indicates models fail to generalize across different constructions. To conclude, we demonstrate that LMs still lag behind humans in generalizing to the long tail of linguistic constructions.
What's the Meaning of Superhuman Performance in Today's NLU?
Simone Tedeschi, Johan Bos, Thierry Declerck, Jan Hajic, Daniel Hershcovich, Eduard H. Hovy, Alexander Koller, Simon Krek, Steven Schockaert, Rico Sennrich, Ekaterina Shutova and Roberto Navigli. ACL 2023. Outstanding Paper Award. In the last five years, there has been a significant focus in Natural Language Processing (NLP) on developing larger Pretrained Language Models (PLMs) and introducing benchmarks such as SuperGLUE and SQuAD to measure their abilities in language understanding, reasoning and reading comprehension. These PLMs have achieved impressive results on these benchmarks, even surpassing human performance in some cases. This has led to claims of superhuman capabilities and the provocative idea that certain tasks have been solved. In this position paper, we take a critical look at these claims and ask whether PLMs truly have superhuman abilities and what the current benchmarks are really evaluating. We show that these benchmarks have serious limitations affecting the comparison between humans and PLMs and provide recommendations for fairer and more transparent benchmarks.
Pay More Attention to Relation Exploration for Knowledge Base Question Answering.
Yong Cao, Xianzhi Li, Huiwen Liu, Wen Dai, Shuai Chen, Bin Wang, Min Chen and Daniel Hershcovich. Findings of ACL 2023. Knowledge base question answering (KBQA) is a challenging task that aims to retrieve correct answers from large-scale knowledge bases. Existing attempts primarily focus on entity representation and final answer reasoning, which results in limited supervision for this task. Moreover, the relations, which empirically determine the reasoning path selection, are not fully considered in recent advancements. In this study, we propose a novel framework, RE-KBQA, that utilizes relations in the knowledge base to enhance entity representation and introduce additional supervision. We explore guidance from relations in three aspects, including (1) distinguishing similar entities by employing a variational graph auto-encoder to learn relation importance; (2) exploring extra supervision by predicting relation distributions as soft labels with a multi-task scheme; (3) designing a relation-guided re-ranking algorithm for post-processing. Experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our framework, improving the F1 score by 5.7% from 40.5 to 46.3 on CWQ and 5.8% from 62.8 to 68.5 on WebQSP, better or on par with state-of-the-art methods.
Sentiment Classification of Historical Danish and Norwegian Literary Texts.
Ali Al-Laith, Kirstine Nielsen Degn, Alexander Conroy, Bolette S. Pedersen, Jens Bjerring-Hansen and Daniel Hershcovich. NoDaLiDa 2023. Sentiment classification is valuable for literary analysis, as sentiment is crucial in literary narratives. It can, for example, be used to investigate a hypothesis in the literary analysis of 19th-century Scandinavian novels that the writing of female authors in this period was characterized by negative sentiment, as this paper shows. In order to enable a data-driven analysis of this hypothesis, we create a manually annotated dataset of sentence-level sentiment annotations for novels from this period and use it to train and evaluate various sentiment classification methods. We find that pre-trained multilingual language models outperform models trained on modern Danish, as well as classifiers based on lexical resources. Finally, in classifier-assisted corpus analysis, we confirm the literary hypothesis regarding the author's gender and further shed light on the temporal development of the trend. Our dataset and trained models will be useful for future analysis of historical Danish and Norwegian literary texts.
Cross-Cultural Transfer Learning for Chinese Offensive Language Detection.
Li Zhou, Laura Cabello, Yong Cao and Daniel Hershcovich. C3NLP 2023. Detecting offensive language is a challenging task. Generalizing across different cultures and languages becomes even more challenging: besides lexical, syntactic and semantic differences, pragmatic aspects such as cultural norms and sensitivities, which are particularly relevant in this context, vary greatly. In this paper, we target Chinese offensive language detection and aim to investigate the impact of transfer learning using offensive language detection data from different cultural backgrounds, specifically Korean and English. We find that culture-specific biases in what is considered offensive negatively impact the transferability of language models (LMs) and that LMs trained on diverse cultural data are sensitive to different features in Chinese offensive language detection. In a few-shot learning scenario, however, our study shows promising prospects for non-English offensive language detection with limited resources. Our findings highlight the importance of cross-cultural transfer learning in improving offensive language detection and promoting inclusive digital spaces.
Assessing Cross-Cultural Alignment between ChatGPT and Human Societies: An Empirical Study.
Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen and Daniel Hershcovich. C3NLP 2023. The recent release of ChatGPT has garnered widespread recognition for its exceptional ability to generate human-like responses in dialogue. Given its usage by users from various nations and its training on a vast multilingual corpus that incorporates diverse cultural and societal norms, it is crucial to evaluate its effectiveness in cultural adaptation. In this paper, we investigate the underlying cultural background of ChatGPT by analyzing its responses to questions designed to quantify human cultural differences. Our findings suggest that, when prompted with American context, ChatGPT exhibits a strong alignment with American culture, but it adapts less effectively to other cultural contexts. Furthermore, by using different prompts to probe the model, we show that English prompts reduce the variance in model responses, flattening out cultural differences and biasing them towards American culture. This study provides valuable insights into the cultural implications of ChatGPT and highlights the necessity of greater diversity and cultural awareness in language technologies.
A Two-Sided Discussion of Preregistration of NLP Research.
Anders Søgaard, Daniel Hershcovich and Miryam de Lhoneux. EACL 2023. Van Miltenburg et al. (2021) suggest NLP research should adopt preregistration to prevent fishing expeditions and to promote publication of negative results. At face value, this is a very reasonable suggestion, seemingly solving many methodological problems with NLP research. We discuss pros and cons, some old and some new: a) Preregistration is challenged by the practice of retrieving hypotheses after the results are known; b) preregistration may bias NLP toward confirmatory research; c) preregistration must allow for reclassification of research as exploratory; d) preregistration may increase publication bias; e) preregistration may increase flag-planting; f) preregistration may increase p-hacking; and finally, g) preregistration may make us less risk tolerant. We cast our discussion as a dialogue, presenting both sides of the debate.
A Dataset of Sustainable Diet Arguments on Twitter.
Marcus Astrup Hansen and Daniel Hershcovich. Workshop on NLP for Positive Impact 2022. Sustainable development requires a significant change in our dietary habits. Argument mining can help achieve this goal by both affecting and helping understand people's behavior. We design an annotation scheme for argument mining from online discourse around sustainable diets, including novel evidence types specific to this domain. Using Twitter as a source, we crowdsource a dataset of 597 tweets annotated in relation to 5 topics. We benchmark a variety of NLP models on this dataset, demonstrating strong performance in some sub-tasks, while highlighting remaining challenges.
Can AMR Assist Legal and Logical Reasoning?
Nikolaus Schrack, Ruixiang Cui, Hugo A. López and Daniel Hershcovich. Findings of EMNLP 2022 (long paper). Abstract Meaning Representation (AMR) has been shown to be useful for many downstream tasks. In this work, we explore the use of AMR for legal and logical reasoning, proposing neural architectures that utilize linearised AMR graphs in combination with pre-trained language models. While these models are not able to outperform text-only baselines, they correctly solve different instances than the text models, suggesting complementary abilities. Error analysis further reveals that AMR parsing quality is the most prominent challenge, especially regarding inputs with multiple sentences. We conduct a theoretical analysis of how logical relations are represented in AMR and conclude it might be helpful for some logical statements but not for others.
Towards Climate Awareness in NLP Research.
Daniel Hershcovich, Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler and Markus Leippold. EMNLP 2022 (long paper). The climate impact of AI, and NLP research in particular, has become a serious issue given the enormous amount of energy that is increasingly being used for training and running computational models. Consequently, increasing focus is placed on efficient NLP. However, this important initiative lacks simple guidelines that would allow for systematic climate reporting of NLP research. We argue that this deficiency is one of the reasons why very few publications in NLP report key figures that would allow a more thorough examination of environmental impact. As a remedy, we propose a climate performance model card with the primary purpose of being practically usable with only limited information about experiments and the underlying computer hardware. We describe why this step is essential to increase awareness about the environmental impact of NLP research and, thereby, to pave the way for more thorough discussions.
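The proposed card is essentially a short, structured checklist; below is a sketch of recording one as a simple data structure, with field names paraphrased from the paper's proposal rather than quoted, and all values invented:

```python
from dataclasses import dataclass

@dataclass
class ClimateModelCard:
    model_publicly_available: bool
    final_training_time_hours: float
    total_experiment_time_hours: float  # including hyperparameter search
    average_power_draw_watts: float     # GPU + CPU
    compute_location: str
    grid_carbon_intensity_gco2e_per_kwh: float
    co2e_final_model_kg: float
    co2e_all_experiments_kg: float

# Example: a 24 h final run at 0.3 kW on a 180 gCO2e/kWh grid
# -> 7.2 kWh * 180 g/kWh = 1.296 kg CO2e.
card = ClimateModelCard(
    model_publicly_available=True,
    final_training_time_hours=24.0,
    total_experiment_time_hours=140.0,
    average_power_draw_watts=300.0,
    compute_location="Denmark",
    grid_carbon_intensity_gco2e_per_kwh=180.0,
    co2e_final_model_kg=24.0 * 0.3 * 180.0 / 1000,
    co2e_all_experiments_kg=140.0 * 0.3 * 180.0 / 1000,
)
```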
Compositional Generalization in Multilingual Semantic Parsing over Wikidata.
Ruixiang Cui, Rahul Aralikatte, Heather Lent and Daniel Hershcovich. TACL, 2022. Semantic parsing (SP) allows humans to leverage vast knowledge resources through natural interaction. However, parsers are mostly designed for and evaluated on English resources, such as CFQ (Keysers et al., 2020), the current standard benchmark based on English data generated from grammar rules and oriented towards Freebase, an outdated knowledge base. We propose a method for creating a multilingual, parallel dataset of question-query pairs, grounded in Wikidata. We introduce such a dataset, which we call Multilingual Compositional Wikidata Questions (MCWQ) and use it to analyze the compositional generalization of semantic parsers in Hebrew, Kannada, Chinese and English. While within-language generalization is comparable across languages, experiments on zero-shot cross-lingual transfer demonstrate that cross-lingual compositional generalization fails, even with state-of-the-art pretrained multilingual encoders. Furthermore, our methodology, dataset and results will facilitate future research on SP in more realistic and diverse settings than has been possible with existing resources.
Generalized Quantifiers as a Source of Error in Multilingual NLU Benchmarks.
Ruixiang Cui, Daniel Hershcovich and Anders Søgaard. NAACL 2022 (long paper). Logical approaches to representing language have developed and evaluated computational models of quantifier words since the 19th century, but today's NLU models still struggle to capture their semantics. We rely on Generalized Quantifier Theory for language-independent representations of the semantics of quantifier words, to quantify their contribution to the errors of NLU models. We find that quantifiers are pervasive in NLU benchmarks and their occurrence at test time is associated with performance drops. Multilingual models also exhibit unsatisfying quantifier reasoning abilities, but not necessarily worse for non-English languages. To facilitate directly-targeted probing, we present an adversarial generalized quantifier NLI task (GQNLI) and show that pre-trained language models have a clear lack of robustness in generalized quantifier reasoning.
Evaluating Deep Taylor Decomposition for Reliability Assessment in the Wild.
Stephanie Brandl, Daniel Hershcovich and Anders Søgaard. ICWSM 2022. We argue that we need to evaluate model interpretability methods in the wild, in situations where professionals make critical decisions and models can potentially assist them. We present an in-the-wild evaluation of token attribution based on Deep Taylor Decomposition, with professional journalists performing reliability assessments. We find that using this method in conjunction with RoBERTa-Large, fine-tuned on the Gossip Corpus, led to faster and better human decision-making, as well as a more critical attitude toward news sources among the journalists. We present a comparison of human and model rationales, as well as a qualitative analysis of the journalists' experiences with machine-in-the-loop decision making.
Challenges and Strategies in Cross-Cultural NLP.
Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust and Anders Søgaard. ACL 2022 (long paper). Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts and survey existing and potential strategies.
Scaling Creative Inspiration with Fine-Grained Functional Facets of Product Ideas.
Tom Hope, Ronen Tamari, Hyeonsu Kang, Daniel Hershcovich, Joel Chan, Aniket Kittur and Dafna Shahaf. CHI 2022. Web-scale repositories of products, patents and scientific papers offer an opportunity for creating automated systems that scour millions of ideas and assist users in discovering inspirations and solutions. Yet the common representation of ideas is in the form of raw textual descriptions, lacking important structure that is required for supporting creative innovation. Prior work has pointed to the importance of functional structure -- capturing the mechanisms and purposes of inventions -- for allowing users to discover structural connections across ideas and creatively adapt existing technologies. However, the use of functional representations was either coarse and limited in expressivity, or dependent on curated knowledge bases with poor coverage and significant manual effort from users. To help bridge this gap and unlock the potential of large-scale idea mining, we propose a novel computational representation that automatically breaks up products into fine-grained functional facets. We train a model to extract these facets from a challenging real-world corpus of invention descriptions and represent each product as a set of facet embeddings. We design similarity metrics that support granular matching between functional facets across ideas and use them to build a novel functional search capability that enables expressive queries for mechanisms and purposes. We construct a graph capturing hierarchical relations between purposes and mechanisms across an entire corpus of products and use the graph to help problem-solvers explore the design space around a focal problem and view related problem perspectives. In empirical user studies, our approach leads to a significant boost in search accuracy and in the quality of creative inspirations, outperforming strong baselines and state-of-the-art representations of product texts by 50-60%.
A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs.
Mareike Hartmann, Miryam de Lhoneux, Daniel Hershcovich, Yova Kementchedjhieva, Lukas Nielsen, Chen Qiu and Anders Søgaard.
CoNLL 2021. Negation is one of the most fundamental concepts in human cognition and language and several natural language inference (NLI) probes have been designed to investigate pretrained language models' ability to detect and reason with negation. However, the existing probing datasets are limited to English only and do not enable controlled probing of performance in the absence or presence of negation. In response, we present a multilingual (English, Bulgarian, German, French and Chinese) benchmark collection of NLI examples that are grammatical and correctly labeled, as a result of manual inspection and editing. We use the benchmark to probe the negation-awareness of multilingual language models and find that models that correctly predict examples with negation cues often fail to correctly predict their counter-examples without negation cues, even when the cues are irrelevant for semantic inference.
Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color.
Mostafa Abdou, Artur Kulmizev, Daniel Hershcovich, Stella Frank, Ellie Pavlick and Anders Søgaard.
CoNLL 2021. Pretrained language models have been shown to encode relational information, such as the relations between entities or concepts in knowledge-bases -- (Paris, Capital, France). However, simple relations of this type can often be recovered heuristically and the extent to which models implicitly reflect topological structure that is grounded in the world, such as perceptual structure, is unknown. To explore this question, we conduct a thorough case study on color. Namely, we employ a dataset of monolexemic color terms and color chips represented in CIELAB, a color space with a perceptually meaningful distance metric. Using two methods of evaluating the structural alignment of colors in this space with text-derived color term representations, we find significant correspondence. Analyzing the differences in alignment across the color spectrum, we find that warmer colors are, on average, better aligned to the perceptual color space than cooler ones, suggesting an intriguing connection to findings from recent work on efficient communication in color naming. Further analysis suggests that differences in alignment are, in part, mediated by collocationality and differences in syntactic usage, posing questions as to the relationship between color perception and usage and context.
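One way to picture the structural-alignment test is representational similarity analysis: correlate pairwise distances between color terms in a model's representation space with their perceptual distances in CIELAB. A sketch with invented embeddings and approximate CIELAB coordinates, not the paper's exact procedure:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

terms = ["red", "orange", "green", "blue"]
cielab = np.array([[53, 80, 67], [67, 43, 74], [46, -52, 50], [32, 79, -108]])
embeddings = np.random.randn(len(terms), 768)  # stand-in for model vectors

perceptual = pdist(cielab)                      # CIELAB Euclidean ~ perceptual
representational = pdist(embeddings, "cosine")  # distances in model space

rho, _ = spearmanr(perceptual, representational)
print(f"structural alignment (Spearman rho): {rho:.2f}")
```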
Great Service! Fine-grained Parsing of Implicit Arguments.
Ruixiang Cui and Daniel Hershcovich.
IWPT 2021.Broad-coverage meaning representations in NLP mostly focus on explicitly expressed content. More importantly, the scarcity of datasets annotating diverse implicit roles limits empirical studies into their linguistic nuances. For example, in the web review "Great service!", the provider and consumer are implicit arguments of different types. We examine an annotated corpus of fine-grained implicit arguments (Cui and Hershcovich, 2020) by carefully re-annotating it, resolving several inconsistencies. Subsequently, we present the first transition-based neural parser that can handle implicit arguments dynamically and experiment with two different transition systems on the improved dataset. We find that certain types of implicit arguments are more difficult to parse than others and that the simpler system is more accurate in recovering implicit arguments, despite having a lower overall parsing score, attesting current reasoning limitations of NLP models. This work will facilitate a better understanding of implicit and underspecified language, by incorporating it holistically into meaning representations. -
Lexical Semantic Recognition.
Nelson F. Liu, Daniel Hershcovich, Michael Kranzlein and Nathan Schneider.
MWE 2021.In lexical semantics, full-sentence segmentation and segment labeling of various phenomena are generally treated separately, despite their interdependence. We hypothesize that a unified lexical semantic recognition task is an effective way to encapsulate previously disparate styles of annotation, including multiword expression identification / classification and supersense tagging. Using the STREUSLE corpus, we train a neural CRF sequence tagger and evaluate its performance along various axes of annotation. As the label set generalizes that of previous tasks (PARSEME, DiMSUM), we additionally evaluate how well the model generalizes to those test sets, finding that it approaches or surpasses existing models despite training only on STREUSLE. Our work also establishes baseline models and evaluation metrics for integrated and accurate modeling of lexical semantics, facilitating future work in this area. -
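Concretely, the unified task amounts to predicting a single tag sequence that encodes both segmentation and labels. A minimal sketch of such a BIO-style encoding and its decoding, with simplified stand-in labels rather than STREUSLE's actual inventory (a neural CRF would predict the tags):
```python
tokens = ["She", "looked", "up", "the", "word"]
# "looked up" is a multiword expression; tags jointly encode segment
# boundaries (B/I/O) and simplified stand-in supersense labels.
tags = ["O", "B-V.cognition", "I-V.cognition", "O", "B-N.communication"]

def decode(tokens, tags):
    """Recover labeled segments from a BIO tag sequence."""
    segments, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = ([token], tag[2:])
            segments.append(current)
        elif tag.startswith("I-") and current is not None:
            current[0].append(token)
        else:
            current = None
    return [(" ".join(words), label) for words, label in segments]

print(decode(tokens, tags))
# [('looked up', 'V.cognition'), ('word', 'N.communication')]
```
-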
It’s the Meaning That Counts: The State of the Art in NLP and Semantics.
Daniel Hershcovich and Lucia Donatelli.
KI - Künstliche Intelligenz, 2021.Semantics, the study of meaning, is central to research in Natural Language Processing (NLP) and many other fields connected to Artificial Intelligence. Nevertheless, how semantics is understood in NLP ranges from traditional, formal linguistic definitions based on logic and the principle of compositionality to more applied notions based on grounding meaning in real-world objects and real-time interaction. “Semantic” methods may additionally strive for meaningful representation of language that integrates broader aspects of human cognition and embodied experience, calling into question how adequate a representation of meaning based on linguistic signal alone is for current research agendas. We review the state of computational semantics in NLP and investigate how different lines of inquiry reflect distinct understandings of semantics and prioritize different layers of linguistic meaning. In conclusion, we identify several important goals of the field and describe how current research addresses them. -
How far can we get with one GPU in 100 hours? CoAStaL at MultiIndicMT Shared Task.
Rahul Aralikatte, Héctor Ricardo Murrieta Bello, Daniel Hershcovich, Marcel Bollmann and Anders Søgaard.
MultiIndicMT Shared Task 2021.This work shows that competitive translation results can be obtained in a constrained setting by incorporating the latest advances in memory and compute optimization. We train and evaluate large multilingual translation models using a single GPU for a maximum of 100 hours and get within 4-5 BLEU points of the top submission on the leaderboard. We also benchmark standard baselines on the PMI corpus and re-discover well-known shortcomings of translation systems and metrics. -
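Two standard optimizations in this kind of constrained setting are mixed-precision training and gradient accumulation. A generic PyTorch sketch of the pattern, with a stand-in model and loss (not the paper's training code):
```python
import torch

# Stand-ins: a tiny linear layer and a dummy loss in place of a large
# multilingual translation model and its cross-entropy objective.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accumulation_steps = 8  # effective batch = 8 micro-batches

for step in range(16):
    batch = torch.randn(4, 512, device=device)  # stand-in micro-batch
    # Mixed precision: compute the forward pass in reduced precision on GPU.
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = model(batch).pow(2).mean()
    # Gradient accumulation: sum gradients over several micro-batches.
    scaler.scale(loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)  # one optimizer update per effective batch
        scaler.update()
        optimizer.zero_grad()
```
-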
Moses and the Character-Based Random Babbling Baseline: CoAStaL at AmericasNLP 2021 Shared Task.
Marcel Bollmann, Rahul Aralikatte, Héctor Murrieta Bello, Daniel Hershcovich, Miryam de Lhoneux and Anders Søgaard.
AmericasNLP Shared Task 2021.We evaluated a range of neural machine translation techniques developed specifically for low-resource scenarios. Unsuccessfully. In the end, we submitted two runs: (i) a standard phrase-based model and (ii) a random babbling baseline using character trigrams. We found that it was surprisingly hard to beat (i), in spite of this model being, in theory, a bad fit for polysynthetic languages; and more interestingly, that (ii) was better than several of the submitted systems, highlighting how difficult low-resource machine translation for polysynthetic languages is. -
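The character-trigram babbling baseline is simple enough to sketch: collect trigram statistics from a corpus and sample one character at a time. A toy reconstruction (not the submitted system):
```python
import random
from collections import defaultdict

def babble(corpus, length=40, seed=0):
    """Sample characters from the trigram statistics of a corpus."""
    rng = random.Random(seed)
    continuations = defaultdict(list)
    for i in range(len(corpus) - 2):
        continuations[corpus[i:i + 2]].append(corpus[i + 2])
    out = corpus[:2]
    while len(out) < length:
        options = continuations.get(out[-2:])
        if not options:          # dead end: restart from the corpus opening
            out += corpus[:2]
            continue
        out += rng.choice(options)
    return out

print(babble("the cat sat on the mat and the rat sat on the hat "))
```
-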
An autonomous debating system.
Noam Slonim, Yonatan Bilu, Carlos Alzate, Roy Bar-Haim, Ben Bogin, Francesca Bonin, Leshem Choshen, Edo Cohen-Karlik, Lena Dankin, Lilach Edelstein, Liat Ein-Dor, Roni Friedman-Melamed, Assaf Gavron, Ariel Gera, Martin Gleize, Shai Gretz, Dan Gutfreund, Alon Halfon, Daniel Hershcovich, Ron Hoory, Yufang Hou, Shay Hummel, Michal Jacovi, Charles Jochim, Yoav Kantor, Yoav Katz, David Konopnicki, Zvi Kons, Lili Kotlerman, Dalia Krieger, Dan Lahav, Tamar Lavee, Ran Levy, Naftali Liberman, Yosi Mass, Amir Menczel, Shachar Mirkin, Guy Moshkowich, Shila Ofek-Koifman, Matan Orbach, Ella Rabinovich, Ruty Rinott, Slava Shechtman, Dafna Sheinwald, Eyal Shnarch, Ilya Shnayderman, Aya Soffer, Artem Spector, Benjamin Sznajder, Assaf Toledo, Orith Toledo-Ronen, Elad Venezian and Ranit Aharonov.
Nature, 2021.Artificial intelligence (AI) is defined as the ability of machines to perform tasks that are usually associated with intelligent beings. Argument and debate are fundamental capabilities of human intelligence, essential for a wide range of human activities and common to all human societies. The development of computational argumentation technologies is therefore an important emerging discipline in AI research. Here we present Project Debater, an autonomous debating system that can engage in a competitive debate with humans. We provide a complete description of the system’s architecture, a thorough and systematic evaluation of its operation across a wide range of debate topics and a detailed account of the system’s performance in its public debut against three expert human debaters. We also highlight the fundamental differences between debating with humans as opposed to challenging humans in game competitions, the latter being the focus of classical ‘grand challenges’ pursued by the AI research community over the past few decades. We suggest that such challenges lie in the ‘comfort zone’ of AI, whereas debating with humans lies in a different territory, in which humans still prevail and for which novel paradigms are required to make substantial progress. -
Joint Semantic Analysis with Document-Level Cross-Task Coherence Rewards.
Rahul Aralikatte, Mostafa Abdou, Heather Lent, Daniel Hershcovich and Anders Søgaard.
AAAI 2021 (long paper).Coreference resolution and semantic role labeling are NLP tasks that capture different aspects of semantics, indicating respectively, which expressions refer to the same entity and what semantic roles expressions serve in the sentence. However, they are often closely interdependent and both generally necessitate natural language understanding. Do they form a coherent abstract representation of documents? We present a neural network architecture for joint coreference resolution and semantic role labeling for English and train graph neural networks to model the 'coherence' of the combined shallow semantic graph. Using the resulting coherence score as a reward for our joint semantic analyzer, we use reinforcement learning to encourage global coherence over the document and between semantic annotations. This leads to improvements on both tasks in multiple datasets from different domains and across a range of encoders of different expressivity, calling, we believe, for a more holistic approach for semantics in NLP. -
Refining Implicit Argument Annotation for UCCA.
Ruixiang Cui and Daniel Hershcovich.
DMR 2020.Predicate-argument structure analysis is a central component in meaning representations of text. The fact that some arguments are not explicitly mentioned in a sentence gives rise to ambiguity in language understanding and renders it difficult for machines to interpret text correctly. However, only a few resources represent implicit roles for NLU and existing studies in NLP only make coarse distinctions between categories of arguments omitted from linguistic form. To better understand the behaviour of implicit roles and their characteristics, in this paper, we design a typology for fine-grained implicit argument annotation on top of Universal Conceptual Cognitive Annotation's foundational layer. The proposed implicit argument categorisation is driven by theories of implicit role interpretation and consists of six types: Deictic, Generic, Genre-based, Type-identifiable, Non-specific and Iterated-set. We exemplify our design by revisiting part of the UCCA EWT corpus, providing a new dataset annotated with the refinement layer and making a comparative analysis with other schemes both in terms of quantity and quality.
-
Comparison by Conversion: Reverse-Engineering UCCA from Syntax and Lexical Semantics.
Daniel Hershcovich, Nathan Schneider, Dotan Dvir, Jakob Prange, Miryam de Lhoneux and Omri Abend.
COLING 2020.Building robust natural language understanding systems will require a clear characterization of whether and how various linguistic meaning representations complement each other. To perform a systematic comparative analysis, we evaluate the mapping between meaning representations from different frameworks using two complementary methods: (i) a rule-based converter and (ii) a supervised delexicalized parser that parses to one framework using only information from the other as features. We apply these methods to convert the STREUSLE corpus (with syntactic and lexical semantic annotations) to UCCA (a graph-structured full-sentence meaning representation). Both methods yield surprisingly accurate target representations, close to fully supervised UCCA parser quality---indicating that UCCA annotations are partially redundant with STREUSLE annotations. Despite this substantial convergence between frameworks, we find several important areas of divergence. -
MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing.
Stephan Oepen, Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajic, Daniel Hershcovich, Bin Li, Tim O'Gorman, Nianwen Xue and Daniel Zeman.
CoNLL 2020 shared task.The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results and links to supporting resources and software are available from the task web site at: https://mrp.nlpl.eu -
HUJI-KU at MRP 2020: Two Transition-based Neural Parsers.
Ofir Arviv, Ruixiang Cui and Daniel Hershcovich.
CoNLL 2020 shared task.This paper describes the HUJI-KU system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2020 Conference for Computational Language Learning (CoNLL), employing TUPA and the HIT-SCIR parser, which were, respectively, the baseline system and winning system in the 2019 MRP shared task. Both are transition-based parsers using BERT contextualized embeddings. We generalized TUPA to support the newly-added MRP frameworks and languages, and experimented with multitask learning with the HIT-SCIR parser. We reached 4th place in both the cross-framework and cross-lingual tracks. -
Køpsala: Transition-Based Graph Parsing via Efficient Training and Effective Encoding.
Daniel Hershcovich, Miryam de Lhoneux, Artur Kulmizev, Elham Pejhan and Joakim Nivre.
IWPT 2020 shared task.We present Køpsala, the Copenhagen-Uppsala system for the Enhanced Universal Dependencies Shared Task at IWPT 2020. Our system is a pipeline consisting of off-the-shelf models for everything but enhanced graph parsing and for the latter, a transition-based graph parser adapted from Che et al. (2019). We train a single enhanced parser model per language, using gold sentence splitting and tokenization for training and rely only on tokenized surface forms and multilingual BERT for encoding. While a bug introduced just before submission resulted in a severe drop in precision, its post-submission fix would bring us to 4th place in the official ranking, according to average ELAS. Our parser demonstrates that a unified pipeline is effective for both Meaning Representation Parsing and Enhanced Universal Dependencies. -
MRP 2019: Cross-Framework Meaning Representation Parsing.
Stephan Oepen, Omri Abend, Jan Hajič, Daniel Hershcovich, Marco Kuhlmann, Tim O'Gorman, Nianwen Xue, Jayeol Chun, Milan Straka and Zdeňka Urešová.
CoNLL 2019 shared task.The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks. Five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the training and evaluation data for the task, packaged in a uniform abstract graph representation and serialization. The task received submissions from eighteen teams, of which five do not participate in the official ranking because they arrived after the closing deadline, made use of additional training data, or involved one of the task co-organizers. All technical information regarding the task, including system submissions, official results and links to supporting resources and software are available from the task web site at: https://mrp.nlpl.eu -
TUPA at MRP 2019: A Multi-Task Baseline System.
Daniel Hershcovich and Ofir Arviv.
CoNLL 2019 shared task.This paper describes the TUPA system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference for Computational Language Learning (CoNLL). TUPA provides a baseline point of comparison and is not considered in the official ranking of participating systems. While originally developed for UCCA only, TUPA has been generalized to support all MRP frameworks included in the task and trained using multi-task learning to parse them all with a shared model. It is a transition-based parser with a BiLSTM encoder, augmented with BERT contextualized embeddings. -
Rewarding Coreference Resolvers for Being Consistent with World Knowledge.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Hershcovich, Chen Qiu, Anders Sandholm, Michael Ringgaard and Anders Søgaard.
EMNLP-IJCNLP 2019 (short paper).Unresolved coreference is a bottleneck for relation extraction and high-quality coreference resolvers may produce an output that makes it a lot easier to extract knowledge triples. We show how to improve coreference resolvers by forwarding their input to a relation extraction system and reward the resolvers for producing triples that are found in knowledge bases. Since relation extraction systems can rely on different forms of supervision and be biased in different ways, we obtain the best performance, improving over the state of the art, using multi-task reinforcement learning. -
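The reward computation is easy to sketch: extract triples from coreference-resolved text and reward the share confirmed by a knowledge base. Everything below (the KB, the extraction heuristic, the sentences) is a toy stand-in for the paper's actual resolvers and relation extraction system:
```python
# Toy knowledge base of (subject, relation, object) triples.
KB = {("Copenhagen", "capital_of", "Denmark")}

def extract_triples(text):
    """Stand-in for a relation extraction system run on resolved text."""
    triples = []
    for sentence in text.split("."):
        if "Copenhagen" in sentence and "Denmark" in sentence:
            triples.append(("Copenhagen", "capital_of", "Denmark"))
    return triples

def reward(resolved_text):
    """Fraction of extracted triples confirmed by the knowledge base."""
    triples = extract_triples(resolved_text)
    return sum(t in KB for t in triples) / len(triples) if triples else 0.0

print(reward("Copenhagen is nice. It is the capital of Denmark."))          # 0.0
print(reward("Copenhagen is nice. Copenhagen is the capital of Denmark."))  # 1.0
```
A resolver that correctly replaces the pronoun makes the triple extractable and earns the reward, which is the signal the reinforcement learning uses.
-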
The Language of Legal and Illegal Activity on the Darknet.
Leshem Choshen*, Dan Eldad*, Daniel Hershcovich*, Elior Sulem* and Omri Abend.
ACL 2019 (long paper).The non-indexed parts of the Internet (the Darknet) have become a haven for both legal and illegal anonymous activity. Given the magnitude of these networks, scalably monitoring their activity necessarily relies on automated tools and notably on NLP tools. However, little is known about what characteristics texts communicated through the Darknet have and how well off-the-shelf NLP tools do on this domain. This paper tackles this gap and performs an in-depth investigation of the characteristics of legal and illegal text in the Darknet, comparing it to a clear net website with similar content as a control condition. Taking drug-related websites as a test case, we find that texts for selling legal and illegal drugs have several linguistic characteristics that distinguish them from one another, as well as from the control condition, among them the distribution of POS tags and the coverage of their named entities in Wikipedia. -
Argument Invention from First Principles.
Yonatan Bilu, Ariel Gera, Daniel Hershcovich, Benjamin Sznajder, Dan Lahav, Guy Moshkowich, Anael Malet, Assaf Gavron and Noam Slonim.
ACL 2019 (long paper).Competitive debaters often find themselves facing a challenging task -- how to debate a topic they know very little about, with only minutes to prepare and without access to books or the Internet? What they often do is rely on ''first principles'', commonplace arguments which are relevant to many topics and which they have refined in past debates. In this work we aim to explicitly define a taxonomy of such principled recurring arguments and, given a controversial topic, to automatically identify which of these arguments are relevant to the topic. As far as we know, this is the first time that this approach to argument invention is formalized and made explicit in the context of NLP. The main goal of this work is to show that it is possible to define such a taxonomy. While the taxonomy suggested here should be thought of as a ''first attempt'' it is nonetheless coherent, covers well the relevant topics and coincides with what professional debaters actually argue in their speeches and facilitates automatic argument invention for new topics. -
SemEval 2019 Task 1: Cross-lingual Semantic Parsing with UCCA.
Daniel Hershcovich, Zohar Aizenbud, Leshem Choshen, Elior Sulem, Ari Rappoport and Omri Abend.
SemEval 2019 shared task.We present the SemEval 2019 shared task on Universal Conceptual Cognitive Annotation (UCCA) parsing in English, German and French and discuss the participating systems and results. UCCA is a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. The shared task has yielded improvements over the state-of-the-art baseline in all languages and settings. Full results can be found in the task's website. -
Syntactic Interchangeability in Word Embedding Models.
Daniel Hershcovich, Assaf Toledo, Alon Halfon and Noam Slonim.
RepEval 2019.Nearest neighbors in word embedding models are commonly observed to be semantically similar, but the relations between them can vary greatly. We investigate the extent to which word embedding models preserve syntactic interchangeability, as reflected by distances between word vectors and the effect of hyper-parameters---context window size in particular. We use part of speech (POS) as a proxy for syntactic interchangeability, as generally speaking, words with the same POS are syntactically valid in the same contexts. We also investigate the relationship between interchangeability and similarity as judged by commonly-used word similarity benchmarks and correlate the result with the performance of word embedding models on these benchmarks. Our results will inform future research and applications in the selection of word embedding models, suggesting a principle for an appropriate selection of the context window size parameter depending on the use-case. -
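The core measurement can be sketched directly: for each word, check how often its nearest neighbours in embedding space share its POS. A toy version, assuming invented vectors and a tiny tagged vocabulary (real experiments would vary the context window used to train the embeddings):
```python
import numpy as np

# Invented vectors and a tiny tagged vocabulary, for illustration only.
vocab = ["run", "walk", "table", "chair", "quickly"]
pos = {"run": "VERB", "walk": "VERB", "table": "NOUN",
       "chair": "NOUN", "quickly": "ADV"}
vectors = dict(zip(vocab, np.random.default_rng(0).normal(size=(5, 8))))

def nearest_neighbours(word, k=2):
    """k nearest neighbours by cosine similarity, excluding the word itself."""
    u = vectors[word]
    sims = {w: np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            for w, v in vectors.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Share of neighbours that are syntactically interchangeable (same POS).
score = np.mean([pos[n] == pos[w] for w in vocab for n in nearest_neighbours(w)])
print(f"neighbours with matching POS: {score:.2f}")
```
-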
Content Differences in Syntactic and Semantic Representations.
Daniel Hershcovich, Omri Abend and Ari Rappoport.
NAACL 2019 (long paper).Syntactic analysis plays an important role in semantic parsing, but this role remains a topic of ongoing debate. The debate has been constrained by the scarcity of empirical comparative studies between syntactic and semantic schemes, which hinders the development of parsing methods informed by the details of target schemes and constructions. We target this gap and take Universal Dependencies (UD) and UCCA as a test case. After abstracting away from differences of convention or formalism, we find that most content divergences can be ascribed to: (1) UCCA's distinction between a Scene and a non-Scene; (2) UCCA's distinction between primary relations, secondary ones and participants; (3) different treatment of multi-word expressions and (4) different treatment of inter-clause linkage. We further discuss the long tail of cases where the two schemes take markedly different approaches. Finally, we show that the proposed comparison methodology can be used for fine-grained evaluation of UCCA parsing, highlighting both challenges and potential sources for improvement. The substantial differences between the schemes suggest that semantic parsers are likely to benefit downstream text understanding applications beyond their syntactic counterparts. -
Universal Dependency Parsing with a General Transition-Based DAG Parser.
Daniel Hershcovich, Omri Abend and Ari Rappoport.
CoNLL 2018 UD Shared Task.This paper presents our experiments with applying TUPA to the CoNLL 2018 UD shared task. TUPA is a general neural transition-based DAG parser, which we use to present the first experiments on recovering enhanced dependencies as part of the general parsing task. TUPA was designed for parsing UCCA, a cross-linguistic semantic annotation scheme, exhibiting reentrancy, discontinuity and non-terminal nodes. By converting UD trees and graphs to a UCCA-like DAG format, we train TUPA almost without modification on the UD parsing task. The generic nature of our approach lends itself naturally to multitask learning. -
Multitask Parsing Across Semantic Representations.
Daniel Hershcovich, Omri Abend and Ari Rappoport.
ACL 2018 (long paper).The ability to consolidate information of different types is at the core of intelligence and has tremendous practical value in allowing learning for one task to benefit from generalizations learned for others. In this paper we tackle the challenging task of improving semantic parsing performance, taking UCCA parsing as a test case and AMR, SDP and Universal Dependencies (UD) parsing as auxiliary tasks. We experiment on three languages, using a uniform transition-based system and learning architecture for all parsing tasks. Despite notable conceptual, formal and domain differences, we show that multitask learning significantly improves UCCA parsing in both in-domain and out-of-domain settings. -
A Transition-Based Directed Acyclic Graph Parser for UCCA.
Daniel Hershcovich, Omri Abend and Ari Rappoport.
ACL 2017 (long paper). Outstanding Paper Award.We present the first parser for UCCA, a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. To our knowledge, the conjunction of these formal properties is not supported by any existing parser. Our transition-based parser, which uses a novel transition set and features based on bidirectional LSTMs, has value not just for UCCA parsing: its ability to handle more general graph structures can inform the development of parsers for other semantic DAG structures and in languages that frequently use discontinuous structures. -
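To give a flavour of transition-based DAG parsing, here is a drastically simplified stack-and-buffer system driven by a fixed oracle sequence. TUPA's actual transition set is richer (REDUCE, SWAP, remote edges) and its transitions are chosen by a learned BiLSTM-based classifier; the edge labels below are merely illustrative:
```python
def parse(tokens, oracle):
    """Apply a fixed sequence of transitions to build labeled graph edges."""
    stack, buffer, edges = [], list(tokens), []
    node_count = len(tokens)
    for action, *args in oracle:
        if action == "SHIFT":    # move the next buffer token onto the stack
            stack.append(buffer.pop(0))
        elif action == "NODE":   # create a non-terminal parent of stack[-1]
            parent = f"unit{node_count}"
            node_count += 1
            edges.append((parent, stack[-1], args[0]))
            stack.append(parent)
        elif action == "EDGE":   # attach stack[-1] as a child of stack[-2]
            edges.append((stack[-2], stack[-1], args[0]))
            stack.pop()
    return edges

# "Great service": one non-terminal unit whose children carry toy labels.
oracle = [("SHIFT",), ("NODE", "D"), ("SHIFT",), ("EDGE", "P")]
print(parse(["Great", "service"], oracle))
# [('unit2', 'Great', 'D'), ('unit2', 'service', 'P')]
```
-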
Automatic Claim Negation: Why, How and When.
Yonatan Bilu, Daniel Hershcovich and Noam Slonim.
NAACL HLT 2015 (long paper).The main goal of argumentation mining is to analyze argumentative structures within an argument-rich document and reason about their composition. Recently, there is also interest in the task of simply detecting claims (sometimes called conclusion) in general documents. In this work we ask how this set of detected claims can be augmented further, by adding to it the negation of each detected claim. This presents two NLP problems: how to automatically negate a claim and when such a negated claim can plausibly be used. We present first steps into solving both these problems, using a rule-based approach for the former and a statistical one towards the latter. -
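A miniature version of the rule-based negation component, assuming just two illustrative rules (the paper's rules are more elaborate): drop an existing negation if present, otherwise insert "not" after an auxiliary.
```python
import re

# Ordered rules: first try to drop an existing negation, then insert one.
RULES = [
    (re.compile(r"\bnot\s+"), lambda m: ""),
    (re.compile(r"\b(is|are|was|were|should|can|will)\b"),
     lambda m: m.group(0) + " not"),
]

def negate(claim):
    for pattern, replacement in RULES:
        if pattern.search(claim):
            return pattern.sub(replacement, claim, count=1)
    return claim  # no rule applied; a real system would abstain here

print(negate("Video games are beneficial for children"))
print(negate("Censorship should not be allowed"))
```
Deciding when such an output is a plausible claim is the separate, statistical part of the paper's approach.
-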
Context Dependent Claim Detection.
Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni and Noam Slonim.
COLING 2014 (long paper).While discussing a concrete controversial topic, most humans will find it challenging to swiftly raise a diverse set of convincing and relevant claims that should set the basis of their arguments. Here, we formally define the challenging task of automatic claim detection in a given context and discuss its associated unique difficulties. Further, we outline a preliminary solution to this task and assess its performance over annotated real world data, collected specifically for that purpose over hundreds of Wikipedia articles. We report promising results of a supervised learning approach, which is based on a cascade of classifiers designed to properly handle the skewed data which is inherent to the defined task. These results demonstrate the viability of the introduced task. -
Claims on Demand – an Initial Demonstration of a System for Automatic Detection and Polarity Identification of Context Dependent Claims in Massive Corpora.
Ehud Aharoni, Carlos Alzate, Roy Bar-Haim, Yonatan Bilu, Lena Dankin, Iris Eiron, Daniel Hershcovich and Shay Hummel.
COLING 2014 (system demonstration).While discussing a concrete controversial topic, most humans will find it challenging to swiftly raise a diverse set of convincing and relevant claims that should set the basis of their arguments. Here, we demonstrate the initial capabilities of a system that, given a controversial topic, can automatically pinpoint relevant claims in Wikipedia, determine their polarity with respect to the given topic and articulate them per the user's request. -
A Benchmark Dataset for Automatic Detection of Claims and Evidence in the Context of Controversial Topics.
Ehud Aharoni, Anatoly Polnarov, Tamar Lavee, Daniel Hershcovich, Ran Levy, Ruty Rinott, Dan Gutfreund and Noam Slonim.
Workshop on Argumentation Mining at ACL 2014.We describe a novel and unique argumentative structure dataset. This corpus consists of data extracted from hundreds of Wikipedia articles using a meticulously monitored manual annotation process. The result is 2,683 argument elements, collected in the context of 33 controversial topics, organized under a simple claim-evidence structure. The obtained data are publicly available for academic research. -
Verification of Transactional Memory in POWER8.
Alon Adir, Dave Goodman, Daniel Hershcovich, Oz Hershkovitz, Bryan Hickerson, Karen Holtz, Wisam Kadry, Anatoly Koyfman, John Ludden, Charles Meissner, Amir Nahir, Randall R Pratt, Mike Schiffli, Brett St Onge, Brian Thompto, Elena Tsanko and Avi Ziv.
DAC 2014.Transactional memory is a promising mechanism for synchronizing concurrent programs that eliminates locks at the expense of hardware complexity. Transactional memory is a hard feature to verify. First, transactions comprise several instructions that must be observed as a single global atomic operation. In addition, there are many reasons a transaction can fail. This results in a high level of non-determinism which must be tamed by the verification methodology. This paper describes the innovation that was applied to tools and methodology in pre-silicon simulation, acceleration and post-silicon in order to verify transactional memory in the IBM POWER8 processor core. -
Preprints
-
Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing.
Yong Cao, Wenyan Li, Jiaang Li, Yifei Yuan and Daniel Hershcovich.Pretrained large Vision-Language models have drawn considerable interest in recent years due to their remarkable performance. Despite considerable efforts to assess these models from diverse perspectives, the extent of visual cultural awareness in the state-of-the-art GPT-4V model remains unexplored. To tackle this gap, we extensively probed GPT-4V using the MaRVL benchmark dataset, aiming to investigate its capabilities and limitations in visual understanding with a focus on cultural aspects. Specifically, we introduced three visual related tasks, i.e. caption classification, pairwise captioning and culture tag selection, to systematically delve into fine-grained visual cultural evaluation. Experimental results indicate that GPT-4V excels at identifying cultural concepts but still exhibits weaker performance in low-resource languages, such as Tamil and Swahili. Notably, through human evaluation, GPT-4V proves to be more culturally relevant in image captioning tasks than the original MaRVL human annotations, suggesting a promising solution for future visual cultural benchmark construction. -
Revisiting Graph Meaning Representations through Decoupling Contextual Representation Learning and Structural Information Propagation.
Li Zhou, Wenyu Chen, Dingyi Zeng, Hong Qu and Daniel Hershcovich.In the field of natural language understanding, the intersection of neural models and graph meaning representations (GMRs) remains a compelling area of research. Despite the growing interest, a critical gap persists in understanding the exact influence of GMRs, particularly concerning relation extraction tasks. Addressing this, we introduce DAGNN-plus, a simple and parameter-efficient neural architecture designed to decouple contextual representation learning from structural information propagation. Coupled with various sequence encoders and GMRs, this architecture provides a foundation for systematic experimentation on two English and two Chinese datasets. Our empirical analysis utilizes four different graph formalisms and nine parsers. The results yield a nuanced understanding of GMRs, showing improvements in three out of the four datasets, particularly favoring English over Chinese due to highly accurate parsers. Interestingly, GMRs appear less effective in literary-domain datasets compared to general-domain datasets. These findings lay the groundwork for better-informed design of GMRs and parsers to improve relation classification, which is expected to tangibly impact the future trajectory of natural language understanding research. -
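The decoupling can be sketched in a few lines: representations come from a contextual encoder and structural information is then propagated over the graph without further learned transformations. The random features, toy adjacency and mean read-out below are illustrative simplifications of the actual architecture:
```python
import numpy as np

def propagate(features, adjacency, hops=3):
    """Parameter-free structural propagation over a graph, then a read-out."""
    norm = adjacency / np.maximum(adjacency.sum(axis=1, keepdims=True), 1)
    states, h = [features], features
    for _ in range(hops):
        h = norm @ h            # average over neighbours, no learned weights
        states.append(h)
    return np.mean(states, axis=0)  # combine all propagation depths

features = np.random.default_rng(0).normal(size=(4, 8))  # 4 nodes "from an encoder"
adjacency = np.array([[0, 1, 0, 0],  # toy graph-meaning-representation edges
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
print(propagate(features, adjacency).shape)  # (4, 8)
```
-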
Does injecting linguistic structure into language models lead to better alignment with brain recordings?
Mostafa Abdou, Ana Valeria Gonzalez, Mariya Toneva, Daniel Hershcovich and Anders Søgaard.Neuroscientists evaluate deep neural networks for natural language processing as possible candidate models for how language is processed in the brain. These models are often trained without explicit linguistic supervision, but have been shown to learn some linguistic structure in the absence of such supervision (Manning et al., 2020), potentially questioning the relevance of symbolic linguistic theories in modeling such cognitive processes (Warstadt and Bowman, 2020). We evaluate across two fMRI datasets whether language models align better with brain recordings, if their attention is biased by annotations from syntactic or semantic formalisms. Using structure from dependency or minimal recursion semantic annotations, we find alignments improve significantly for one of the datasets. For another dataset, we see more mixed results. We present an extensive analysis of these results. Our proposed approach enables the evaluation of more targeted hypotheses about the composition of meaning in the brain, expanding the range of possible scientific inferences a neuroscientist could make and opens up new opportunities for cross-pollination between computational neuroscience and linguistics. -
Dissertations
-
Universal Semantic Parsing with Neural Networks.
Daniel Hershcovich.
PhD dissertation, Hebrew University of Jerusalem, 2019.A major scientific effort is dedicated to natural language understanding, which aims to be able to comprehend text, reason about it and act upon it in an intelligent way. While specific use-cases or benchmarks can be solved with relatively simple systems, which either ignore word order ("bag-of-words" models) or treat it as a simple linear structure (such as the popular sequence-to-sequence framework allowing neural networks to learn tasks in an end-to-end fashion), understanding human language in general requires a hierarchical representation of meaning. Constructing this representation from text has been the goal of an extensive line of work in semantic parsing. While many semantic representation schemes have been proposed, they share many of their basic distinctions, such as between predicates (relations, states and events) and arguments (participants). This thesis focuses on a particular semantic representation scheme called Universal Conceptual Cognitive Annotation (UCCA), whose main design principles are support for all major linguistic semantic phenomena, cross-linguistic applicability, stability across translations, ease of annotation (even by those who are not experts in linguistics) and a modular architecture supporting multiple layers of semantic annotation. A fully automatic parser is presented and evaluated on multiple languages (English, French and German). The parser, titled "TUPA" (transition-based UCCA parser), is able to learn very general graph structures: directed acyclic graphs over token sequences with non-terminal nodes for complex units, where these may cover discontinuous terminal yields. This general class of graphs covers the structures annotated in UCCA, as well as other representation schemes. TUPA is implemented as a transition-based parser, whose transition system supports these structural properties. Its transition classifier is a neural network equipped with a bidirectional long short-term memory (BiLSTM) module for calculating feature representations for the input. In an extensive comparison to conversion-based methods, as well as other classifier implementations, TUPA is shown to outperform all baselines in the task of UCCA parsing in both in-domain and out-of-domain settings in three languages. The parser is subsequently applied to two other semantic representation schemes, DM and AMR, and to syntactic dependencies in the Universal Dependencies (UD) scheme. This demonstrates that the flexible parser is usable not just for UCCA parsing. Furthermore, training TUPA in a multitask setting on all of these schemes improves its UCCA parsing accuracy, by effectively learning generalizations across the different representations: a shared model is thus able to apply semantic distinctions in one task, which have been learned for another. Finally, in an empirical comparison of the content of semantic and syntactic representations, we discover several aspects of divergence, i.e., differences in the content captured by these schemes. These have a profound impact on the potential contribution of syntax to semantic parsing and on the usefulness of each of the approaches for semantic tasks in natural language processing. I see semantic parsing as a means for computers to learn language.
While different representations focus on different distinctions and do so with formally different structures, they share an overall goal, which is to support natural language processing applications, such as classifying text into categories, tagging it for linguistic properties, performing inference and reasoning and generating new text according to some constraints (e.g., machine translation). The combined datasets annotated in every representation are an invaluable resource, which, used effectively, can greatly boost our achievements in language understanding and processing. -
Tutorials
-
Cross-lingual Semantic Representation for NLP with UCCA.
Omri Abend, Dotan Dvir, Daniel Hershcovich, Jakob Prange and Nathan Schneider.
COLING 2020 tutorial.This is an introductory tutorial to UCCA (Universal Conceptual Cognitive Annotation), a cross-linguistically applicable framework for semantic representation, with corpora annotated in English, German and French and ongoing annotation in Russian and Hebrew. UCCA builds on extensive typological work and supports rapid annotation. The tutorial will provide a detailed introduction to the UCCA annotation guidelines, design philosophy and the available resources; and a comparison to other meaning representations. It will also survey the existing parsing work, including the findings of three recent shared tasks, in SemEval and CoNLL, that addressed UCCA parsing. Finally, the tutorial will present recent applications and extensions to the scheme, demonstrating its value for natural language processing in a range of languages and domains.
-
Large Language Models: Structure, Capabilities and Challenges
Mathematics and Generative Artificial Intelligence – Creation, Language and Innovation in Teaching, conference at Western Galilee College (05/2025).Large language models (LLMs) have revolutionized natural language processing, but they have fundamental limitations. They are based on statistical text completion and do not understand language the way humans do. Nevertheless, continuous improvements have turned them into powerful tools for a wide range of applications, including education, science and programming. In this talk, we survey how these models work: how they are trained, how their answers can be interpreted and in which areas they still fail. We also briefly touch on reasoning models and on attempts to improve their capacity for systematic thinking, to the point that they can already solve high-school-level mathematics olympiad problems. -
Towards Culturally Inclusive NLP
AI and Humanities Seminar, College of Computer Science and Technology, Wuhan University of Science and Technology (12/2024).The role of Natural Language Processing (NLP) in shaping human norms, behaviors and cultural interactions is rapidly evolving. Beyond aligning AI with human values, I propose a paradigm shift: leveraging AI systems to foster ethical and cultural alignment in human behavior. My research spans diverse applications, including cross-cultural recipe adaptation, culturally informed modeling of Scandinavian literature and tools for understanding global cultural nuances. This talk will showcase recent work, including a framework for specializing Large Language Models (LLMs) to simulate global opinion distributions and studies on assessing food-related cultural knowledge in LLMs. By addressing challenges such as cultural homogenization, biased representations and ethical complexities, this work aims to create NLP technologies that respect and reflect global diversity while advancing cross-cultural understanding and inclusivity. -
How can we use AI Language Models for Personalized Guidance and Nutrition?
Digital Tech Summit (10/2024).We investigate the use of Large Language Models (LLMs) to adapt and align generated recipes and dialogues with individual preferences, values and sustainability goals, utilizing AI to personalize dietary recommendations. By integrating user data—including health profiles, ethical values and environmental concerns—into sophisticated AI algorithms, the research aims to refine how LLMs dynamically generate and adjust content to promote adherence to plant-based diets. The study focuses on the efficacy of AI-driven recipe and argument adaptations, contributing insights into the fields of food science and consumer science on harnessing AI to facilitate sustainable eating behaviors. -
Reversing the Alignment Paradigm: LLMs Shaping Human Cultural Norms, Behaviors and Attitudes
Human-Centered Large Language Modeling Workshop (08/2024). -
Cultural Understanding and Adaptation with AI
Linguistics and English Language Seminar, University of Manchester (02/2024).In my talk, I address the critical intersection of language and culture within Natural Language Processing (NLP), proposing a novel framework aimed at expanding NLP's focus to not only encompass linguistic diversity but also the nuanced differences dictated by culture. This initiative is crucial for crafting NLP systems that truly reflect and cater to the global mosaic of users, considering how deeply cultural contexts influence language use and content preferences. Through a series of experimental studies and analyses, my research underscores the significant impact of cultural integration into language technologies, ranging from dialogue generation to content adaptation and the evaluation of biases in existing models. By exploring and applying both established and innovative methods for incorporating cultural insights, this work aspires to pioneer more inclusive, respectful and culturally aware NLP systems, highlighting the ethical and technical challenges involved in bridging the gap between language and culture to better serve the diverse global community. -
Cultural Awareness and Adaptation with AI
QualiTech Research Seminar, IBM Research Haifa (02/2024).This talk explores the intersection of AI and culture, shedding light on the imperative of integrating cultural understanding into Natural Language Processing (NLP). I will discuss empirical assessment of cultural alignment of Large Language Models (LLMs) as well as my two main research agendas: the development of culturally adaptive LLMs and the cross-cultural adaptation of linguistic content. By integrating cultural insights into LLMs, I aim to create technologies that are not only linguistically diverse but also culturally informed, allowing us to better bridge between people from different cultures. -
Assessing Knowledge of Cultural Values in ChatGPT
Kulturministeriets Tech-Kontor, Ministry of Culture, Denmark (09/2023). -
Recipe Adaptation with Language Models
Department of Food Science, University of Copenhagen (06/2023).As people explore cuisines from different cultures, it becomes important to adapt recipes to cater to different ingredient availability and preparation methods. In addition, individuals with dietary restrictions, due to health, religious or other reasons, require customized recipe adaptations. In ongoing research, my colleagues and I explore the application of large language models such as ChatGPT to address these issues, automatically adapting recipes according to various constraints. In one project, we use them to generate coherent recipes while adapting to cultural differences. In another, we focus on the challenge of adapting non-vegan recipes to vegan ones while preserving the flavor, cultural style and aesthetic of the dish. In this presentation, I will highlight the challenges this task raises, as high-quality adaptations require more than simple phrase substitutions – for example, the models are required to infer necessary modifications to preparation methods or introduce creative ingredient substitutions that are both appealing and interesting. This opens the way for a whole new field in Artificial Intelligence, bridging Natural Language Processing and Food Science. -
AI 101 – on what ChatGPT is(n’t). Large Language Models and their Potential and Limitations for Language Learning
Forum on AI in foreign language programmes, University of Copenhagen (05/2023). -
Cultural Adaptation with and of Language Models
Invited talk at the EACL 2023 workshop Cross-Cultural Considerations in NLP (05/2023).Large language models provide a unique opportunity to adapt content to the user's culture. However, this raises the risk of perpetuating cultural biases and homogenizing cultural diversity, as many of these models are currently centralized and trained on a limited set of languages and cultures. In this talk, I will emphasize the importance of adapting language models to various cultures to promote linguistic and cultural diversity. This involves not only adapting existing language models but also creating new models and resources that incorporate culturally diverse knowledge and perspectives. This approach will require collaboration and exchange of knowledge across linguistic and cultural boundaries and has the potential to support intercultural cross-fertilization in various fields, such as literature, anthropology and beyond. By adapting language models to reflect cultural diversity, we can enable more equitable access to information and foster greater intercultural understanding. -
Finding Meaning in Data across Languages
Sprogteknologisk Konference 2022, Centre for Language Technology, University of Copenhagen (11/2022).What is common to complex database queries and beautiful artistic masterpieces? Artificial intelligence can generate both from simple English prompts and large language models have been instrumental in these achievements. I will show how we can understand the patterns such models find in data using systematic meaning representations and how to create these representations so we can adapt language models across languages and cultures. -
Natural Language Processing for Sustainable Diets
Research Seminar at the Center for Applied Ecological Thinking (CApE). University of Copenhagen (11/2022) -
Generalization and Representation in Multilingual Semantic Parsing
Workshop on Ten Years of BabelNet and Multilingual Neurosymbolic Natural Language Understanding, Sapienza University of Rome (07/2022) -
Argument Mining for Green Nutrition
SODAS-Climate meeting, University of Copenhagen (05/2022) -
Challenges and Strategies in Cross-Cultural NLP
LIIR Journal Club, KU Leuven (04/2022) -
Cultural and Environmental Considerations in Natural Language Processing
Seminar at University of Haifa, Department of Information Systems (03/2022) -
Challenges and Strategies in Cross-Cultural NLP
NLP Workshop, IT University of Copenhagen (03/2022) -
Meaning Representation and Parsing in Natural Language Processing
Seminar at the Department of Food and Resource Economics, University of Copenhagen (05/2021) -
Meaning Representation Parsing
DIKU Bits, Department of Computer Science, University of Copenhagen (02/2020) -
Universal Meaning Representation Parsing
Seminar in Computational Linguistics, Department of Linguistics and Philology, Uppsala University (11/2019) -
Universal Semantic Parsing with Neural Networks
Georgetown University (06/2019) -
A Transition-Based Directed Acyclic Graph Parser for Universal Conceptual Cognitive Annotation
Tel Aviv University, NLP Seminar (01/2018) -
A Transition-Based Directed Acyclic Graph Parser for Universal Conceptual Cognitive Annotation
University of Washington, The Paul G. Allen Center for Computer Science & Engineering (07/2017) -
A Transition-Based Directed Acyclic Graph Parser for Universal Conceptual Cognitive Annotation
Technion, Computational Data Science Seminar (06/2017) -
Broad-Coverage Transition-Based UCCA Parsing
Hebrew University, Computer Science Learning Club (11/2016) -
Broad-Coverage Semantic Parsing: A Transition-Based Approach
The Israeli Seminar on Computational Linguistics, ISCOL 2016 (05/2016)
- NAACL 2025 Senior Area Chair.
- EMNLP 2022 Workshop Co-chair.
- HPLT & NLPL Winter School 2023 Program Co-chair.
- Co-organized ISCOL 2017, the annual meeting of the Israeli seminar on computational linguistics, as well as SemEval 2019 Task 1 on Cross-lingual Semantic Parsing with UCCA, and the CoNLL 2019 and 2020 shared tasks on Cross-Framework Meaning Representation Parsing.
- Co-presented a tutorial on Cross-lingual Semantic Representation for NLP with UCCA at COLING 2020.
- Guest editor for the Künstliche Intelligenz Special Issue on NLP and Semantics.
- Area chair for NAACL-HLT 2021, ACL-IJCNLP 2021 and *SEM 2021.
- Action editor for ACL Rolling Review since October 2021.
- Reviewer for ACL (2015, 2016, 2017, 2019, 2020, 2022, 2023), EMNLP (2017, 2018: Best Reviewer Award, 2019, 2020: Outstanding Reviewer, 2023), IJCNLP 2017, COLING (2018, 2020, 2022), NAACL-HLT 2019, *SEM (2019, 2020, 2022), DMR (2019, 2020, 2023), NoDaLiDa (2019, 2021, 2023), EurNLP 2019, AAAI 2020, IJCAI 2020, IWPT (2020, 2021), AACL-IJCNLP 2020, CoNLL (2020, 2022, 2023), TLT 2020, EACL 2021 and the journals Computational Linguistics, Computer Speech & Language and Language Resources and Evaluation.
- Visions of a Connected Future, roundtable on AI, Copenhagen Institute for Futures Studies (08/01/2023)
- Sustainable Diet Arguments on Twitter:
- Artificial intelligence 'mines' good communication about climate-friendly food choices from 30,000 tweets, Klimamonitor (13/01/2023)
- Researchers use AI to gain knowledge about sustainable diets on Twitter, CSR.dk (17/01/2023)
- Nuggets mined from thousands of tweets can persuade us to eat more climate-friendly, UCPH SCIENCE News (30/01/2023)
- Artificial intelligence can gauge the public mood: Useful for companies and politicians, but can it be misused?, Videnskab.dk (30/01/2023)
- University Of Copenhagen Researchers Demonstrate AI’s Role In Sustainable Food, India Education Diary (11/02/2023)
- ‘Argument mining’: Nuggets from thousands of tweets can convince us to eat more climate-friendly, POV International (27/04/2023)
- Cultural Bias in ChatGPT:
- Chatbot is a tool of American cultural imperialism, Politiken (10/07/2023)
- ChatGPT promotes American norms and values, Ekstra Bladet (10/07/2023)
- Study: ChatGPT spreads American norms, P1 Morgen (10/07/2023)
- ChatGPT has an American bias, TV 2 Kosmopol (10/07/2023)
- New study: Popular chatbot promotes American values and norms, Børsen (25/08/2023)
- Danish Researchers Look To AI For Sustainable Future, Grady Newsource, Georgia, USA (21/07/2023)
- Assistant professor in language models: Leaders must reduce the risks of bias in "generative AI", Tech Management, Teknologiens Mediehus (1/09/2023)
- Social influence of AI:
- Artificial intelligence can make conspiracy theorists change their minds, Videnskab.dk (27/09/2024)
- I have been living in Denmark with my wonderful family since 2019.
- I have been vegan for moral reasons since I was 15.
- I practice Kendo (Japanese fencing). I was even at the World Championships in Japan (2015) and Korea (2018).
Daniel Hershcovich
Department of Computer Science
University of Copenhagen
01.2.230
Vermundsgade 5
DK-2100 Copenhagen Ø, Denmark
d...
@di.ku.dk