I am a PhD student at the Department of Computer Science at Aalborg University, supervised by Johannes Bjerva (Aalborg University) and Robert Östling (Stockholm University).
I work on combining methods and findings from linguistic typology with natural language processing (NLP). In particular, I am interested in (multilingual) machine translation. These are some questions that I have been working on recently:
How does machine-generated language differ from human-written language? How can we, for example, influence lexical diversity in generated text?
I am passionate about diversity in computer science research, and I give workshops on machine translation to high school students. Over time, these workshops have been attended by more than 200 students across multiple events.
TL;DR In this work, we systematically investigate NLP research that includes claims regarding 'typological diversity'. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of language selection along several axes and find that the results vary considerably across papers.
Multilingual Gradient Word-Order Typology from Universal Dependencies
TL;DR Discrete typological categorisations may differ significantly from the continuous nature of phenomena as found in natural language corpora. In this paper, we introduce a new seed dataset of continuous-valued, rather than categorical, data that may better reflect the variability of language.
A Principled Framework for Evaluating on Typologically Diverse Languages
Esther Ploeger, Wessel Poelman, Andreas Holck Høeg-Petersen, Anders Schlichtkrull, Miryam de Lhoneux, Johannes Bjerva
TL;DR We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP.