Started an internship at Ai2 in Seattle over the summer to work with Luca Soldaini and Valentin Hofmann!
Nov 29, 2024: Gave a talk on «The Past, Present and Future of Tokenization» at the NLIP Seminar in Cambridge. The talk was based on an invited lecture at the University of Göttingen in early November. Slides.
Nov 27, 2024: Attended the ELLIS NLP Workshop at Dagstuhl. Some nice photos.
Sep 25, 2024: Zero-Shot Tokenizer Transfer was accepted at NeurIPS 2024. See you in Vancouver!
Jul 24, 2024: I presented Zero-Shot Tokenizer Transfer at Google DeepMind and Mozilla. Slides.
Selected Publications
Cross-Tokenizer Distillation via Approximate Likelihood Matching
Benjamin Minixhofer, Ivan Vulić, and Edoardo Maria Ponti
In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Dec 2025
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models
Benjamin Minixhofer, Jonas Pfeiffer, and Ivan Vulić
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
While many languages possess processes of joining two or more words to create compound words, previous studies have typically been limited to languages with excessively productive compound formation (e.g., German, Dutch), and there is no public dataset containing compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, at a wide scale. We first address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We then use this dataset to evaluate an array of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words which are tokenized unfavorably by subword tokenization. We thus introduce a novel methodology to train dedicated models for decompounding. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised learning stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy on average. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to leverage decompounding during the creation of a subword tokenizer, which we refer to as CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved performance on decompounding over an otherwise equivalent model using SentencePiece tokenization.
Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation
Benjamin Minixhofer, Jonas Pfeiffer, and Ivan Vulić
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023
WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models
Benjamin Minixhofer, Fabian Paischer, and Navid Rekabsaz
In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jul 2022