Overview

Huge effort has been put into optimizing LM pretraining at massive scales in the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude. For example, Chinchilla sees 1.4 trillion words during training: well over 10,000 words for every one word a 13-year-old child has heard in their entire life.
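As a rough back-of-the-envelope check of that ratio, the sketch below assumes a 13-year-old's lifetime linguistic input is on the order of 100 million words (the budget discussed in the next section); the exact figure varies by estimate, so treat this as an illustration rather than a measurement.

```python
# Rough check of the "well over 10,000x" claim above.
# Assumption: ~100 million words of lifetime input by age 13
# (the <100M budget this workshop is organized around).

chinchilla_training_words = 1.4e12  # words Chinchilla sees during training
child_input_words = 100e6           # assumed input heard by a 13-year-old

ratio = chinchilla_training_words / child_input_words
print(f"Chinchilla sees roughly {ratio:,.0f} words for every word a child has heard")
# -> roughly 14,000, i.e. well over 10,000
```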
The goal of this workshop is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining—which is typically thought to be practical only for large industry groups—by drawing attention to open problems that can be addressed on a university budget.
Why <100 Million Words?
Focusing on scaled-down pretraining has several potential benefits:
First, small-scale pretraining can be a sandbox for developing novel techniques that improve data efficiency. These techniques can then be scaled up to the larger datasets common in applied NLP, or used to enhance current approaches to modeling low-resource languages. Second, improving our ability to train LMs on the same kinds and quantities of data that humans learn from will, we hope, give us greater access to plausible cognitive models of humans and help us understand what allows humans to acquire language so efficiently.
Organization Team
• Lucas Charpentier (LTG, University of Oslo)
• Leshem Choshen (IBM Research, MIT)
• Ryan Cotterell (ETH Zurich)
• Mustafa Omer Gul (Cornell University)
• Michael Hu (NYU)
• Jing Liu (ENS-PSL)
• Jaap Jumelet (University of Groningen)
• Tal Linzen (NYU)
• Aaron Mueller (Northeastern)
• Candace Ross (Meta AI)
• Raj Sanjay Shah (Georgia Institute of Technology)
• Alex Warstadt (UCSD)
• Ethan Wilcox (Georgetown)
• Adina Williams (Meta AI)

The BabyLM Challenge was held in 2023 and 2024 as a shared task. At the following link, you can find last year's call for papers.