Tokenization is an oft-neglected part of natural language processing. With the recent blow-up of interest in language models, it might be good to step back and really get into the guts of what tokenization is. This repo is meant to serve as a deep dive into different aspects of tokenization. It's been organized as bite-size chapters for easy navigation, with some code samples and (poorly designed) walkthrough notebooks. This is NOT meant to be a complete reference in itself; it's meant to accompany other excellent resources like HuggingFace's NLP course. The following topics are covered:
Intro: A quick introduction on tokens and the different tokenization algorithms out there.
BPE: A closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model (a tiny illustrative sketch follows this list).
🤗 Tokenizer: The internals of HuggingFace tokenizers! We look at state (what's saved by a tokenizer), data structures (how it stores what it saves), and methods (what functionality you get). We also implement a minimal <200-line version of the 🤗 Tokenizer in Python for GPT-2.
Challenges with Tokenization: Challenges with integer tokenization, tokenization for non-English languages, and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta.
Puzzles: Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc.
PostProcessing and more: A look at special tokens and post-processing, glitch tokens, and why you might want to shrink your tokenizer.
Galactica: Thinking about tokenizer design by diving into the Galactica paper.
Chat templates: Some tokenization tips and tricks while dealing with chat-templating for chat models.
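As a small taste of the BPE chapter, here is a minimal, illustrative sketch of a single BPE training step (not the repo's implementation; the toy corpus and helper names are made up for this example): count the frequency of every adjacent symbol pair in a word-frequency table, then merge the most frequent pair into a new symbol. Real training repeats this until a target vocabulary size is reached.

```python
# Illustrative sketch of one BPE merge step (not the repo's implementation).
from collections import Counter

# Toy corpus: each word is a tuple of symbols (characters to start), with a count.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Rewrite every word, fusing occurrences of `pair` into a single symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)   # most frequent adjacent pair, e.g. ("l", "o")
corpus = merge_pair(corpus, pair)   # words now contain the merged symbol "lo"
print(pair, corpus)
```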
Requirements
To run the notebooks in the repo, you need only two libraries, transformers and tiktoken:
pip install transformers tiktoken
Code has been tested with transformers==4.35.0 and tiktoken==0.5.1.
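As a quick sanity check that both libraries are installed correctly (a minimal snippet, not one of the repo's notebooks), you can tokenize the same string with each; both expose GPT-2's BPE vocabulary:

```python
# Quick install check (not part of the repo's notebooks): tokenize the same
# text with the 🤗 transformers GPT-2 tokenizer and tiktoken's GPT-2 encoding.
from transformers import AutoTokenizer
import tiktoken

hf_tok = AutoTokenizer.from_pretrained("gpt2")
tt_enc = tiktoken.get_encoding("gpt2")

text = "Tokenization is an oft-neglected part of NLP."
print(hf_tok.encode(text))   # token IDs from the 🤗 tokenizer
print(tt_enc.encode(text))   # token IDs from tiktoken
```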
Recommended Prerequisites
A basic understanding of language models and tokenization is a must: