Tokenization is an oft-neglected part of natural language processing. With the recent blow-up of interest in language models, it might be good to step back and really get into the guts of what tokenization is. This repo is meant to serve as a deep dive into different aspects of tokenization. It's been organized as bite-size chapters for easy navigation, with some code samples and (poorly designed) walkthrough notebooks. This is NOT meant to be a complete reference in itself; it's meant to accompany other excellent resources like HuggingFace's NLP course. The following topics are covered:
Intro: A quick introduction on tokens and the different tokenization algorithms out there.
BPE: A closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model (a tiny illustrative sketch follows this list).
🤗 Tokenizer: The internals of HuggingFace tokenizers! We look at state (what's saved by a tokenizer), data structures (how it stores what it saves), and methods (what functionality you get). We also implement a minimal <200-line version of the 🤗 Tokenizer in Python for GPT-2.
Challenges with Tokenization: Challenges with integer tokenization, tokenization for non-English languages, and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta.
Puzzles: Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc.
PostProcessing and more: A look at special tokens and post-processing, glitch tokens, and why you might want to shrink your tokenizer.
Galactica: Thinking about tokenizer design by diving into the Galactica paper.
Chat templates: Some tokenization tips and tricks while dealing with chat-templating for chat models.
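As a small taste of the BPE chapter, here is a minimal, illustrative sketch of a single BPE training step (not the repo's implementation; the toy corpus and helper names are made up for this example): count the frequency of every adjacent symbol pair in a word-frequency table, then merge the most frequent pair into a new symbol. Real training repeats this until a target vocabulary size is reached.

```python
# Illustrative sketch of one BPE merge step (not the repo's implementation).
from collections import Counter

# Toy corpus: each word is a tuple of symbols (characters to start), with a count.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Rewrite every word, fusing occurrences of `pair` into a single symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)   # most frequent adjacent pair, e.g. ("l", "o")
corpus = merge_pair(corpus, pair)   # words now contain the merged symbol "lo"
print(pair, corpus)
```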
Requirements
To run the notebooks in the repo, you need only two libraries, transformers and tiktoken:
pip install transformers tiktoken
Code has been tested with transformers==4.35.0 and tiktoken==0.5.1.
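As a quick sanity check that both libraries are installed correctly (a minimal snippet, not one of the repo's notebooks), you can tokenize the same string with each; both expose GPT-2's BPE vocabulary:

```python
# Quick install check (not part of the repo's notebooks): tokenize the same
# text with the 🤗 transformers GPT-2 tokenizer and tiktoken's GPT-2 encoding.
from transformers import AutoTokenizer
import tiktoken

hf_tok = AutoTokenizer.from_pretrained("gpt2")
tt_enc = tiktoken.get_encoding("gpt2")

text = "Tokenization is an oft-neglected part of NLP."
print(hf_tok.encode(text))   # token IDs from the 🤗 tokenizer
print(tt_enc.encode(text))   # token IDs from tiktoken
```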
Recommended Prerequisites
A basic understanding of language models and tokenization is a must: