A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.
News
02/21/2025 0.0.4: ⚡ A Tokenicer instance now dynamically inherits the native `tokenizer.__class__` of the tokenizer passed in or loaded via our `Tokenicer.load()` API. CI now tests tokenizer compatibility across 64 different models.
- Compatible with all HF Transformers-recognized tokenizers
- Auto-fixes models that do not set `padding_token`
- Auto-fixes models released with the wrong `padding_token`: many models incorrectly use `eos_token` as `pad_token`, which leads to subtle, hidden errors in post-training and inference whenever batching is used (which is almost always; see the sketch after this list)
- Zero external dependencies outside of Transformers
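To make the `eos_token`-as-`pad_token` pitfall concrete, here is a minimal sketch (PyTorch assumed; the token ids are made up for illustration). The standard training step of masking pad positions out of the loss also masks every real EOS, so the model is never trained to stop generating:

```python
# Minimal sketch, assuming PyTorch, of the classic pad_token == eos_token bug.
import torch

eos_id = 2
pad_id = eos_id  # the incorrect-but-common configuration Tokenicer auto-fixes

# Two sequences, right-padded to the same length; each ends with a real EOS.
input_ids = torch.tensor([
    [5, 6, 7, eos_id],       # full-length sequence
    [5, 6, eos_id, pad_id],  # shorter sequence plus one pad
])

labels = input_ids.clone()
labels[labels == pad_id] = -100  # intended: ignore padding in the loss
# Bug: every real EOS is now -100 as well, silently dropped from training.
print(labels)
```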
Upcoming Features:
Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but the actual encode/decode behavior is fully re-validated on model load. Inference and training engines often modify the original tokenizer, causing subtle, inaccurate output when inference is performed on a platform disjoint from the trainer.
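As an illustration only (this is not the Tokenicer API), such validation could fingerprint a tokenizer's actual encode/decode behavior over a few probe strings at train time, then recompute and compare the fingerprint at inference-load time:

```python
# Hypothetical sketch of encode/decode re-validation; names are illustrative.
import hashlib
import json

PROBES = ["Hello, world!", "多语言 test 🙂", "  leading/trailing  "]

def tokenizer_fingerprint(tokenizer) -> str:
    """Hash the tokenizer's round-trip behavior on fixed probe strings."""
    encoded = [tokenizer.encode(p) for p in PROBES]
    decoded = [tokenizer.decode(ids) for ids in encoded]
    blob = json.dumps([encoded, decoded], ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

# At training time: save tokenizer_fingerprint(tokenizer) with the model config.
# At inference time: recompute and compare before accepting the tokenizer.
# assert tokenizer_fingerprint(tokenizer) == saved_fingerprint, "tokenizer drift"
```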
Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: the arguments are 100% compatible with `AutoTokenizer`.
```python
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`
from tokenicer import Tokenicer

# Returns a `Tokenicer` instance that inherits the original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be trained and inferenced correctly with batches and masks.
# Use the new tokenizer like any normal HF PretrainedTokenizer(Fast).
print(f"pad_token: `{tokenizer.pad_token}`")
```
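As a quick follow-up check (standard HF tokenizer calls; nothing Tokenicer-specific beyond the instance above), batched encoding now produces correct attention masks, and per the News note above the instance still passes native type checks:

```python
from transformers import Qwen2TokenizerFast

# Batched encoding works correctly now that `pad_token` is fixed.
batch = tokenizer(
    ["Hello", "A longer prompt that forces padding"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # 0s mark padded positions, as they should

# The Tokenicer instance inherits the native tokenizer class.
print(isinstance(tokenizer, Qwen2TokenizerFast))  # True
```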
Citation
@misc{tokenicer,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {Toke(n)icer},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/tokenicer}},
note = {Contact: qubitium@modelcloud.ai}
}