A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.
News
02/21/2025 0.0.4: ⚡ A Tokenicer instance now dynamically inherits the native `tokenizer.__class__` of the tokenizer passed in or loaded via our `Tokenicer.load()` API. CI now tests tokenizer compatibility across 64 different models.
- Compatible with all HF Transformers-recognized tokenizers
- Auto-fixes models that do not set `padding_token`
- Auto-fixes models released with the wrong `padding_token`: many models incorrectly use `eos_token` as `pad_token`, which leads to subtle, hidden errors in post-training and inference whenever batching is used (which is almost always; see the sketch after this list)
- Zero external dependencies outside of Transformers
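To make the `eos_token`-as-`pad_token` pitfall concrete, here is a minimal sketch (PyTorch assumed; the token ids are made up for illustration). The standard training step of masking pad positions out of the loss also masks every real EOS, so the model is never trained to stop generating:

```python
# Minimal sketch, assuming PyTorch, of the classic pad_token == eos_token bug.
import torch

eos_id = 2
pad_id = eos_id  # the incorrect-but-common configuration Tokenicer auto-fixes

# Two sequences, right-padded to the same length; each ends with a real EOS.
input_ids = torch.tensor([
    [5, 6, 7, eos_id],       # full-length sequence
    [5, 6, eos_id, pad_id],  # shorter sequence plus one pad
])

labels = input_ids.clone()
labels[labels == pad_id] = -100  # intended: ignore padding in the loss
# Bug: every real EOS is now -100 as well, silently dropped from training.
print(labels)
```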
Upcoming Features:
Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but the actual encode/decode behavior is fully re-validated on model load. Inference and training engines often modify the original tokenizer, causing subtle, inaccurate output when inference is performed on a platform disjoint from the trainer.
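As an illustration only (this is not the Tokenicer API), such validation could fingerprint a tokenizer's actual encode/decode behavior over a few probe strings at train time, then recompute and compare the fingerprint at inference-load time:

```python
# Hypothetical sketch of encode/decode re-validation; names are illustrative.
import hashlib
import json

PROBES = ["Hello, world!", "多语言 test 🙂", "  leading/trailing  "]

def tokenizer_fingerprint(tokenizer) -> str:
    """Hash the tokenizer's round-trip behavior on fixed probe strings."""
    encoded = [tokenizer.encode(p) for p in PROBES]
    decoded = [tokenizer.decode(ids) for ids in encoded]
    blob = json.dumps([encoded, decoded], ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

# At training time: save tokenizer_fingerprint(tokenizer) with the model config.
# At inference time: recompute and compare before accepting the tokenizer.
# assert tokenizer_fingerprint(tokenizer) == saved_fingerprint, "tokenizer drift"
```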
Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: the arguments are 100% compatible with `AutoTokenizer`.
```python
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`
from tokenicer import Tokenicer

# Returns a `Tokenicer` instance that inherits the original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be trained and inferenced correctly with batches and masks.
# Use the new tokenizer like any normal HF PretrainedTokenizer(Fast).
print(f"pad_token: `{tokenizer.pad_token}`")
```
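As a quick follow-up check (standard HF tokenizer calls; nothing Tokenicer-specific beyond the instance above), batched encoding now produces correct attention masks, and per the News note above the instance still passes native type checks:

```python
from transformers import Qwen2TokenizerFast

# Batched encoding works correctly now that `pad_token` is fixed.
batch = tokenizer(
    ["Hello", "A longer prompt that forces padding"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # 0s mark padded positions, as they should

# The Tokenicer instance inherits the native tokenizer class.
print(isinstance(tokenizer, Qwen2TokenizerFast))  # True
```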
Citation
@misc{tokenicer,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {Toke(n)icer},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/tokenicer}},
note = {Contact: qubitium@modelcloud.ai}
}