tiktoken: Haskell implementation of tiktoken
This package only implements tokenization. In other words,
given an existing encoding (such as cl100k_base) you can tokenize
an input.
Downloads
- tiktoken-1.0.3.tar.gz [browse] (Cabal source package)
- Package description (revised from the package)
Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.
| Versions [RSS] | 1.0.0, 1.0.1, 1.0.2, 1.0.3 |
|---|---|
| Change log | CHANGELOG.md |
| Dependencies | base (>=4.15.0.0 && <5), base64 (>=1.0 && <1.1), bytestring (>=0.11.3.0), containers (>=0.5.0.0), deepseq (>=1.4.0.0), filepath, megaparsec (<9.8), pcre-light (>=0.2), raw-strings-qq, text, unordered-containers [details] |
| License | BSD-3-Clause |
| Author | Gabriella Gonzalez |
| Maintainer | GenuineGabriella@gmail.com |
| Uploaded | by GabrielGonzalez at 2024-09-02T21:19:08Z |
| Revised | Revision 1 made by GabrielGonzalez at 2025-06-26T02:43:26Z |
| Distributions | NixOS:1.0.3 |
| Downloads | 244 total (24 in the last 30 days) |
| Rating | (no votes yet) [estimated by Bayesian average] |
| Status | Docs available [build log]; last success reported on 2024-09-02 [all 1 reports] |
Readme for tiktoken-1.0.3
tiktoken
This is a Haskell implementation of
tiktoken, but just the tokenization
logic. In other words, given an existing encoding (like cl100k_base) you
can tokenize a string (into smaller strings or token ranks).
This means that you can't (yet) use this package to create your own new encodings, but you can use it to consume existing encodings. In particular, this comes in handy for prompt engineering, where you want to use as many of the available prompt tokens as possible (which requires accurately counting tokens).
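For example, counting the tokens in a prompt is a one-liner once you have an encoding in hand. The following is a minimal sketch; it assumes the `Tiktoken` module exports `cl100k_base :: Encoding` and a `toRanks :: Encoding -> ByteString -> Maybe [Rank]` function, so double-check the exact names and types against this package's Haddocks:

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Data.ByteString (ByteString)

-- Assumed exports; verify against the Haddocks for this package.
import Tiktoken (cl100k_base, toRanks)

-- Count how many cl100k_base tokens a prompt consumes, e.g. to check
-- that it fits within a model's context window.  Tokenization is
-- assumed to return `Nothing` when the input cannot be encoded.
countTokens :: ByteString -> Maybe Int
countTokens prompt = length <$> toRanks cl100k_base prompt

main :: IO ()
main = print (countTokens "The quick brown fox jumps over the lazy dog")
```

Using the `Foldable`-polymorphic `length` keeps the sketch agnostic to whether ranks come back as a list or some other container.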
Encoding speed is ≈2.6–3.1 MB/s on an M1 MacBook Pro (using only one core, since this package does not yet support parallel tokenization):
```
All
  Encode 10 MB of Wikipedia
    r50k_base:   OK (23.88s)
      3.356 s ± 151 ms
    p50k_base:   OK (10.39s)
      3.445 s ±  31 ms
    p50k_edit:   OK (11.13s)
      3.693 s ± 240 ms
    cl100k_base: OK (11.16s)
      3.685 s ± 143 ms
    o200k_base:  OK (11.01s)
      3.648 s ± 134 ms
```
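As noted above, this package can consume encodings even though it can't create them. If you have a `.tiktoken` file for an encoding that isn't bundled with the package, a sketch along these lines would load and use it. The `tiktokenToEncoding` and `toTokens` names and signatures are assumptions here, as is the file name, so verify them against the Haddocks:

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main where

import qualified Data.ByteString as ByteString

-- Assumed exports; verify against the Haddocks for this package.
import Tiktoken (tiktokenToEncoding, toTokens)

main :: IO ()
main = do
    -- Hypothetical path to a custom encoding file
    bytes <- ByteString.readFile "custom.tiktoken"

    case tiktokenToEncoding bytes of
        Nothing -> fail "failed to parse the .tiktoken file"
        Just encoding ->
            -- Tokenize into the byte sequence for each token
            print (toTokens encoding "Hello, world!")
```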