`tokenizer` is part of an ambitious goal (together with `transformer` and `gotch`) to bring more AI/deep-learning tools to Gophers so that they can stick to the language they love and build faster software in production.
## Features
`tokenizer` is built from modules located in sub-packages, which compose into a processing pipeline (sketched right after this list):

- Normalizer
- Pretokenizer
- Tokenizer
- Post-processing
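To make the division of labor concrete, here is a minimal, library-independent sketch of what each stage does. The function names are illustrative stand-ins, not the package's actual API; see the sub-packages for the real types.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize cleans the raw string, e.g. lowercasing as a BERT-style
// normalizer would.
func normalize(s string) string {
	return strings.ToLower(s)
}

// preTokenize splits the normalized string into word-level pieces,
// here simply by whitespace.
func preTokenize(s string) []string {
	return strings.Fields(s)
}

// modelTokenize is where a real model (BPE, WordPiece, ...) would split
// words into subword tokens; this stand-in passes words through unchanged.
func modelTokenize(words []string) []string {
	return words
}

// postProcess adds special tokens, as a BERT post-processor would.
func postProcess(tokens []string) []string {
	return append(append([]string{"[CLS]"}, tokens...), "[SEP]")
}

func main() {
	tokens := postProcess(modelTokenize(preTokenize(normalize("The Gophers craft code."))))
	fmt.Println(tokens) // [[CLS] the gophers craft code. [SEP]]
}
```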
It implements several tokenizer models:

- Word-level model
- WordPiece model
- Byte Pair Encoding (BPE) (see the toy merge sketch after this list)
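For a flavor of how BPE works, the toy sketch below greedily applies a hand-written merge table, always merging the adjacent pair with the lowest rank first. The real model in this package learns its merge table from a corpus; the table and symbols here are purely illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// applyBPE repeatedly merges the adjacent symbol pair with the best
// (lowest) rank until no pair in the merge table remains.
func applyBPE(symbols []string, merges map[[2]string]int) []string {
	for {
		best, bestRank := -1, math.MaxInt
		for i := 0; i+1 < len(symbols); i++ {
			if r, ok := merges[[2]string{symbols[i], symbols[i+1]}]; ok && r < bestRank {
				best, bestRank = i, r
			}
		}
		if best < 0 {
			return symbols
		}
		// Fuse the winning pair into one symbol and drop its right half.
		symbols[best] += symbols[best+1]
		symbols = append(symbols[:best+1], symbols[best+2:]...)
	}
}

func main() {
	// A hand-written merge table; rank 0 merges first.
	merges := map[[2]string]int{
		{"g", "o"}:     0,
		{"go", "p"}:    1,
		{"h", "e"}:     2,
		{"gop", "he"}:  3,
		{"gophe", "r"}: 4,
	}
	fmt.Println(applyBPE([]string{"g", "o", "p", "h", "e", "r"}, merges))
	// Output: [gopher]
}
```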
It can be used both for training new models from scratch and for fine-tuning existing models. See the examples for details.
## Basic example
This package can load pretrained models from Hugging Face. Some of them can be loaded using the `pretrained` sub-package.
```go
package main

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer"
	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	// Download and cache a pretrained tokenizer. Here `bert-base-uncased`
	// from Hugging Face; it can be any model with a `tokenizer.json`
	// available, e.g. `tiiuae/falcon-7b`.
	configFile, err := tokenizer.CachedPath("bert-base-uncased", "tokenizer.json")
	if err != nil {
		panic(err)
	}

	tk, err := pretrained.FromFile(configFile)
	if err != nil {
		panic(err)
	}

	sentence := `The Gophers craft code using [MASK] language.`

	en, err := tk.EncodeSingle(sentence)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("tokens: %q\n", en.Tokens)
	fmt.Printf("offsets: %v\n", en.Offsets)

	// Output:
	// tokens: ["the" "go" "##pher" "##s" "craft" "code" "using" "[MASK]" "language" "."]
	// offsets: [[0 3] [4 6] [6 10] [10 11] [12 17] [18 22] [23 28] [29 35] [36 44] [44 45]]
}
```
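Since each offset pair indexes into the original sentence, a token's source text can be recovered by slicing. A small follow-on to the example above, assuming (as the printed output suggests) that `en.Offsets[i]` holds `[start, end)` byte positions:

```go
	// Recover the raw text behind the 8th token, "[MASK]", from its offsets.
	// Assumes en.Offsets[i] is a [start, end) byte-position pair, as the
	// output above suggests.
	span := en.Offsets[7]
	fmt.Println(sentence[span[0]:span[1]]) // [MASK]
```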
All models can also be loaded manually from local files. See pkg.go.dev for the detailed API.