Mirror of https://github.com/mii443/tokenizers.git (synced 2025-08-22 16:25:30 +00:00, commit 5b8cd00d21b0a16f15606efdb366a40de9d3294d)
# Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
## What is a Tokenizer

A Tokenizer works as a pipeline: it processes some raw text as input and finally outputs an `Encoding`.
The various steps of the pipeline are:

- The `Normalizer` is in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as `NFD` or `NFKC`.
- The `PreTokenizer` is in charge of splitting the text into relevant pieces. The most common way of splitting text is simply on whitespace, to manipulate words.
- The `Model` is in charge of doing the actual tokenization. Examples of a `Model` are `BPE` or `WordPiece`.
- The `PostProcessor` is in charge of post-processing the `Encoding`, to add anything relevant that a language model would need, such as special tokens.
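The four stages above can be sketched with a minimal toy pipeline. This is purely illustrative: the function names, the tiny vocabulary, and the `[CLS]`/`[SEP]` special tokens are assumptions for the example, and the real library implements each stage natively in Rust rather than as plain Python functions.

```python
import unicodedata

# Hypothetical toy vocabulary for the sketch.
VOCAB = {"hello": 0, "world": 1, "[UNK]": 2, "[CLS]": 3, "[SEP]": 4}

def normalize(text: str) -> str:
    # Normalizer: apply a Unicode normalization standard (here NFKC),
    # plus lowercasing as a simple extra normalization step.
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text: str) -> list[str]:
    # PreTokenizer: split on whitespace to manipulate words.
    return text.split()

def model(words: list[str]) -> list[int]:
    # Model: map each piece to an id; a real Model would run an
    # algorithm such as BPE or WordPiece instead of a dict lookup.
    return [VOCAB.get(w, VOCAB["[UNK]"]) for w in words]

def post_process(ids: list[int]) -> list[int]:
    # PostProcessor: add the special tokens a language model expects.
    return [VOCAB["[CLS]"], *ids, VOCAB["[SEP]"]]

def encode(text: str) -> list[int]:
    # The full pipeline: raw text in, token ids out.
    return post_process(model(pre_tokenize(normalize(text))))

print(encode("Hello World"))  # → [3, 0, 1, 4]
```

Each stage is independent and composable, which mirrors how the library lets you swap one `Normalizer`, `PreTokenizer`, `Model`, or `PostProcessor` for another without touching the rest of the pipeline.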
## Bindings

We provide bindings to the following languages (more to come!):

- Rust (original implementation)
- Python
- Node.js