Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

What is a Tokenizer

A Tokenizer works as a pipeline: it takes some raw text as input and outputs an Encoding. The steps of the pipeline are listed below, followed by a short usage sketch:

  1. The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as NFD or NFKC.
  2. The PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
  3. The Model: in charge of doing the actual tokenization. Examples of a Model are BPE or WordPiece.
  4. The PostProcessor: in charge of post-processing the Encoding to add anything relevant that a language model would need, such as special tokens.

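The sketch below shows one way these four steps can be assembled with the Python bindings. It is a minimal example, assuming the class names of the current `tokenizers` Python package (`BPE`, `NFD`, `Whitespace`, `BpeTrainer`, `TemplateProcessing`); the training file path is a placeholder.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFD
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# 3. The Model does the actual tokenization (here: BPE).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# 1. The Normalizer applies Unicode normalization (here: NFD).
tokenizer.normalizer = NFD()

# 2. The PreTokenizer creates the initial word splits (here: on whitespace).
tokenizer.pre_tokenizer = Whitespace()

# Train the BPE model on local text files (the path is a placeholder).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["data/corpus.txt"], trainer=trainer)

# 4. The PostProcessor wraps each sequence with special tokens.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Encoding raw text returns an Encoding with tokens, ids, offsets, etc.
encoding = tokenizer.encode("Hello, y'all!")
print(encoding.tokens)
```

The same pipeline can be assembled from the other bindings; only the surface syntax changes.
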
Bindings

We provide bindings to the following languages (more to come!):

  - Rust (the original implementation)
  - Python
  - Node.js
