Mirror of https://github.com/mii443/tokenizers.git (synced 2025-08-22 16:25:30 +00:00, commit 5b8cd00d21b0a16f15606efdb366a40de9d3294d)
# Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
## What is a Tokenizer

A Tokenizer works as a pipeline: it processes some raw text as input and finally outputs an `Encoding`.
The various steps of the pipeline are:

- The `Normalizer` is in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as `NFD` or `NFKC`.
- The `PreTokenizer` is in charge of splitting the text into relevant pieces. The most common way of splitting text is simply on whitespace, to manipulate words.
- The `Model` is in charge of doing the actual tokenization. Examples of a `Model` are `BPE` or `WordPiece`.
- The `PostProcessor` is in charge of post-processing the `Encoding`, to add anything relevant that a language model would need, such as special tokens.
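The four stages above can be sketched with a minimal toy pipeline. This is purely illustrative: the function names, the tiny vocabulary, and the `[CLS]`/`[SEP]` special tokens are assumptions for the example, and the real library implements each stage natively in Rust rather than as plain Python functions.

```python
import unicodedata

# Hypothetical toy vocabulary for the sketch.
VOCAB = {"hello": 0, "world": 1, "[UNK]": 2, "[CLS]": 3, "[SEP]": 4}

def normalize(text: str) -> str:
    # Normalizer: apply a Unicode normalization standard (here NFKC),
    # plus lowercasing as a simple extra normalization step.
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text: str) -> list[str]:
    # PreTokenizer: split on whitespace to manipulate words.
    return text.split()

def model(words: list[str]) -> list[int]:
    # Model: map each piece to an id; a real Model would run an
    # algorithm such as BPE or WordPiece instead of a dict lookup.
    return [VOCAB.get(w, VOCAB["[UNK]"]) for w in words]

def post_process(ids: list[int]) -> list[int]:
    # PostProcessor: add the special tokens a language model expects.
    return [VOCAB["[CLS]"], *ids, VOCAB["[SEP]"]]

def encode(text: str) -> list[int]:
    # The full pipeline: raw text in, token ids out.
    return post_process(model(pre_tokenize(normalize(text))))

print(encode("Hello World"))  # → [3, 0, 1, 4]
```

Each stage is independent and composable, which mirrors how the library lets you swap one `Normalizer`, `PreTokenizer`, `Model`, or `PostProcessor` for another without touching the rest of the pipeline.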
## Bindings

We provide bindings to the following languages (more to come!):

- Rust (original implementation)
- Python
- Node.js