Tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. The goal is to make it as easy as possible to construct a Tokenizer, learn a vocabulary, and then process some text either in real time or in advance.
What is a Tokenizer
A Tokenizer works as a pipeline: it takes some raw text as input and goes through multiple steps to
finally output a list of `Token`s. The various steps of the pipeline, sketched in code after this list, are:
- Some optional `Normalizer`s. An example would be a Unicode normalization step. They take a raw text `String` as input and also output a raw text `String`.
- An optional `PreTokenizer`, which takes care of splitting the text as relevant and pre-processing tokens if needed. Takes a raw text `String` as input and outputs a `Vec<String>`.
- A `Model` to do the actual tokenization. An example of `Model` would be `BPE`. Takes a `Vec<String>` as input and gives a `Vec<Token>`.
- Some optional `PostProcessor`s. These are in charge of post-processing the list of `Token`s in any relevant way. This includes truncating, adding some padding, ...
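As a minimal sketch of how these steps fit together, here is a small Rust example that builds a `Tokenizer` around a `BPE` model and runs it on some text. It assumes pre-trained `vocab.json` and `merges.txt` files (such as the ones downloaded below); exact builder and method names may differ slightly between versions of the crate.

```rust
use tokenizers::models::bpe::BPE;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Model step: a BPE model loaded from pre-trained vocab/merges files.
    let bpe = BPE::from_file("vocab.json", "merges.txt").build()?;

    // The Tokenizer wraps the Model; optional Normalizers, a PreTokenizer
    // and PostProcessors can be attached on top of it.
    let tokenizer = Tokenizer::new(bpe);

    // Run the whole pipeline on some raw text (no special tokens added).
    let encoding = tokenizer.encode("Hello, world!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```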
Try the shell
You can try a simple ByteLevel BPE Tokenizer by using the following command. This expects
vocab.json and merges.txt files, trained with ByteLevel BPE.
```bash
cd tokenizers
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cargo run --release shell --vocab gpt2-vocab.json --merges gpt2-merges.txt
```