Tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. The goal is to make it as easy as possible to construct a Tokenizer, learn a vocabulary, and then process some text either in real time or in advance.
What is a Tokenizer
A Tokenizer works as a pipeline: it takes some raw text as input and goes through multiple steps to
finally output a list of `Token`s. The various steps of the pipeline, sketched in code after this list, are:
- Some optional `Normalizer`s. An example would be a Unicode normalization step. They take a raw text `String` as input and also output a raw text `String`.
- An optional `PreTokenizer`, which takes care of splitting the text as relevant and pre-processing tokens if needed. Takes a raw text `String` as input and outputs a `Vec<String>`.
- A `Model` to do the actual tokenization. An example of `Model` would be `BPE`. Takes a `Vec<String>` as input and gives a `Vec<Token>`.
- Some optional `PostProcessor`s. These are in charge of post-processing the list of `Token`s in any relevant way. This includes truncating, adding some padding, ...
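As a minimal sketch of how these steps fit together, here is a small Rust example that builds a `Tokenizer` around a `BPE` model and runs it on some text. It assumes pre-trained `vocab.json` and `merges.txt` files (such as the ones downloaded below); exact builder and method names may differ slightly between versions of the crate.

```rust
use tokenizers::models::bpe::BPE;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Model step: a BPE model loaded from pre-trained vocab/merges files.
    let bpe = BPE::from_file("vocab.json", "merges.txt").build()?;

    // The Tokenizer wraps the Model; optional Normalizers, a PreTokenizer
    // and PostProcessors can be attached on top of it.
    let tokenizer = Tokenizer::new(bpe);

    // Run the whole pipeline on some raw text (no special tokens added).
    let encoding = tokenizer.encode("Hello, world!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```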
Try the shell
You can try a simple ByteLevel BPE Tokenizer by using the following command. This expects
vocab.json and merges.txt files, trained with ByteLevel BPE.
```bash
cd tokenizers
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cargo run --release shell --vocab gpt2-vocab.json --merges gpt2-merges.txt
```