Tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. The goal is to make it as easy as possible to construct a Tokenizer, learn a vocabulary, and then process some text, either in real time or in advance.
What is a Tokenizer
A Tokenizer works as a pipeline: it takes some raw text as input and goes through multiple steps to finally output a list of `Token`s. The various steps of the pipeline are listed below (a sketch of the data flow follows the list):

- Some optional `Normalizer`s. An example would be a Unicode normalization step. They take some raw text as input, and also output raw text (`String`).
- An optional `PreTokenizer`, which should take some raw text, take care of splitting it as relevant, and pre-process tokens if needed. Takes a raw text `String` as input, and outputs a `Vec<String>`.
- A `Model` to do the actual tokenization. An example of `Model` would be `BPE`. Takes a `Vec<String>` as input, and gives a `Vec<Token>`.
- Some optional `PostProcessor`s. These are in charge of post-processing the list of `Token`s in any relevant way. This includes truncating, adding some padding, ...
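
To make the data flow concrete, here is a minimal sketch of how these stages chain together. The trait and struct names below are hypothetical and only mirror the input/output types described above; they are not the crate's actual definitions.

```rust
// Hypothetical trait names that only mirror the pipeline described above;
// they are not the definitions used by this crate.
pub struct Token {
    pub id: u32,
    pub value: String,
    pub offsets: (usize, usize),
}

pub trait Normalizer {
    /// Raw text in, raw text out (e.g. Unicode normalization).
    fn normalize(&self, text: String) -> String;
}

pub trait PreTokenizer {
    /// Raw text in, pre-tokenized words out.
    fn pre_tokenize(&self, text: String) -> Vec<String>;
}

pub trait Model {
    /// Pre-tokenized words in, Tokens out (e.g. BPE).
    fn tokenize(&self, words: Vec<String>) -> Vec<Token>;
}

pub trait PostProcessor {
    /// Tokens in, Tokens out (truncation, padding, special tokens, ...).
    fn process(&self, tokens: Vec<Token>) -> Vec<Token>;
}

/// Runs the full pipeline: each stage consumes the previous stage's output.
pub fn run_pipeline(
    normalizers: &[&dyn Normalizer],
    pre_tokenizer: Option<&dyn PreTokenizer>,
    model: &dyn Model,
    post_processors: &[&dyn PostProcessor],
    text: String,
) -> Vec<Token> {
    let normalized = normalizers.iter().fold(text, |t, n| n.normalize(t));
    let words = match pre_tokenizer {
        Some(p) => p.pre_tokenize(normalized),
        None => vec![normalized],
    };
    let tokens = model.tokenize(words);
    post_processors.iter().fold(tokens, |t, p| p.process(t))
}
```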
Try the shell
You can try a simple ByteLevel BPE Tokenizer by using the following commands. This expects `vocab.json` and `merges.txt` files, trained with ByteLevel BPE.
```bash
cd tokenizers
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
cargo run --release shell --vocab gpt2-vocab.json --merges gpt2-merges.txt
```
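
If you prefer to call the library directly rather than the shell, the snippet below is a rough sketch of loading the same files from Rust. It assumes a recent release of the crate; names such as `BPE::from_file` and `with_pre_tokenizer` exist, but their exact signatures have changed between versions, so adapt it to the version you are using.

```rust
// Rough sketch, assuming a recent release of the `tokenizers` crate;
// signatures may differ on older versions.
use tokenizers::models::bpe::BPE;
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Build the BPE model from the downloaded vocabulary and merges files.
    let bpe = BPE::from_file("gpt2-vocab.json", "gpt2-merges.txt").build()?;
    let mut tokenizer = Tokenizer::new(bpe);

    // GPT-2 style ByteLevel pre-tokenization (on some crate versions this
    // setter takes the value directly instead of an Option).
    tokenizer.with_pre_tokenizer(Some(ByteLevel::default()));

    let encoding = tokenizer.encode("Hey there, how are you?", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
```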