The core of tokenizers, written in Rust.
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.
What is a Tokenizer
A Tokenizer works as a pipeline: it takes raw text as input and outputs an Encoding.
The various steps of the pipeline are:
- The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as NFD or NFKC.
- The PreTokenizer: in charge of creating initial word splits in the text. The most common way of splitting text is simply on whitespace.
- The Model: in charge of doing the actual tokenization. An example of a Model would be BPE or WordPiece.
- The PostProcessor: in charge of post-processing the Encoding to add anything relevant that, for example, a language model would need, such as special tokens.
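The four steps above can be illustrated with a toy pipeline that uses only the Rust standard library. This is a minimal sketch, not the crate's API: `ToyEncoding`, `normalize`, `pre_tokenize`, `tokenize`, `post_process`, `encode`, and the tiny vocabulary are all hypothetical names invented for illustration. Lowercasing stands in for Unicode normalization, and a whole-word lookup stands in for a real Model such as BPE or WordPiece.

```rust
use std::collections::HashMap;

/// Toy stand-in for an Encoding: the tokens and their ids (hypothetical type).
#[derive(Debug, PartialEq)]
struct ToyEncoding {
    tokens: Vec<String>,
    ids: Vec<u32>,
}

/// Normalizer step: plain lowercasing (a real normalizer might apply NFD or NFKC).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// PreTokenizer step: split the normalized text on whitespace.
fn pre_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Model step: whole-word vocabulary lookup, falling back to "[UNK]".
/// A real Model such as BPE or WordPiece would split words into subwords.
fn tokenize(words: &[&str], vocab: &HashMap<&str, u32>) -> ToyEncoding {
    let unk_id = vocab["[UNK]"];
    let mut enc = ToyEncoding { tokens: Vec::new(), ids: Vec::new() };
    for &word in words {
        match vocab.get(word) {
            Some(&id) => {
                enc.tokens.push(word.to_string());
                enc.ids.push(id);
            }
            None => {
                enc.tokens.push("[UNK]".to_string());
                enc.ids.push(unk_id);
            }
        }
    }
    enc
}

/// PostProcessor step: wrap the sequence in the special tokens [CLS] and [SEP].
fn post_process(mut enc: ToyEncoding, vocab: &HashMap<&str, u32>) -> ToyEncoding {
    enc.tokens.insert(0, "[CLS]".to_string());
    enc.ids.insert(0, vocab["[CLS]"]);
    enc.tokens.push("[SEP]".to_string());
    enc.ids.push(vocab["[SEP]"]);
    enc
}

/// A made-up five-entry vocabulary, for illustration only.
fn toy_vocab() -> HashMap<&'static str, u32> {
    HashMap::from([("[UNK]", 0), ("[CLS]", 1), ("[SEP]", 2), ("hey", 3), ("there", 4)])
}

/// Run the four steps in order.
fn encode(text: &str, vocab: &HashMap<&str, u32>) -> ToyEncoding {
    let normalized = normalize(text);
    let words = pre_tokenize(&normalized);
    let enc = tokenize(&words, vocab);
    post_process(enc, vocab)
}

fn main() {
    let vocab = toy_vocab();
    let enc = encode("Hey there friend", &vocab);
    println!("{:?}", enc.tokens); // ["[CLS]", "hey", "there", "[UNK]", "[SEP]"]
    println!("{:?}", enc.ids); // [1, 3, 4, 0, 2]
}
```

Keeping each step as a separate function mirrors the pipeline idea: any one stage can be swapped out without touching the others.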
Quick example
use tokenizers::tokenizer::{Result, Tokenizer, EncodeInput};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    // Load a BPE model from its vocabulary and merges files.
    let bpe_builder = BPE::from_files("./path/to/vocab.json", "./path/to/merges.txt")?;
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    // Wrap the model in a Tokenizer, which drives the whole pipeline.
    let mut tokenizer = Tokenizer::new(Box::new(bpe));

    // Encode a single sequence and print the resulting tokens.
    let encoding = tokenizer.encode(EncodeInput::Single("Hey there!".into()))?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
Additional information
- tokenizers is designed to leverage CPU parallelism when possible. The level of parallelism is determined
by the total number of cores/threads your CPU provides, but this can be tuned by setting the
RAYON_RS_NUM_CPUS environment variable. For example, setting RAYON_RS_NUM_CPUS=4 will allocate a maximum of 4 threads. Please note that this behavior may evolve in the future.
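The "default to the core count, override through an environment variable" behavior described above can be sketched with the standard library alone. This is only an illustration of the idea, not how tokenizers or its thread pool actually reads the variable: `resolve_threads` is a hypothetical helper, and treating missing, non-numeric, or zero values as "use the default" is an assumption.

```rust
use std::env;
use std::thread;

/// Hypothetical helper: pick a thread count from an optional override string.
/// Falls back to `default` when the override is absent, not a number, or zero
/// (the zero handling is an assumption made for this sketch).
fn resolve_threads(override_value: Option<&str>, default: usize) -> usize {
    override_value
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

fn main() {
    // Number of hardware threads the OS reports, or 1 if it cannot be determined.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // Read the variable named in the text above, if it is set.
    let requested = env::var("RAYON_RS_NUM_CPUS").ok();

    let threads = resolve_threads(requested.as_deref(), cores);
    println!("would use up to {} threads", threads);
}
```

Running this program with the variable set to 4 in its environment would report 4 threads; leaving it unset falls back to the detected core count.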
