tokenizers

mii/tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-09 22:28:29 +00:00

Pierric Cistac 41fee6de3d rust: derive Copy for PaddingDirection

2020-02-05 14:44:07 -05:00

File            Last commit                                  Date
benches         refactor and rename benchmarks               2020-01-03 15:16:44 -08:00
src             rust: derive Copy for PaddingDirection       2020-02-05 14:44:07 -05:00
Cargo.toml      Ignore rust-toolchain when publishing        2020-02-05 14:12:28 -05:00
Makefile        improve Makefile                             2020-01-08 10:06:42 -08:00
README.md       Fix indentation in README for consistency    2020-02-05 14:15:25 -05:00
rust-toolchain  Add rust-toolchain                           2020-02-05 14:10:46 -05:00

README.md

The core of tokenizers, written in Rust. It provides implementations of today's most widely used tokenizers, with a focus on performance and versatility.

What is a Tokenizer

A Tokenizer works as a pipeline: it takes raw text as input and outputs an Encoding. The steps of the pipeline are:

The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization forms, such as NFD or NFKC.
The PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
The Model: in charge of doing the actual tokenization. An example of a Model would be BPE or WordPiece.
The PostProcessor: in charge of post-processing the Encoding to add anything relevant that, for example, a language model would need, such as special tokens.
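
To make the four steps above concrete, here is a minimal, standard-library-only sketch of the same pipeline shape. It does not use the tokenizers API: `ToyEncoding`, `toy_vocab`, `normalize`, `pre_tokenize`, `model`, `post_process` and `encode` are hypothetical names invented for illustration. Lowercasing stands in for Unicode normalization, and a whole-word vocabulary lookup stands in for a real Model such as BPE or WordPiece.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the library's Encoding: only tokens and ids.
#[derive(Debug, PartialEq)]
struct ToyEncoding {
    tokens: Vec<String>,
    ids: Vec<u32>,
}

/// Toy vocabulary (an assumption for this sketch, not a real vocab file).
fn toy_vocab() -> HashMap<&'static str, u32> {
    let mut vocab = HashMap::new();
    vocab.insert("[UNK]", 0);
    vocab.insert("[CLS]", 1);
    vocab.insert("[SEP]", 2);
    vocab.insert("hey", 3);
    vocab.insert("there", 4);
    vocab
}

/// Look up a token id, falling back to 0 (the toy `[UNK]` id) if absent.
fn id_of(vocab: &HashMap<&str, u32>, token: &str) -> u32 {
    vocab.get(token).copied().unwrap_or(0)
}

/// Normalizer step: lowercasing as a stand-in for NFD/NFKC normalization.
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// PreTokenizer step: split the normalized text on whitespace.
fn pre_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Model step: map each word to a token and id, using `[UNK]` for unknown words.
/// A real BPE or WordPiece model would split words into subword units instead.
fn model(words: &[&str], vocab: &HashMap<&str, u32>) -> ToyEncoding {
    let mut tokens = Vec::new();
    let mut ids = Vec::new();
    for word in words {
        let token = if vocab.contains_key(*word) { *word } else { "[UNK]" };
        tokens.push(token.to_string());
        ids.push(id_of(vocab, token));
    }
    ToyEncoding { tokens, ids }
}

/// PostProcessor step: wrap the sequence in `[CLS]` ... `[SEP]` special tokens.
fn post_process(mut enc: ToyEncoding, vocab: &HashMap<&str, u32>) -> ToyEncoding {
    enc.tokens.insert(0, "[CLS]".to_string());
    enc.ids.insert(0, id_of(vocab, "[CLS]"));
    enc.tokens.push("[SEP]".to_string());
    enc.ids.push(id_of(vocab, "[SEP]"));
    enc
}

/// Run the four steps in order.
fn encode(text: &str, vocab: &HashMap<&str, u32>) -> ToyEncoding {
    let normalized = normalize(text);
    let words = pre_tokenize(&normalized);
    post_process(model(&words, vocab), vocab)
}

fn main() {
    let vocab = toy_vocab();
    let encoding = encode("Hey there friend", &vocab);
    println!("{:?}", encoding.tokens);
    println!("{:?}", encoding.ids);
}
```

In this sketch, "friend" is not in the toy vocabulary, so it maps to `[UNK]`. Keeping each step a separate function mirrors the idea that each stage of the pipeline can be swapped independently.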

Quick example

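use tokenizers::tokenizer::{Result, Tokenizer, EncodeInput};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    // Load the vocabulary and merge rules; this yields a builder for the BPE model.
    let bpe_builder = BPE::from_files("./path/to/vocab.json", "./path/to/merges.txt")?;
    // Configure dropout and the unknown token, then build the model.
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    // Wrap the model in a Tokenizer.
    let mut tokenizer = Tokenizer::new(Box::new(bpe));

    // Encode a single sequence and print the resulting tokens.
    let encoding = tokenizer.encode(EncodeInput::Single("Hey there!".into()))?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}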

Additional information

tokenizers is designed to leverage CPU parallelism when possible. By default, the level of parallelism is determined by the total number of cores/threads your CPU provides, but this can be tuned by setting the RAYON_RS_NUM_CPUS environment variable. For example, setting RAYON_RS_NUM_CPUS=4 will allocate a maximum of 4 threads. Please note that this behavior may evolve in the future.
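
The override described above can be sketched as follows. This is an illustration of the idea only, not how the library or rayon resolves the value internally: `resolve_thread_cap` is a hypothetical helper, and falling back to the CPU count when the variable is missing, zero, or unparseable is an assumption of this sketch.

```rust
use std::env;
use std::thread;

/// Hypothetical helper: choose a thread cap from an optional environment
/// value, falling back to the number of threads the CPU provides.
/// Treating zero or unparseable values as "unset" is an assumption.
fn resolve_thread_cap(env_value: Option<&str>, available: usize) -> usize {
    match env_value.and_then(|v| v.trim().parse::<usize>().ok()) {
        Some(n) if n > 0 => n,
        _ => available,
    }
}

fn main() {
    // Number of threads the CPU provides, defaulting to 1 if it cannot be queried.
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    // Read the variable named in this README; it may be absent.
    let from_env = env::var("RAYON_RS_NUM_CPUS").ok();

    let cap = resolve_thread_cap(from_env.as_deref(), available);
    println!("thread cap: {}", cap);
}
```

Running the program with RAYON_RS_NUM_CPUS=4 set in the environment would make this sketch report a cap of 4. Without the variable, it reports the detected CPU thread count.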