Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

What is a Tokenizer

A Tokenizer works as a pipeline: it takes some raw text as input and outputs an Encoding. The steps of the pipeline are listed below, followed by a short usage sketch:

  1. The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as NFD or NFKC.
  2. The PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
  3. The Model: in charge of doing the actual tokenization. Examples of a Model are BPE or WordPiece.
  4. The PostProcessor: in charge of post-processing the Encoding to add anything relevant that a language model would need, such as special tokens.

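The sketch below shows one way these four steps can be assembled with the Python bindings. It is a minimal example, assuming the class names of the current `tokenizers` Python package (`BPE`, `NFD`, `Whitespace`, `BpeTrainer`, `TemplateProcessing`); the training file path is a placeholder.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFD
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# 3. The Model does the actual tokenization (here: BPE).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# 1. The Normalizer applies Unicode normalization (here: NFD).
tokenizer.normalizer = NFD()

# 2. The PreTokenizer creates the initial word splits (here: on whitespace).
tokenizer.pre_tokenizer = Whitespace()

# Train the BPE model on local text files (the path is a placeholder).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["data/corpus.txt"], trainer=trainer)

# 4. The PostProcessor wraps each sequence with special tokens.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Encoding raw text returns an Encoding with tokens, ids, offsets, etc.
encoding = tokenizer.encode("Hello, y'all!")
print(encoding.tokens)
```

The same pipeline can be assembled from the other bindings; only the surface syntax changes.
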
Bindings

We provide bindings to the following languages (more to come!):

  - Rust (the original implementation)
  - Python
  - Node.js
