
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
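
To make the alignment-tracking and pre-processing features more concrete, here is a minimal pure-Python sketch of the two ideas. It does not use this library or mirror its API: the function names `tokenize_with_offsets` and `prepare`, the whitespace splitting, the lowercasing step, and the BERT-style `[CLS]`, `[SEP]` and `[PAD]` tokens are all assumptions chosen for illustration.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def tokenize_with_offsets(text: str) -> List[Tuple[str, Span]]:
    """Toy tokenizer (hypothetical): split on whitespace, lowercase each piece,
    and keep the (start, end) span of the piece in the ORIGINAL text."""
    return [(m.group().lower(), m.span()) for m in re.finditer(r"\S+", text)]


def prepare(tokens: List[str], length: int,
            cls: str = "[CLS]", sep: str = "[SEP]", pad: str = "[PAD]") -> List[str]:
    """Truncate, add special tokens, then right-pad to exactly `length` items.
    Assumes length >= 2 so that both special tokens always fit."""
    body = tokens[: length - 2]          # leave room for cls and sep
    seq = [cls] + body + [sep]
    return seq + [pad] * (length - len(seq))


original = "Hello  World, HOW are you?"
encoded = tokenize_with_offsets(original)
tokens = [tok for tok, _ in encoded]

# Every normalized token can be mapped back to the exact slice it came from.
for tok, (start, end) in encoded:
    print(f"{tok!r} <- {original[start:end]!r}")

print(prepare(tokens, 8))   # shorter than 8 items, so one pad token is appended
print(prepare(tokens, 4))   # longer than 4 items, so the body is truncated
```

Because the offsets refer to the untouched input, normalization such as lowercasing never loses the link between a token and its source text. The library itself handles much richer normalizers and sub-word models; this sketch only shows the bookkeeping involved.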



Bindings

We provide bindings to the following languages (more to come!):
