Quick README update
README.md (14 changed lines)
@@ -5,8 +5,7 @@ versatility.
## What is a Tokenizer

-A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
-`Encoding`.
+A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
The various steps of the pipeline are:

1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
@@ -18,6 +17,17 @@ The various steps of the pipeline are:
4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
   that, for example, a language model would need, such as special tokens.
+## Main features:
+
+ - Train new vocabularies and tokenize, using today's most used tokenizers.
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+   less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It's always possible to get the part of the
+   original sentence that corresponds to a given token.
+ - Does all the pre-processing: truncate, pad, add the special tokens your model needs.
+
+## Bindings
+
+We provide bindings to the following languages (more to come!):
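
The pipeline the new README text describes is easiest to see in code. Below is a minimal sketch using the Python bindings of the upstream huggingface/tokenizers project this repository mirrors; the diff context elides steps 2 and 3, so the `PreTokenizer` and the `BPE` model here are assumptions, and the corpus and special tokens are illustrative only.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFKC, Lowercase, Sequence
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# Core model of the pipeline: a BPE vocabulary, trained below.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# 1. Normalizer: normalize the raw text (unicode normalization, lowercasing).
tokenizer.normalizer = Sequence([NFKC(), Lowercase()])

# 2. PreTokenizer (assumed step): split the text into word-level pieces.
tokenizer.pre_tokenizer = Whitespace()

# Train a tiny illustrative vocabulary so `encode` has something to work with.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(
    ["A Tokenizer works as a pipeline.", "It outputs an Encoding."],
    trainer,
)

# 4. PostProcessor: add the special tokens a language model would need.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# The pipeline's output is an `Encoding`.
encoding = tokenizer.encode("A Tokenizer works as a pipeline.")
print(encoding.tokens)  # e.g. ['[CLS]', 'a', 'tokenizer', ..., '[SEP]']
```

Each attribute corresponds to one pipeline stage, so any stage can be swapped out independently.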
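
The alignment-tracking and pre-processing bullets added by this commit can be sketched the same way, again assuming the upstream Python bindings; `bert-base-uncased` is just an illustrative pretrained tokenizer (downloaded from the Hugging Face hub), and `pad_id=0` matches that particular vocabulary.

```python
from tokenizers import Tokenizer

# Load any pretrained tokenizer file; the name here is illustrative.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

text = "Hello, world! Tokenizers keep alignments."
encoding = tokenizer.encode(text)

# Alignment tracking: each token carries (start, end) offsets into the
# ORIGINAL sentence, even though the text was normalized on the way in.
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token:>12} -> {text[start:end]!r}")

# Pre-processing: truncation and padding are built into the Tokenizer.
tokenizer.enable_truncation(max_length=16)
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=16)
print(tokenizer.encode(text).ids)  # padded/truncated to exactly 16 ids
```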