Split readme

Anthony MOI
2020-01-10 11:09:28 -05:00
parent b27737d97c
commit e7395285f2
3 changed files with 19 additions and 15 deletions


@@ -3,20 +3,6 @@
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.

## What is a Tokenizer

A Tokenizer works as a pipeline: it takes some raw text as input and outputs an `Encoding`.
The various steps of the pipeline are:
1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
   the [Unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
   splitting text is simply on whitespace.
3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
`BPE` or `WordPiece`.
4. The `PostProcessor`: in charge of post-processing the `Encoding`, adding anything relevant
   that a language model would need, such as special tokens.
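
For illustration, here is a minimal sketch of how these four steps might be wired together through
the Python bindings; the choice of `BPE`, the `Whitespace` pre-tokenizer, and the `[CLS]`/`[SEP]`
template arguments are assumptions for the example, not prescriptions from this README:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFD
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Build an (untrained) tokenizer and attach each pipeline step.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))   # 3. the Model
tokenizer.normalizer = NFD()                    # 1. the Normalizer
tokenizer.pre_tokenizer = Whitespace()          # 2. the PreTokenizer
tokenizer.post_processor = TemplateProcessing(  # 4. the PostProcessor
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```
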
## Main features:

- Train new vocabularies and tokenize, using today's most used tokenizers (see the sketch below).
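
A short sketch of that training feature, again using the Python bindings; `corpus.txt` is a
hypothetical stand-in path, and the trainer arguments are illustrative assumptions:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a brand-new BPE vocabulary from plain-text files, then tokenize.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt: hypothetical input file

print(tokenizer.encode("Hello, y'all!").tokens)
```
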
@@ -32,3 +18,4 @@ The various steps of the pipeline are:
We provide bindings to the following languages (more to come!):
- [Python](https://github.com/huggingface/tokenizers/tree/master/bindings/python)
- [Node.js](https://github.com/huggingface/tokenizers/tree/master/bindings/node)