Split readme

README.md (15 changed lines)

@@ -3,20 +3,6 @@
 Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 
-## What is a Tokenizer
-
-A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
-The various steps of the pipeline are:
-
-1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
-   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
-2. The `PreTokenizer`: in charge of creating initial word splits in the text. The most common way of
-   splitting text is simply on whitespace.
-3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
-   `BPE` or `WordPiece`.
-4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
-   that, for example, a language model would need, such as special tokens.
-
 ## Main features:
 
 - Train new vocabularies and tokenize, using today's most used tokenizers.

@@ -32,3 +18,4 @@ The various steps of the pipeline are:
 
 We provide bindings to the following languages (more to come!):
 - [Python](https://github.com/huggingface/tokenizers/tree/master/bindings/python)
 - [Node.js](https://github.com/huggingface/tokenizers/tree/master/bindings/node)
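
The four-step pipeline this commit moves out of the main README (and into tokenizers/README.md below) is easy to picture as plain function composition. Here is a minimal, self-contained sketch under that reading; the `Encoding` fields, the made-up token ids, and the `[CLS]`/`[SEP]` special tokens are illustrative assumptions, not the crate's actual API:

```rust
/// What the pipeline produces: tokens plus their ids.
#[derive(Debug)]
struct Encoding {
    tokens: Vec<String>,
    ids: Vec<u32>,
}

/// 1. `Normalizer`: clean up the raw text (lowercasing here stands in
///    for real normalizers such as NFD or NFKC).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// 2. `PreTokenizer`: initial word splits, here simply on whitespace.
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_string()).collect()
}

/// 3. `Model`: the actual tokenization. A toy lookup stands in for BPE
///    or WordPiece; the ids are made up (token length).
fn model(words: Vec<String>) -> Encoding {
    let ids = words.iter().map(|w| w.len() as u32).collect();
    Encoding { tokens: words, ids }
}

/// 4. `PostProcessor`: add what a language model needs, e.g. special tokens.
fn post_process(mut enc: Encoding) -> Encoding {
    enc.tokens.insert(0, "[CLS]".to_string());
    enc.ids.insert(0, 101);
    enc.tokens.push("[SEP]".to_string());
    enc.ids.push(102);
    enc
}

fn main() {
    let text = normalize("Hello World!");
    let encoding = post_process(model(pre_tokenize(&text)));
    println!("{:?}", encoding); // tokens: [CLS] hello world! [SEP]
}
```

The crate itself generalizes each stage behind a trait, so concrete implementations such as a unicode normalizer, a whitespace pre-tokenizer, or a `BPE`/`WordPiece` model can be swapped into the same pipeline.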

tokenizers/Cargo.toml

@@ -8,7 +8,7 @@ repository = "https://github.com/huggingface/tokenizers"
 documentation = "https://docs.rs/tokenizers/"
 license = "Apache-2.0"
 keywords = ["text", "tokenizer", "tokenization", "NLP", "huggingface", "BPE", "WordPiece"]
-readme = "../README.md"
+readme = "./README.md"
 description = """
 Provides an implementation of today's most used tokenizers,
 with a focus on performance and versatility.

tokenizers/README.md (new file, 17 lines)

@@ -0,0 +1,17 @@
+# Tokenizers
+
+The core of `tokenizers`, written in Rust.
+
+## What is a Tokenizer
+
+A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
+The various steps of the pipeline are:
+
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating initial word splits in the text. The most common way of
+   splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
+   `BPE` or `WordPiece`.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+   that, for example, a language model would need, such as special tokens.
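
Step 3 of the new README names `BPE` and `WordPiece` as example `Model`s. Below is a sketch of the greedy longest-match-first idea behind WordPiece-style tokenization; the tiny vocabulary, the `##` continuation prefix, and the `[UNK]` fallback follow the common BERT convention and are assumptions for illustration, not this crate's internals:

```rust
use std::collections::HashSet;

/// Greedy longest-match-first lookup, WordPiece style: repeatedly take the
/// longest vocabulary entry that matches at the current position.
/// (Byte-indexed for brevity, so this demo assumes ASCII input.)
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let mut found = None;
        // Shrink the candidate from the right until the vocab contains it.
        while start < end {
            let mut piece = word[start..end].to_string();
            if start > 0 {
                piece = format!("##{piece}"); // mark a continuation piece
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end; // resume after the matched piece
            }
            // Nothing matched: treat the whole word as unknown.
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<&str> = ["token", "##izer", "##s"].into_iter().collect();
    // Prints ["token", "##izer", "##s"]
    println!("{:?}", wordpiece("tokenizers", &vocab));
}
```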