mirror of https://github.com/mii443/tokenizers.git
synced 2025-08-23 00:35:35 +00:00
Split readme
README.md (15 lines changed)

@@ -3,20 +3,6 @@
 Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 
-## What is a Tokenizer
-
-A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
-The various steps of the pipeline are:
-
-1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
-   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
-2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
-   splitting text is simply on whitespace.
-3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
-   `BPE` or `WordPiece`.
-4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
-   that, for example, a language model would need, such as special tokens.
-
 ## Main features:
 
 - Train new vocabularies and tokenize, using today's most used tokenizers.
@@ -32,3 +18,4 @@ The various steps of the pipeline are:
 
 We provide bindings to the following languages (more to come!):
 - [Python](https://github.com/huggingface/tokenizers/tree/master/bindings/python)
+- [Node.js](https://github.com/huggingface/tokenizers/tree/master/bindings/node)
tokenizers/Cargo.toml

@@ -8,7 +8,7 @@ repository = "https://github.com/huggingface/tokenizers"
 documentation = "https://docs.rs/tokenizers/"
 license = "Apache-2.0"
 keywords = ["text", "tokenizer", "tokenization", "NLP", "huggingface", "BPE", "WordPiece"]
-readme = "../README.md"
+readme = "./README.md"
 description = """
 Provides an implementation of today's most used tokenizers,
 with a focus on performances and versatility.
tokenizers/README.md (new file, 17 lines)

@@ -0,0 +1,17 @@
+# Tokenizers
+
+The core of `tokenizers`, written in Rust.
+
+## What is a Tokenizer
+
+A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
+The various steps of the pipeline are:
+
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
+   splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
+   `BPE` or `WordPiece`.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+   that, for example, a language model would need, such as special tokens.
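Not part of the diff, but as a reader aid: a minimal sketch of the four-step pipeline the new README describes, using the crate's Rust API. The vocab/merges paths and the `[UNK]` token are placeholders, and the exact builder methods may differ in the crate version this commit targets.

    use tokenizers::models::bpe::BPE;
    use tokenizers::tokenizer::{Result, Tokenizer};

    fn main() -> Result<()> {
        // Step 3, the `Model`: a BPE model built from placeholder
        // vocab/merges files.
        let bpe = BPE::from_file("./vocab.json", "./merges.txt")
            .unk_token("[UNK]".into())
            .build()?;

        // The `Tokenizer` wraps the model; steps 1, 2, and 4 (a
        // `Normalizer`, a `PreTokenizer`, a `PostProcessor`) can be
        // attached to it before encoding.
        let tokenizer = Tokenizer::new(bpe);

        // Raw text in, `Encoding` out (`false`: no special tokens added).
        let encoding = tokenizer.encode("Hey there!", false)?;
        println!("{:?}", encoding.get_tokens());
        Ok(())
    }

With a `Normalizer` such as `NFKC` and a whitespace `PreTokenizer` attached, the same `encode` call runs the full pipeline described above.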