diff --git a/README.md b/README.md
index 08720655..94476acb 100644
--- a/README.md
+++ b/README.md
@@ -3,20 +3,6 @@
 Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 
-## What is a Tokenizer
-
-A Tokenizer works as a pipeline, it processes some raw text as input and outputs an `Encoding`.
-The various steps of the pipeline are:
-
-1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
-   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
-2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of
-   splitting text is simply on whitespace.
-3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
-   `BPE` or `WordPiece`.
-4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
-   that, for example, a language model would need, such as special tokens.
-
 ## Main features:
 
 - Train new vocabularies and tokenize, using todays most used tokenizers.
@@ -32,3 +18,4 @@
 
 We provide bindings to the following languages (more to come!):
   - [Python](https://github.com/huggingface/tokenizers/tree/master/bindings/python)
+  - [Node.js](https://github.com/huggingface/tokenizers/tree/master/bindings/node)
diff --git a/tokenizers/Cargo.toml b/tokenizers/Cargo.toml
index c25f2074..c9cf3f25 100644
--- a/tokenizers/Cargo.toml
+++ b/tokenizers/Cargo.toml
@@ -8,7 +8,7 @@
 repository = "https://github.com/huggingface/tokenizers"
 documentation = "https://docs.rs/tokenizers/"
 license = "Apache-2.0"
 keywords = ["text", "tokenizer", "tokenization", "NLP", "huggingface", "BPE", "WordPiece"]
-readme = "../README.md"
+readme = "./README.md"
 description = """
 Provides an implementation of today's most used tokenizers, with a focus on performances and versatility.
diff --git a/tokenizers/README.md b/tokenizers/README.md
new file mode 100644
index 00000000..6f313ede
--- /dev/null
+++ b/tokenizers/README.md
@@ -0,0 +1,60 @@
+# Tokenizers
+
+The core of `tokenizers`, written in Rust.
+
+## What is a Tokenizer
+
+A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
+The various steps of the pipeline are (see the sketch after this list):
+
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+   the [Unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
+   splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
+   `BPE` or `WordPiece`.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+   that, for example, a language model would need, such as special tokens.
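+
+As a rough illustration, here is a minimal, self-contained sketch of how these
+four steps chain together. The function and type names are hypothetical
+stand-ins, not the crate's actual API:
+
+```rust
+// A minimal sketch of the pipeline described above, with toy stand-ins for
+// each step. None of these names are the crate's actual API.
+
+#[derive(Debug)]
+struct Encoding {
+    tokens: Vec<String>,
+}
+
+/// Step 1 (`Normalizer`): plain lowercasing stands in for NFD/NFKC here.
+fn normalize(text: &str) -> String {
+    text.to_lowercase()
+}
+
+/// Step 2 (`PreTokenizer`): the most common split, on whitespace.
+fn pre_tokenize(text: &str) -> Vec<String> {
+    text.split_whitespace().map(String::from).collect()
+}
+
+/// Step 3 (`Model`): a trivial word-level model; a real `Model` would run
+/// BPE or WordPiece merges here.
+fn model(words: Vec<String>) -> Encoding {
+    Encoding { tokens: words }
+}
+
+/// Step 4 (`PostProcessor`): append a special token, as a language model
+/// might require.
+fn post_process(mut encoding: Encoding) -> Encoding {
+    encoding.tokens.push("[SEP]".to_string());
+    encoding
+}
+
+fn main() {
+    // A Tokenizer chains the four steps in order: raw text in, `Encoding` out.
+    let encoding = post_process(model(pre_tokenize(&normalize("Hello there, World!"))));
+    println!("{:?}", encoding.tokens); // ["hello", "there,", "world!", "[SEP]"]
+}
+```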