Split readme

README.md (15 changed lines)

@@ -3,20 +3,6 @@
 Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 
-## What is a Tokenizer
-
-A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
-The various steps of the pipeline are:
-
-1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
-   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
-2. The `PreTokenizer`: in charge of creating initial word splits in the text. The most common way of
-   splitting text is simply on whitespace.
-3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
-   `BPE` or `WordPiece`.
-4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
-   that, for example, a language model would need, such as special tokens.
-
 ## Main features:
 
 - Train new vocabularies and tokenize, using today's most used tokenizers.

@@ -32,3 +18,4 @@ The various steps of the pipeline are:
 
 We provide bindings to the following languages (more to come!):
 - [Python](https://github.com/huggingface/tokenizers/tree/master/bindings/python)
 - [Node.js](https://github.com/huggingface/tokenizers/tree/master/bindings/node)
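
The four-step pipeline this commit moves out of the main README (and into tokenizers/README.md below) is easy to picture as plain function composition. Here is a minimal, self-contained sketch under that reading; the `Encoding` fields, the made-up token ids, and the `[CLS]`/`[SEP]` special tokens are illustrative assumptions, not the crate's actual API:

```rust
/// What the pipeline produces: tokens plus their ids.
#[derive(Debug)]
struct Encoding {
    tokens: Vec<String>,
    ids: Vec<u32>,
}

/// 1. `Normalizer`: clean up the raw text (lowercasing here stands in
///    for real normalizers such as NFD or NFKC).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// 2. `PreTokenizer`: initial word splits, here simply on whitespace.
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_string()).collect()
}

/// 3. `Model`: the actual tokenization. A toy lookup stands in for BPE
///    or WordPiece; the ids are made up (token length).
fn model(words: Vec<String>) -> Encoding {
    let ids = words.iter().map(|w| w.len() as u32).collect();
    Encoding { tokens: words, ids }
}

/// 4. `PostProcessor`: add what a language model needs, e.g. special tokens.
fn post_process(mut enc: Encoding) -> Encoding {
    enc.tokens.insert(0, "[CLS]".to_string());
    enc.ids.insert(0, 101);
    enc.tokens.push("[SEP]".to_string());
    enc.ids.push(102);
    enc
}

fn main() {
    let text = normalize("Hello World!");
    let encoding = post_process(model(pre_tokenize(&text)));
    println!("{:?}", encoding); // tokens: [CLS] hello world! [SEP]
}
```

The crate itself generalizes each stage behind a trait, so concrete implementations such as a unicode normalizer, a whitespace pre-tokenizer, or a `BPE`/`WordPiece` model can be swapped into the same pipeline.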

tokenizers/Cargo.toml

@@ -8,7 +8,7 @@ repository = "https://github.com/huggingface/tokenizers"
 documentation = "https://docs.rs/tokenizers/"
 license = "Apache-2.0"
 keywords = ["text", "tokenizer", "tokenization", "NLP", "huggingface", "BPE", "WordPiece"]
-readme = "../README.md"
+readme = "./README.md"
 description = """
 Provides an implementation of today's most used tokenizers,
 with a focus on performance and versatility.

tokenizers/README.md (new file, 17 lines)

@@ -0,0 +1,17 @@
+# Tokenizers
+
+The core of `tokenizers`, written in Rust.
+
+## What is a Tokenizer
+
+A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
+The various steps of the pipeline are:
+
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating initial word splits in the text. The most common way of
+   splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
+   `BPE` or `WordPiece`.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+   that, for example, a language model would need, such as special tokens.
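
Step 3 of the new README names `BPE` and `WordPiece` as example `Model`s. Below is a sketch of the greedy longest-match-first idea behind WordPiece-style tokenization; the tiny vocabulary, the `##` continuation prefix, and the `[UNK]` fallback follow the common BERT convention and are assumptions for illustration, not this crate's internals:

```rust
use std::collections::HashSet;

/// Greedy longest-match-first lookup, WordPiece style: repeatedly take the
/// longest vocabulary entry that matches at the current position.
/// (Byte-indexed for brevity, so this demo assumes ASCII input.)
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let mut found = None;
        // Shrink the candidate from the right until the vocab contains it.
        while start < end {
            let mut piece = word[start..end].to_string();
            if start > 0 {
                piece = format!("##{piece}"); // mark a continuation piece
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end; // resume after the matched piece
            }
            // Nothing matched: treat the whole word as unknown.
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<&str> = ["token", "##izer", "##s"].into_iter().collect();
    // Prints ["token", "##izer", "##s"]
    println!("{:?}", wordpiece("tokenizers", &vocab));
}
```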