diff --git a/README.md b/README.md
index 754fba7a..f549f479 100644
--- a/README.md
+++ b/README.md
@@ -1,22 +1,22 @@
 # Tokenizers
 
-Provides an implementation of today's most used tokenizers, with a focus on performances and
+Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 
 ## What is a Tokenizer
 
-A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+A Tokenizer works as a pipeline: it processes some raw text as input and outputs an
 `Encoding`.
 The various steps of the pipeline are:
 
-1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-   the unicode normalization standards, such as `NFD` or `NFKC`.
-2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-   splitting text is simply on whitespaces, to manipulate words.
-3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
+   splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
    `BPE` or `WordPiece`.
-4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-   that a language model would need, like special tokens.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+   that, for example, a language model would need, such as special tokens.
 
 ## Bindings
diff --git a/tokenizers/src/lib.rs b/tokenizers/src/lib.rs
index 5cd5d7d9..59e38f83 100644
--- a/tokenizers/src/lib.rs
+++ b/tokenizers/src/lib.rs
@@ -2,27 +2,25 @@
 #![doc(html_favicon_url = "https://huggingface.co/favicon.ico")]
 #![doc(html_logo_url = "https://huggingface.co/landing/assets/huggingface_logo.svg")]
 
-//!
 //! # Tokenizers
 //!
-//! Provides an implementation of today's most used tokenizers, with a focus on performances and
+//! Provides an implementation of today's most used tokenizers, with a focus on performance and
 //! versatility.
 //!
 //! ## What is a Tokenizer
 //!
-//! A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+//! A Tokenizer works as a pipeline: it processes some raw text as input and outputs an
 //! `Encoding`.
 //! The various steps of the pipeline are:
 //!
-//! 1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-//!    the unicode normalization standards, such as `NFD` or `NFKC`.
-//! 2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-//!    splitting text is simply on whitespaces, to manipulate words.
-//! 3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+//! 1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+//!    the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+//! 2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
+//!    splitting text is simply on whitespace.
+//! 3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 //!    `BPE` or `WordPiece`.
-//! 4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-//!    that a language model would need, like special tokens.
-//!
+//! 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+//!    that, for example, a language model would need, such as special tokens.
 
 #[macro_use]
 extern crate lazy_static;
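For reference, the four pipeline steps described above compose directly in the crate's Rust API. The following is a minimal sketch, not the definitive usage: it assumes the high-level `Tokenizer::new` / `with_normalizer` / `with_pre_tokenizer` / `encode` entry points, whose exact signatures have shifted between crate versions, and it uses an empty `BPE::default()` model as a stand-in for a real trained vocabulary.

```rust
use tokenizers::models::bpe::BPE;
use tokenizers::normalizers::unicode::NFKC;
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Step 3, the `Model`, does the actual tokenization. An empty BPE model
    // keeps this sketch self-contained; a real setup would load trained
    // vocab/merges files instead (e.g. via `BPE::from_file`).
    let mut tokenizer = Tokenizer::new(BPE::default());

    // Step 1, the `Normalizer`: here plain unicode NFKC normalization.
    tokenizer.with_normalizer(NFKC);

    // Step 2, the `PreTokenizer`: creates the initial word splits,
    // here on whitespace.
    tokenizer.with_pre_tokenizer(Whitespace {});

    // No `PostProcessor` (step 4) is set here; that step is optional.
    // Running the pipeline (normalize -> pre-tokenize -> tokenize)
    // yields an `Encoding`; with the empty model above the token list
    // may well come back empty.
    let encoding = tokenizer.encode("Hello, world!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
```

With a trained model loaded in place of `BPE::default()`, the same pipeline would print the sub-word tokens produced for the input.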