README.md
@@ -1,22 +1,22 @@
 # Tokenizers
 
-Provides an implementation of today's most used tokenizers, with a focus on performances and
+Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 
 ## What is a Tokenizer
 
-A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+A Tokenizer works as a pipeline: it processes some raw text as input and outputs an
 `Encoding`.
 The various steps of the pipeline are:
 
-1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-   the unicode normalization standards, such as `NFD` or `NFKC`.
-2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-   splitting text is simply on whitespaces, to manipulate words.
-3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+   the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
+   splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
    `BPE` or `WordPiece`.
-4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-   that a language model would need, like special tokens.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+   that, for example, a language model would need, such as special tokens.
 
 ## Bindings
 
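To make the pipeline described in the updated README concrete, here is a minimal, self-contained Rust sketch of the four stages. Every name in it (the four functions, the toy fixed-size chunking "model", the `[CLS]`/`[SEP]` tokens) is invented for illustration and is not this crate's actual API; it only mirrors the Normalizer → PreTokenizer → Model → PostProcessor flow.

```rust
// Conceptual sketch of the pipeline; all names here are illustrative,
// not the crate's real types or functions.

/// Stage 1: the Normalizer cleans up the raw text
/// (a crude stand-in for NFD/NFKC-style normalization).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// Stage 2: the PreTokenizer creates the initial word splits,
/// here simply on whitespace.
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_owned).collect()
}

/// Stage 3: the Model does the actual tokenization into sub-word tokens.
/// Fixed-size character chunks stand in for a real BPE or WordPiece model.
fn model(words: &[String]) -> Vec<String> {
    words
        .iter()
        .flat_map(|w| {
            w.chars()
                .collect::<Vec<_>>()
                .chunks(3)
                .map(|c| c.iter().collect::<String>())
                .collect::<Vec<_>>()
        })
        .collect()
}

/// Stage 4: the PostProcessor adds anything a language model would need,
/// such as special tokens.
fn post_process(mut tokens: Vec<String>) -> Vec<String> {
    tokens.insert(0, "[CLS]".to_owned());
    tokens.push("[SEP]".to_owned());
    tokens
}

fn main() {
    let raw = "Hello Tokenizers";
    let encoding = post_process(model(&pre_tokenize(&normalize(raw))));
    println!("{:?}", encoding);
    // ["[CLS]", "hel", "lo", "tok", "eni", "zer", "s", "[SEP]"]
}
```

A real `Model` such as `BPE` or `WordPiece` learns its sub-word vocabulary from data; the chunking above only stands in for that step. The second hunk below applies the same wording changes to the crate-level doc comments (the `//!` lines), keeping the rustdoc in sync with the README.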
@@ -2,27 +2,25 @@
 #![doc(html_favicon_url = "https://huggingface.co/favicon.ico")]
 #![doc(html_logo_url = "https://huggingface.co/landing/assets/huggingface_logo.svg")]
 
 //!
 //! # Tokenizers
 //!
-//! Provides an implementation of today's most used tokenizers, with a focus on performances and
+//! Provides an implementation of today's most used tokenizers, with a focus on performance and
 //! versatility.
 //!
 //! ## What is a Tokenizer
 //!
-//! A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+//! A Tokenizer works as a pipeline: it processes some raw text as input and outputs an
 //! `Encoding`.
 //! The various steps of the pipeline are:
 //!
-//! 1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-//!    the unicode normalization standards, such as `NFD` or `NFKC`.
-//! 2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-//!    splitting text is simply on whitespaces, to manipulate words.
-//! 3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+//! 1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+//!    the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+//! 2. The `PreTokenizer`: in charge of creating the initial word splits in the text. The most common way of
+//!    splitting text is simply on whitespace.
+//! 3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 //!    `BPE` or `WordPiece`.
-//! 4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-//!    that a language model would need, like special tokens.
-//!
+//! 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+//!    that, for example, a language model would need, such as special tokens.
 
 #[macro_use]
 extern crate lazy_static;
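For completeness, this is roughly how the pieces come together when using the crate itself: a trained `Model` (here `BPE`) is wrapped in a `Tokenizer`, which drives the whole pipeline and produces an `Encoding`. This is a sketch in the crate's quick-start style, not a verified snippet against this mirror's version: the builder method names, the `encode` signature, and the file paths are assumptions and may differ between released versions.

```rust
use tokenizers::models::bpe::BPE;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Build the Model (pipeline stage 3) from existing vocab/merges files.
    // Placeholder paths; point them at a real trained BPE vocabulary.
    let bpe = BPE::from_file("./vocab.json", "./merges.txt")
        .unk_token("[UNK]".into())
        .build()?;

    // The Tokenizer drives the pipeline; a Normalizer, PreTokenizer and
    // PostProcessor can also be attached to it.
    let tokenizer = Tokenizer::new(bpe);

    // Encode some raw text; `false` means "do not add special tokens".
    let encoding = tokenizer.encode("Hello, tokenizers!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```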