The `Normalizer` is optional.

<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFKC | NFKC Unicode normalization | |
| Lowercase | Replaces all uppercase characters with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in Unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces occurrences of a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence([NFKC(), Lowercase()])` |
</python>
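
As a quick illustration of how these compose, here is a minimal Python sketch (the sample string is arbitrary) that chains three of the normalizers above and inspects the result with `normalize_str`:

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

# NFD decomposes characters so that StripAccents can remove the combining
# accent marks; Lowercase then folds whatever remains.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

# normalize_str runs the normalizer on its own, outside of a full Tokenizer.
print(normalizer.normalize_str("Héllò HOW are ü?"))
# "hello how are u?"
```
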
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFKC | NFKC Unicode normalization | |
| Lowercase | Replaces all uppercase characters with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in Unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces occurrences of a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence::new(vec![NFKC, Lowercase])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFKC | NFKC Unicode normalization | |
| Lowercase | Replaces all uppercase characters with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in Unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces occurrences of a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>cleanText</li> <li>handleChineseChars</li> <li>stripAccents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | |
</node>
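
The `BertNormalizer` options listed above map directly onto constructor arguments in each binding. A minimal Python sketch (the input string is arbitrary):

```python
from tokenizers.normalizers import BertNormalizer

normalizer = BertNormalizer(
    clean_text=True,            # remove control characters, normalize whitespace
    handle_chinese_chars=True,  # put spaces around CJK characters
    strip_accents=True,         # strip accents, as the original BERT does
    lowercase=True,
)
print(normalizer.normalize_str("Héllo, TOKENIZERS!"))
# "hello, tokenizers!"
```
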
<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>merged_with_previous</li><li>merged_with_next</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer`s that will be run in the given order | `Sequence([Punctuation(), WhitespaceSplit()])` |
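
The difference between `Whitespace` and `WhitespaceSplit` above is easy to check with `pre_tokenize_str`, which also reports each piece's offsets in the original string. A minimal sketch:

```python
from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

# Pre-tokenizers return (piece, (start, end)) pairs.
print(Whitespace().pre_tokenize_str("Hello there!"))
# [('Hello', (0, 5)), ('there', (6, 11)), ('!', (11, 12))]

print(WhitespaceSplit().pre_tokenize_str("Hello there!"))
# [('Hello', (0, 5)), ('there!', (6, 12))]
```
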
</python>
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>Removed</li><li>Isolated</li><li>MergedWithPrevious</li><li>MergedWithNext</li><li>Contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `Isolated`, invert = `false`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer`s that will be run in the given order | `Sequence::new(vec![Punctuation, WhitespaceSplit])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>mergedWithPrevious</li><li>mergedWithNext</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `false`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer`s that will be run in the given order | |
</node>
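
To make the `Split` behaviors concrete before moving on to decoders, here is a Python sketch that runs one sentence through each behavior, using a single space as the pattern (behavior names as the Python binding spells them):

```python
from tokenizers.pre_tokenizers import Split

text = "Hello, how are you?"
for behavior in ("removed", "isolated", "merged_with_previous",
                 "merged_with_next", "contiguous"):
    pieces = [piece for piece, _ in Split(" ", behavior).pre_tokenize_str(text)]
    print(behavior, pieces)
# removed              ['Hello,', 'how', 'are', 'you?']
# isolated             ['Hello,', ' ', 'how', ' ', 'are', ' ', 'you?']
# merged_with_previous ['Hello, ', 'how ', 'are ', 'you?']
# merged_with_next     ['Hello,', ' how', ' are', ' you?']
# contiguous           same as isolated here, since no two matches are adjacent
```
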
| Name | Description |
| :--- | :--- |
| ByteLevel | Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using a set of visible Unicode characters to represent each byte, so we need a Decoder to revert this process and get something readable again. |
| Metaspace | Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier `▁` to identify whitespaces, and so this Decoder helps with decoding these. |
| WordPiece | Reverts the WordPiece Model. This model uses a special identifier `##` for continuing subwords, and so this Decoder helps with decoding these. |
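
Decoders can likewise be exercised directly on a list of tokens. A minimal Python sketch (the token lists are illustrative):

```python
from tokenizers.decoders import ByteLevel, WordPiece

# "Ġ" is the ByteLevel representation of a leading space.
print(ByteLevel().decode(["Hello", "Ġmy", "Ġfriend"]))
# "Hello my friend"

# "##" marks a continuing subword in the WordPiece model.
print(WordPiece().decode(["un", "##believ", "##able"]))
# "unbelievable"
```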