* Fix typos

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>

* Update docs/source/quicktour.rst

* Update docs/source-doc-builder/quicktour.mdx

---------

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
tinyboxvk
2025-01-09 06:53:20 -04:00
committed by GitHub
parent 6945933829
commit bdfc38b78d
25 changed files with 50 additions and 50 deletions


@@ -132,14 +132,14 @@ The ``Normalizer`` is optional.
- Removes all accent symbols in unicode (to be used with NFD for consistency)
- Input: ``é``
-Ouput: ``e``
+Output: ``e``
* - Replace
- Replaces a custom string or regexp and changes it with given content
- ``Replace("a", "e")`` will behave like this:
Input: ``"banana"``
-Ouput: ``"benene"``
+Output: ``"benene"``
* - BertNormalizer
- Provides an implementation of the Normalizer used in the original BERT. Options
@@ -193,7 +193,7 @@ the ByteLevel)
- Input: ``"Hello my friend, how are you?"``
-Ouput: ``"Hello", "Ġmy", Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``
+Output: ``"Hello", "Ġmy", Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``
* - Whitespace
- Splits on word boundaries (using the following regular expression: ``\w+|[^\w\s]+``
@@ -211,13 +211,13 @@ the ByteLevel)
- Will isolate all punctuation characters
- Input: ``"Hello?"``
-Ouput: ``"Hello", "?"``
+Output: ``"Hello", "?"``
* - Metaspace
- Splits on whitespaces and replaces them with a special char "▁" (U+2581)
- Input: ``"Hello there"``
-Ouput: ``"Hello", "▁there"``
+Output: ``"Hello", "▁there"``
* - CharDelimiterSplit
- Splits on a given character
@@ -225,7 +225,7 @@ the ByteLevel)
Input: ``"Helloxthere"``
-Ouput: ``"Hello", "there"``
+Output: ``"Hello", "there"``
* - Digits
- Splits the numbers from any other characters.
@@ -361,7 +361,7 @@ reverted for example.
a set of visible Unicode characters to represent each byte, so we need a Decoder to
revert this process and get something readable again.
* - Metaspace
-- Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifer ``▁`` to
+- Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier ``▁`` to
identify whitespaces, and so this Decoder helps with decoding these.
* - WordPiece
- Reverts the WordPiece Model. This model uses a special identifier ``##`` for continuing


@@ -24,7 +24,7 @@ If you are using a unix based OS, the installation should be as simple as runnin
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
-Or you can easiy update it with the following command::
+Or you can easily update it with the following command::
rustup update


@@ -253,7 +253,7 @@ been trained if you are using a pretrained tokenizer).
The role of the model is to split your "words" into tokens, using the rules it has learned. It's
also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model.
-This model is passed along when intializing the :entity:`Tokenizer` so you already know
+This model is passed along when initializing the :entity:`Tokenizer` so you already know
how to customize this part. Currently, the 🤗 Tokenizers library supports:
- :entity:`models.BPE`