mirror of
https://github.com/mii443/tokenizers.git
synced 2025-08-22 16:25:30 +00:00
Fix typos (#1715)
* Fix typos

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>

* Update docs/source/quicktour.rst

* Update docs/source-doc-builder/quicktour.mdx

---------

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
@@ -132,14 +132,14 @@ The ``Normalizer`` is optional.
   - Removes all accent symbols in unicode (to be used with NFD for consistency)
   - Input: ``é``
-    Ouput: ``e``
+    Output: ``e``
 * - Replace
   - Replaces a custom string or regexp and changes it with given content
   - ``Replace("a", "e")`` will behave like this:
     Input: ``"banana"``
-    Ouput: ``"benene"``
+    Output: ``"benene"``
 * - BertNormalizer
   - Provides an implementation of the Normalizer used in the original BERT. Options
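The ``Replace`` behavior quoted in the hunk above can be checked against the library directly; a minimal sketch, assuming the Python ``tokenizers`` package is installed:

```python
# Sketch: the Replace normalizer swaps every match of a pattern for the
# given content, matching the "banana" -> "benene" example in the table.
from tokenizers.normalizers import Replace

normalizer = Replace("a", "e")
print(normalizer.normalize_str("banana"))  # -> benene
```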
@@ -193,7 +193,7 @@ the ByteLevel)
   - Input: ``"Hello my friend, how are you?"``
-    Ouput: ``"Hello", "Ġmy", Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``
+    Output: ``"Hello", "Ġmy", Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``
 * - Whitespace
   - Splits on word boundaries (using the following regular expression: ``\w+|[^\w\s]+``
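The ``Whitespace`` pre-tokenizer mentioned above can be exercised on its own; a minimal sketch, assuming the Python ``tokenizers`` package (``pre_tokenize_str`` returns ``(token, offsets)`` pairs rather than plain strings):

```python
# Sketch: Whitespace splits on word boundaries (\w+|[^\w\s]+),
# so punctuation is separated from words.
from tokenizers.pre_tokenizers import Whitespace

pieces = Whitespace().pre_tokenize_str("Hello?")
print([tok for tok, offsets in pieces])  # -> ['Hello', '?']
```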
@@ -211,13 +211,13 @@ the ByteLevel)
   - Will isolate all punctuation characters
   - Input: ``"Hello?"``
-    Ouput: ``"Hello", "?"``
+    Output: ``"Hello", "?"``
 * - Metaspace
   - Splits on whitespaces and replaces them with a special char "▁" (U+2581)
   - Input: ``"Hello there"``
-    Ouput: ``"Hello", "▁there"``
+    Output: ``"Hello", "▁there"``
 * - CharDelimiterSplit
   - Splits on a given character
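The ``Metaspace`` entry in this hunk can likewise be tried standalone; a minimal sketch, assuming the Python ``tokenizers`` package (whether the first token also carries a leading ``▁`` depends on the prefix options and library version, so no exact output is asserted for it here):

```python
# Sketch: Metaspace replaces whitespace with the U+2581 "lower one
# eighth block" character and splits on it.
from tokenizers.pre_tokenizers import Metaspace

pieces = Metaspace().pre_tokenize_str("Hello there")
print([tok for tok, offsets in pieces])
```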
@@ -225,7 +225,7 @@ the ByteLevel)
     Input: ``"Helloxthere"``
-    Ouput: ``"Hello", "there"``
+    Output: ``"Hello", "there"``
 * - Digits
   - Splits the numbers from any other characters.
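The ``Digits`` pre-tokenizer introduced at the end of this hunk can be sketched as follows, again assuming the Python ``tokenizers`` package (an ``individual_digits`` option also exists to split every digit on its own):

```python
# Sketch: Digits separates runs of digits from all other characters.
from tokenizers.pre_tokenizers import Digits

pieces = Digits().pre_tokenize_str("Call 123 please")
print([tok for tok, offsets in pieces])
```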
@@ -361,7 +361,7 @@ reverted for example.
     a set of visible Unicode characters to represent each byte, so we need a Decoder to
     revert this process and get something readable again.
 * - Metaspace
-  - Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifer ``▁`` to
+  - Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier ``▁`` to
     identify whitespaces, and so this Decoder helps with decoding these.
 * - WordPiece
   - Reverts the WordPiece Model. This model uses a special identifier ``##`` for continuing
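The ``WordPiece`` decoder described at the end of this hunk can be sketched directly; a minimal example, assuming the Python ``tokenizers`` package:

```python
# Sketch: the WordPiece decoder rejoins continuation pieces marked
# with the "##" identifier back into whole words.
from tokenizers.decoders import WordPiece

decoder = WordPiece()
print(decoder.decode(["walk", "##ing"]))  # -> walking
```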
@@ -24,7 +24,7 @@ If you are using a unix based OS, the installation should be as simple as runnin

     curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

-Or you can easiy update it with the following command::
+Or you can easily update it with the following command::

     rustup update
@@ -253,7 +253,7 @@ been trained if you are using a pretrained tokenizer).
 The role of the model is to split your "words" into tokens, using the rules it has learned. It's
 also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model.

-This model is passed along when intializing the :entity:`Tokenizer` so you already know
+This model is passed along when initializing the :entity:`Tokenizer` so you already know
 how to customize this part. Currently, the 🤗 Tokenizers library supports:

 - :entity:`models.BPE`
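The model-initialization step this hunk refers to can be sketched as follows, assuming the Python ``tokenizers`` package; an untrained ``BPE`` model starts with an empty vocabulary:

```python
# Sketch: a Tokenizer is constructed around a model; here an untrained
# BPE model with an unknown-token placeholder.
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
print(tokenizer.get_vocab_size())  # empty until the model is trained
```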