mirror of https://github.com/mii443/tokenizers.git, synced 2025-08-22 16:25:30 +00:00
Doc - Quick updates and typos
@@ -4,6 +4,7 @@ Components
When building a Tokenizer, you can attach various types of components to this Tokenizer in order
to customize its behavior. This page lists most provided components.

.. _normalizers:

Normalizers
----------------------------------------------------------------------------------------------------

@@ -71,6 +72,8 @@ The ``Normalizer`` is optional.
Sequence([Nmt(), NFKC()])


.. _pre-tokenizers:

Pre tokenizers
----------------------------------------------------------------------------------------------------

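For context on the ``Sequence([Nmt(), NFKC()])`` line above, a minimal sketch of composing normalizers with the Python bindings (current class names assumed) might look like:

.. code-block:: python

    from tokenizers import normalizers
    from tokenizers.normalizers import Nmt, NFKC

    # Chain several normalizers; they run left to right on the raw input
    normalizer = normalizers.Sequence([Nmt(), NFKC()])
    # normalize_str lets you inspect the effect on a plain string
    print(normalizer.normalize_str("ｔｏｋｅｎｉｚｅｒｓ…"))
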
@@ -144,6 +147,8 @@ the ByteLevel)
- ``Sequence([Punctuation(), WhitespaceSplit()])``


.. _models:

Models
----------------------------------------------------------------------------------------------------

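Similarly, the ``Sequence([Punctuation(), WhitespaceSplit()])`` pre-tokenizer mentioned above could be sketched in Python as (class names assumed from the current bindings):

.. code-block:: python

    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Punctuation, WhitespaceSplit

    pre_tokenizer = pre_tokenizers.Sequence([Punctuation(), WhitespaceSplit()])
    # Returns a list of (substring, (start, end)) pairs with offsets into the input
    print(pre_tokenizer.pre_tokenize_str("Hello, how are you?"))
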
@@ -191,6 +196,8 @@ component of a Tokenizer.
choosing the most probable one.


.. _post-processors:

PostProcessor
----------------------------------------------------------------------------------------------------

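As a reminder of how the model fits in, here is a hedged sketch of attaching a model when building a ``Tokenizer`` (``WordPiece`` chosen arbitrarily; a recent version of the Python bindings is assumed):

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece

    # The model is the one mandatory component: it is passed to the constructor
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    print(tokenizer.model)
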
@@ -223,6 +230,8 @@ is the component doing just that.
Output: ``"[CLS] I like this [SEP] but not this [SEP]"``


.. _decoders:

Decoders
----------------------------------------------------------------------------------------------------

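For the Decoders section introduced above, a small illustrative sketch (assuming the ``decode`` helper exposed by the Python bindings) could be:

.. code-block:: python

    from tokenizers import decoders

    # A WordPiece decoder merges sub-tokens back, stripping the "##" continuation prefix
    decoder = decoders.WordPiece(prefix="##")
    print(decoder.decode(["I", "like", "token", "##izers"]))
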
@@ -37,60 +37,3 @@ Main features:
    :caption: API Reference

    api/reference

-Load an existing tokenizer:
-----------------------------------------------------------------------------------------------------
-
-Loading a previously saved tokenizer is extremely simple and requires a single line of code:
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
-        :language: python
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/load.test.js
-        :language: javascript
-        :start-after: START load
-        :end-before: END load
-        :dedent: 4
-
-
-Train a tokenizer:
-----------------------------------------------------------------------------------------------------
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
-        :language: python
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/train.test.js
-        :language: javascript
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4

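The literalinclude directives removed above pulled their snippets from the repository's test files; for reference, a hedged Python sketch of loading and training a tokenizer (file paths are placeholders, and the ``train`` argument order has changed across versions) is:

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Load a previously saved tokenizer from its JSON file
    tokenizer = Tokenizer.from_file("tokenizer.json")

    # Or build a new one and train it from raw text files
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train(files=["wiki.train.raw"], trainer=trainer)
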
@@ -31,7 +31,7 @@ Normalization

Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less
random or "cleaner". Common operations include stripping whitespace, removing accented characters
-or lowercasing all text. If you're familiar with `unicode normalization
+or lowercasing all text. If you're familiar with `Unicode normalization
<https://unicode.org/reports/tr15>`__, it is also a very common normalization operation applied
in most tokenizers.

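As an illustration of such normalization, one common composition (the example string mirrors the one used elsewhere in these docs; exact output depends on the normalizers chosen) is:

.. code-block:: python

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    # Decompose accents, strip them, then lowercase
    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # expected: "hello how are u?"
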
@@ -102,7 +102,7 @@ numbers in their individual digits:
from tokenizers.pre_tokenizers import Digits

pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
    Whitespace(),
    Digits(individual_digits=True),
])
pre_tokenizer.pre_tokenize_str("Call 911!")

@@ -128,16 +128,18 @@ Once the input texts are normalized and pre-tokenized, we can apply the model on
This is the part of the pipeline that needs training on your corpus (or that has been trained if you
are using a pretrained tokenizer).

-The role of the models is to split your "words" into tokens, using the rules it has learned. It's
+The role of the model is to split your "words" into tokens, using the rules it has learned. It's
also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model.

This model is passed along when initializing the :class:`~tokenizers.Tokenizer` so you already know
how to customize this part. Currently, the 🤗 Tokenizers library supports:

-- :class:`~tokenizers.models.BPE` (Byte-Pair Encoding)
-- :class:`~tokenizers.models.Unigram` (for SentencePiece tokenizers)
-- :class:`~tokenizers.models.WordLevel` (for just returning the result of the pre-tokenization)
-- :class:`~tokenizers.models.WordPiece` (the classic BERT tokenizer)
+- :class:`~tokenizers.models.BPE`
+- :class:`~tokenizers.models.Unigram`
+- :class:`~tokenizers.models.WordLevel`
+- :class:`~tokenizers.models.WordPiece`

+For more details about each model and its behavior, you can check `here <components.html#models>`__


.. _post-processing:

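For reference, the four model classes listed above can be instantiated roughly as follows (constructor arguments are simplified, and a recent release is assumed where vocabularies are optional at construction time):

.. code-block:: python

    from tokenizers.models import BPE, Unigram, WordLevel, WordPiece

    # Each model starts empty here and would normally be trained or loaded from files
    bpe = BPE(unk_token="[UNK]")
    unigram = Unigram()
    wordlevel = WordLevel(unk_token="[UNK]")
    wordpiece = WordPiece(unk_token="[UNK]")
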
@@ -210,7 +212,10 @@ And the post-processing uses the template we saw in the previous section:
    bert_tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-       special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+       special_tokens=[
+           ("[CLS]", bert_tokenizer.token_to_id("[CLS]")),
+           ("[SEP]", bert_tokenizer.token_to_id("[SEP]"))
+       ],
    )

We can use this tokenizer and train it on wikitext like in the :doc:`quicktour`:

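The point of the change above is to look the ids up instead of hard-coding them; a quick sanity check in the same spirit:

.. code-block:: python

    # The actual ids depend on the trained vocabulary, hence token_to_id
    print(bert_tokenizer.token_to_id("[CLS]"), bert_tokenizer.token_to_id("[SEP]"))
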
@@ -24,7 +24,7 @@ with:
Training the tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenzier. For more information
+In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information
about the different types of tokenizers, check out this `guide
<https://huggingface.co/transformers/tokenizer_summary.html>`__ in the 🤗 Transformers
documentation. Here, training the tokenizer means it will learn merge rules by:

@@ -84,7 +84,7 @@ to use:
    tokenizer.train(trainer, files)

This should only take a few seconds to train our tokenizer on the full wikitext dataset! Once this
-is done, we need to save the model and reinstantiate it with the unkown token, or this token won't
+is done, we need to save the model and reinstantiate it with the unknown token, or this token won't
be used. This will be simplified in a further release, to let you set the :obj:`unk_token` when
first instantiating the model.

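One way the save-and-reinstantiate step described above was typically written (folder and file prefix are placeholders; ``BPE.from_file`` is assumed to be available in the version at hand):

.. code-block:: python

    from tokenizers.models import BPE

    # Save only the model files (vocab + merges), then reload them with the unk token set
    files = tokenizer.model.save("pretrained", "wiki")
    tokenizer.model = BPE.from_file(*files, unk_token="[UNK]")
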
@@ -100,7 +100,7 @@ To save the tokenizer in one file that contains all its configuration and vocabu

    tokenizer.save("pretrained/wiki.json")

-and you can reload your tokenzier from that file with the :meth:`~tokenizers.Tokenizer.from_file`
+and you can reload your tokenizer from that file with the :meth:`~tokenizers.Tokenizer.from_file`
class method:

.. code-block:: python

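In other words, saving and reloading round-trips through a single JSON file; a minimal sketch:

.. code-block:: python

    from tokenizers import Tokenizer

    tokenizer.save("pretrained/wiki.json")
    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
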
@@ -119,7 +119,7 @@ Now that we have trained a tokenizer, we can use it on any text we want with the

This applied the full pipeline of the tokenizer on the text, returning an
:class:`~tokenizers.Encoding` object. To learn more about this pipeline, and how to apply (or
-customize) parts of it, check out :doc:`this apge <pipeline>`.
+customize) parts of it, check out :doc:`this page <pipeline>`.

This :class:`~tokenizers.Encoding` object then has all the attributes you need for your deep
learning model (or other). The :obj:`tokens` attribute contains the segmentation of your text in

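For completeness, the surrounding quicktour obtains that :class:`~tokenizers.Encoding` with a call along these lines (the sentence is the one used in the quicktour; exact tokens and ids depend on the trained vocabulary):

.. code-block:: python

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.tokens)  # tokens produced by the full pipeline
    print(output.ids)     # their ids in the tokenizer's vocabulary
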
@@ -138,7 +138,7 @@ tokenizer's vocabulary:
    print(output.ids)
    # [27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35]

-An important feature of the 🤗 Tokenizers library is that it comes with full alignmenbt tracking,
+An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking,
meaning you can always get the part of your original sentence that corresponds to a given token.
Those are stored in the :obj:`offsets` attribute of our :class:`~tokenizers.Encoding` object. For
instance, let's assume we would want to find back what caused the :obj:`"[UNK]"` token to appear,

@@ -149,7 +149,7 @@ which is the token at index 9 in the list, we can just ask for the offset at the
    print(output.offsets[9])
    # (26, 27)

-and those are the indices that correspond to the smiler in the original sentence:
+and those are the indices that correspond to the emoji in the original sentence:

.. code-block:: python

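The code block referenced above slices the original sentence with those offsets; a sketch of that step (sentence taken from the quicktour):

.. code-block:: python

    sentence = "Hello, y'all! How are you 😁 ?"
    start, end = output.offsets[9]
    print(sentence[start:end])  # should give back the character behind the "[UNK]" token
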
@@ -183,7 +183,10 @@ Here is how we can set the post-processing to give us the traditional BERT input
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-       special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+       special_tokens=[
+           ("[CLS]", tokenizer.token_to_id("[CLS]")),
+           ("[SEP]", tokenizer.token_to_id("[SEP]"))
+       ],
    )

Let's go over this snippet of code in more detail. First we specify the template for single

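After setting this post-processor, encoding a pair of sentences should show the special tokens and type ids in action; a hedged check (sentences from the quicktour, exact tokens depend on the vocabulary):

.. code-block:: python

    output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
    print(output.tokens)    # starts with "[CLS]", each segment ending with "[SEP]"
    print(output.type_ids)  # 0 for the first sentence, 1 for the second
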