mirror of https://github.com/mii443/tokenizers.git (synced 2025-08-22 16:25:30 +00:00)
Doc - Update Normalizer part of the Pipeline page
@@ -26,6 +26,8 @@ class RustRef:
             l, title = self.make_func_link(parts, title)
         if doctype == "meth":
             l, title = self.make_meth_link(parts, title)
+        if doctype == "trait":
+            l, title = self.make_trait_link(parts, title)
         link += l

         node = nodes.reference(internal=False, refuri=link, text=title)
@@ -72,11 +74,23 @@ class RustRef:
         return link, title

+    def make_trait_link(self, parts, title):
+        link = ""
+        trait_name = parts[-1]
+
+        path = parts[:-1]
+        for p in path:
+            link += f"/{p}"
+        link += f"/trait.{trait_name}.html"
+
+        return link, title
+

 def setup(app):
     app.add_role("rust:struct", RustRef())
     app.add_role("rust:func", RustRef())
     app.add_role("rust:meth", RustRef())
+    app.add_role("rust:trait", RustRef())

     return {
         "version": "0.1",
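
For reference, the new make_trait_link helper mirrors the existing struct/func/meth helpers and follows rustdoc's URL scheme, in which a trait's documentation page is named trait.<Name>.html under its module path. A minimal standalone sketch of the behavior (the tokenizers::tokenizer::Model target below is a hypothetical example, not taken from this diff):

    # Standalone copy of the helper added above, outside the RustRef class.
    def make_trait_link(parts, title):
        link = ""
        trait_name = parts[-1]
        path = parts[:-1]
        for p in path:
            link += f"/{p}"
        # rustdoc names a trait's documentation page "trait.<Name>.html"
        link += f"/trait.{trait_name}.html"
        return link, title

    # A hypothetical role target, already split into its path segments:
    print(make_trait_link(["tokenizers", "tokenizer", "Model"], "Model"))
    # ('/tokenizers/tokenizer/trait.Model.html', 'Model')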
@@ -1,8 +1,8 @@
 The tokenization pipeline
 ====================================================================================================

-When calling :meth:`~tokenizers.Tokenizer.encode` or :meth:`~tokenizers.Tokenizer.encode_batch`, the
-input text(s) go through the following pipeline:
+When calling :entity:`Tokenizer.encode` or :entity:`Tokenizer.encode_batch`, the input text(s) go
+through the following pipeline:

 - :ref:`normalization`
 - :ref:`pre-tokenization`
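
For a concrete sense of the two entry points named in this hunk, here is a minimal sketch in Python (the input strings are illustrative, and the tokenizer file is the one trained in the quicktour, loaded as shown in the next hunk):

    from tokenizers import Tokenizer

    # Load the tokenizer trained in the quicktour.
    tokenizer = Tokenizer.from_file("pretrained/wiki.json")

    # Both calls run the full pipeline on their input(s):
    # normalization, pre-tokenization, the model, then post-processing.
    output = tokenizer.encode("Héllò hôw are ü?")
    batch = tokenizer.encode_batch(["Héllò hôw are ü?", "Hello how are u?"])
    print(output.tokens)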
@@ -14,14 +14,32 @@ We'll see in details what happens during each of those steps in detail, as well
 each of those steps to your needs. If you're already familiar with those steps and want to learn by
 seeing some code, jump to :ref:`our BERT from scratch example <example>`.

-For the examples that require a :class:`~tokenizers.Tokenizer`, we will use the tokenizer we trained
+For the examples that require a :entity:`Tokenizer`, we will use the tokenizer we trained
 in the :doc:`quicktour`, which you can load with:

-.. code-block:: python
+.. only:: python

-    from tokenizers import Tokenizer
+    .. literalinclude:: ../../bindings/python/tests/documentation/test_pipeline.py
+        :language: python
+        :start-after: START reload_tokenizer
+        :end-before: END reload_tokenizer
+        :dedent: 8

-    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
+.. only:: rust
+
+    .. literalinclude:: ../../tokenizers/tests/documentation.rs
+        :language: rust
+        :start-after: START pipeline_reload_tokenizer
+        :end-before: END pipeline_reload_tokenizer
+        :dedent: 4
+
+.. only:: node
+
+    .. literalinclude:: ../../bindings/node/examples/documentation/pipeline.test.ts
+        :language: javascript
+        :start-after: START reload_tokenizer
+        :end-before: END reload_tokenizer
+        :dedent: 8


 .. _normalization:
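
The inline Python snippet that this hunk replaces with per-language literalinclude directives is visible in its removed lines; it was simply:

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("pretrained/wiki.json")

Sourcing the snippets from the test files (test_pipeline.py, documentation.rs, pipeline.test.ts) means every documented example is also executed by the test suite, which is presumably the motivation for the change.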
@@ -36,31 +54,88 @@ or lowercasing all text. If you're familiar with `Unicode normalization
 in most tokenizers.

 Each normalization operation is represented in the 🤗 Tokenizers library by a
-:class:`~tokenizers.normalizers.Normalizer`, and you can combine several of those by using a
-:class:`~tokenizers.normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
+:entity:`Normalizer`, and you can combine several of those by using a
+:entity:`normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
 and removing accents as an example:

-.. code-block:: python
+.. only:: python

-    import tokenizers
-    from tokenizers.normalizers import NFD, StripAccents
+    .. literalinclude:: ../../bindings/python/tests/documentation/test_pipeline.py
+        :language: python
+        :start-after: START setup_normalizer
+        :end-before: END setup_normalizer
+        :dedent: 8

-    normalizer = tokenizers.normalizers.Sequence([NFD(), StripAccents()])
+.. only:: rust

-You can apply that normalizer to any string with the
-:meth:`~tokenizers.normalizers.Normalizer.normalize_str` method:
+    .. literalinclude:: ../../tokenizers/tests/documentation.rs
+        :language: rust
+        :start-after: START pipeline_setup_normalizer
+        :end-before: END pipeline_setup_normalizer
+        :dedent: 4

-.. code-block:: python
+.. only:: node

-    normalizer.normalize_str("Héllò hôw are ü?")
-    # "Hello how are u?"
+    .. literalinclude:: ../../bindings/node/examples/documentation/pipeline.test.ts
+        :language: javascript
+        :start-after: START setup_normalizer
+        :end-before: END setup_normalizer
+        :dedent: 8

-When building a :class:`~tokenizers.Tokenizer`, you can customize its normalizer by just changing
+
+You can manually test that normalizer by applying it to any string:
+
+.. only:: python
+
+    .. literalinclude:: ../../bindings/python/tests/documentation/test_pipeline.py
+        :language: python
+        :start-after: START test_normalizer
+        :end-before: END test_normalizer
+        :dedent: 8
+
+.. only:: rust
+
+    .. literalinclude:: ../../tokenizers/tests/documentation.rs
+        :language: rust
+        :start-after: START pipeline_test_normalizer
+        :end-before: END pipeline_test_normalizer
+        :dedent: 4
+
+.. only:: node
+
+    .. literalinclude:: ../../bindings/node/examples/documentation/pipeline.test.ts
+        :language: javascript
+        :start-after: START test_normalizer
+        :end-before: END test_normalizer
+        :dedent: 8
+
+
+When building a :entity:`Tokenizer`, you can customize its normalizer by just changing
 the corresponding attribute:

-.. code-block:: python
+.. only:: python

-    tokenizer.normalizer = normalizer
+    .. literalinclude:: ../../bindings/python/tests/documentation/test_pipeline.py
+        :language: python
+        :start-after: START replace_normalizer
+        :end-before: END replace_normalizer
+        :dedent: 8

+.. only:: rust
+
+    .. literalinclude:: ../../tokenizers/tests/documentation.rs
+        :language: rust
+        :start-after: START pipeline_replace_normalizer
+        :end-before: END pipeline_replace_normalizer
+        :dedent: 4
+
+.. only:: node
+
+    .. literalinclude:: ../../bindings/node/examples/documentation/pipeline.test.ts
+        :language: javascript
+        :start-after: START replace_normalizer
+        :end-before: END replace_normalizer
+        :dedent: 8
+
 Of course, if you change the way a tokenizer applies normalization, you should probably retrain it
 from scratch afterward.
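
Pieced together from the inline snippets this hunk removes, the full normalizer workflow documented here looks like the following in Python; the tested files referenced by the literalinclude directives hold the canonical versions, so treat this as a sketch:

    import tokenizers
    from tokenizers import Tokenizer
    from tokenizers.normalizers import NFD, StripAccents

    # Combine two normalization steps into a single normalizer.
    normalizer = tokenizers.normalizers.Sequence([NFD(), StripAccents()])

    # Manually test it on any string.
    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # "Hello how are u?"

    # Customize a tokenizer's normalizer by changing the corresponding attribute;
    # afterwards the tokenizer should probably be retrained from scratch.
    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
    tokenizer.normalizer = normalizer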