# The tokenization pipeline

When calling `Tokenizer.encode` or `Tokenizer.encode_batch`, the input
text(s) go through the following pipeline:

- `normalization`
- `pre-tokenization`
- `model`
- `post-processing`

We'll see in detail what happens during each of those steps, as well as
when you want to [decode](#decoding) some token ids, and how the
🤗 Tokenizers library allows you to customize each of those steps to
your needs. If you're already familiar with those steps and want to
learn by seeing some code, jump to
[our BERT from scratch example](#all-together-a-bert-tokenizer-from-scratch).

For the examples that require a `Tokenizer`, we will use the tokenizer
we trained in the `quicktour`, which you can load with:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_reload_tokenizer",
"end-before": "END pipeline_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
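
If the included snippet is not rendered in your build, the Python version boils down to loading the saved tokenizer file. A minimal sketch, where the path `data/tokenizer-wiki.json` is only an assumed example of wherever you saved the tokenizer in the quicktour:

```python
from tokenizers import Tokenizer

# Load the tokenizer trained in the quicktour; adjust the path to wherever
# you saved it (this exact filename is an assumption for illustration).
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
```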

## Normalization

Normalization is, in a nutshell, a set of operations you apply to a raw
string to make it less random or "cleaner". Common operations include
stripping whitespace, removing accented characters or lowercasing all
text. If you're familiar with [Unicode
normalization](https://unicode.org/reports/tr15), it is also a very
common normalization operation applied in most tokenizers.

Each normalization operation is represented in the 🤗 Tokenizers library
by a `Normalizer`, and you can combine several of those by using a
`normalizers.Sequence`. Here is a normalizer applying NFD Unicode
normalization and removing accents as an example:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_normalizer",
"end-before": "END pipeline_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

You can manually test that normalizer by applying it to any string:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_normalizer",
"end-before": "END pipeline_test_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
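
In Python, building that normalizer and testing it manually is a few lines; the input string below is just an arbitrary example:

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

# NFD decomposes accented characters, StripAccents then drops the accent marks.
normalizer = normalizers.Sequence([NFD(), StripAccents()])

print(normalizer.normalize_str("Héllò hôw are ü?"))
# "Hello how are u?"
```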

When building a `Tokenizer`, you can customize its normalizer by just
changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_normalizer",
"end-before": "END pipeline_replace_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
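
In the Python bindings this is a plain attribute assignment; a minimal sketch, where the tokenizer path is again only an assumed example:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.normalizers import NFD, StripAccents

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")  # assumed path
# Swapping the normalizer is just setting the attribute.
tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
```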

Of course, if you change the way a tokenizer applies normalization, you
should probably retrain it from scratch afterward.

## Pre-Tokenization

Pre-tokenization is the act of splitting a text into smaller objects
that give an upper bound to what your tokens will be at the end of
training. A good way to think of this is that the pre-tokenizer will
split your text into "words", and then your final tokens will be parts
of those words.

An easy way to pre-tokenize inputs is to split on spaces and
punctuation, which is done by the `pre_tokenizers.Whitespace`
pre-tokenizer:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_pre_tokenizer",
"end-before": "END pipeline_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
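
In Python you can try the pre-tokenizer directly with `pre_tokenize_str`; the sentence below is an arbitrary example, and the values in the comment show roughly what the output format looks like:

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
# Each tuple is (word, (start, end)) with offsets into the original string.
# Note how the contraction "I'm" gets split on the apostrophe:
print(pre_tokenizer.pre_tokenize_str("Hello, I'm fine."))
# [('Hello', (0, 5)), (',', (5, 6)), ('I', (7, 8)), ("'", (8, 9)),
#  ('m', (9, 10)), ('fine', (11, 15)), ('.', (15, 16))]
```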

The output is a list of tuples, with each tuple containing one word and
its span in the original sentence (which is used to determine the final
`offsets` of our `Encoding`). Note that splitting on punctuation will
split contractions like `"I'm"` in this example.

You can combine any `PreTokenizer`s together. For instance, here is a
pre-tokenizer that will split on space, punctuation and digits,
separating numbers into their individual digits:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_combine_pre_tokenizer",
"end-before": "END pipeline_combine_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
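
A rough Python sketch of such a combination, chaining `Whitespace` with `Digits` (the input sentence and the commented output are illustrative):

```python
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits, Whitespace

# Whitespace splits on spaces and punctuation, then Digits splits every
# number into its individual digits (individual_digits=True).
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])

print(pre_tokenizer.pre_tokenize_str("Call 911!"))
# [('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]
```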

As we saw in the `quicktour`, you can customize the pre-tokenizer of a
`Tokenizer` by just changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_pre_tokenizer",
"end-before": "END pipeline_replace_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Of course, if you change the pre-tokenizer, you should probably retrain
your tokenizer from scratch afterward.
## Model

Once the input texts are normalized and pre-tokenized, the `Tokenizer`
applies the model on the pre-tokens. This is the part of the pipeline
that needs training on your corpus (or that has been trained if you are
using a pretrained tokenizer).

The role of the model is to split your "words" into tokens, using the
rules it has learned. It's also responsible for mapping those tokens to
their corresponding IDs in the vocabulary of the model.

This model is passed along when initializing the `Tokenizer`, so you
already know how to customize this part. Currently, the 🤗 Tokenizers
library supports:

- `models.BPE`
- `models.Unigram`
- `models.WordLevel`
- `models.WordPiece`

For more details about each model and its behavior, you can check
[here](components.html#models).
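
For instance, picking a model is just the constructor argument of `Tokenizer`. A minimal Python sketch (the `"[UNK]"` value is the usual convention, not something imposed by the library):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# The model is chosen when the Tokenizer is created; here a BPE model with an
# unknown token (swap in Unigram, WordLevel or WordPiece as needed).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```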

## Post-Processing

Post-processing is the last step of the tokenization pipeline: it
performs any additional transformation on the `Encoding` before it's
returned, like adding potential special tokens.

As we saw in the quick tour, we can customize the post-processor of a
`Tokenizer` by setting the corresponding attribute. For instance, here
is how we can post-process to make the inputs suitable for the BERT
model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_processor",
"end-before": "END pipeline_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
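
In Python this is typically a `TemplateProcessing` post-processor. The sketch below assumes `"[CLS]"` and `"[SEP]"` have IDs 1 and 2 in your vocabulary; for your own tokenizer, look them up with `tokenizer.token_to_id`:

```python
from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")  # assumed path from the quicktour

# $A is the first sequence and $B the second; ":1" sets the type id of those pieces to 1.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```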

Note that contrary to the pre-tokenizer or the normalizer, you don't
need to retrain a tokenizer after changing its post-processor.

## All together: a BERT tokenizer from scratch

Let's put all those pieces together to build a BERT tokenizer. First,
BERT relies on WordPiece, so we instantiate a new `Tokenizer` with this
model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Then we know that BERT preprocesses texts by removing accents and
lowercasing. We also use a Unicode normalizer:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

The pre-tokenizer is just splitting on whitespace and punctuation:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

And the post-processing uses the template we saw in the previous
section:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can use this tokenizer and train it on wikitext like in the
`quicktour` (a consolidated Python sketch of all these steps follows the
snippet below):

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
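
Here is that consolidated sketch in Python, putting the model, normalizer, pre-tokenizer, post-processor and trainer together. The corpus file names and the vocabulary size are assumptions standing in for your own data:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

# Model: BERT uses WordPiece.
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization: NFD Unicode normalization, lowercasing, accent stripping.
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Pre-tokenization: split on whitespace and punctuation.
bert_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Post-processing: the BERT template; [CLS]=1 and [SEP]=2 assuming they are
# registered in that order by the trainer below.
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

# Training on a corpus (paths and vocab_size are illustrative assumptions).
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("bert-wiki.json")
```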

## Decoding

On top of encoding the input texts, a `Tokenizer` also has an API for
decoding, that is, converting IDs generated by your model back into
text. This is done by the methods `Tokenizer.decode` (for one predicted
text) and `Tokenizer.decode_batch` (for a batch of predictions).

The decoder will first convert the IDs back to tokens (using the
tokenizer's vocabulary) and remove all special tokens, then join those
tokens with spaces:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_decoding",
"end-before": "END pipeline_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
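
A quick Python illustration (the tokenizer path and the sentence are arbitrary examples):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")  # assumed path from the quicktour

# Encode a sentence, then turn the resulting IDs back into text.
output = tokenizer.encode("Hello, y'all! How are you?")
print(tokenizer.decode(output.ids))

# decode skips special tokens by default; pass skip_special_tokens=False to keep them.
print(tokenizer.decode(output.ids, skip_special_tokens=False))
```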

If you used a model that added special characters to represent subtokens
of a given "word" (like the `"##"` in WordPiece), you will need to
customize the decoder to treat them properly. If we take our previous
`bert_tokenizer` for instance, the default decoding will give:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

But by changing it to a proper decoder, we get:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
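
In Python, switching to a WordPiece-aware decoder is a one-line sketch, reusing the `bert_tokenizer` built in the consolidated example above:

```python
from tokenizers import decoders

# The WordPiece decoder merges "##"-prefixed sub-tokens back into whole words.
bert_tokenizer.decoder = decoders.WordPiece()
```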