diff --git a/docs/source/components.rst b/docs/source/components.rst
index 8bb1a113..bb4be17b 100644
--- a/docs/source/components.rst
+++ b/docs/source/components.rst
@@ -4,6 +4,7 @@ Components
 When building a Tokenizer, you can attach various types of components to this Tokenizer in order to
 customize its behavior. This page lists most provided components.
 
+.. _normalizers:
 
 Normalizers
 ----------------------------------------------------------------------------------------------------
@@ -71,6 +72,8 @@ The ``Normalizer`` is optional.
 
             Sequence([Nmt(), NFKC()])
 
+.. _pre-tokenizers:
+
 Pre tokenizers
 ----------------------------------------------------------------------------------------------------
 
@@ -144,6 +147,8 @@ the ByteLevel)
 
       - ``Sequence([Punctuation(), WhitespaceSplit()])``
 
+.. _models:
+
 Models
 ----------------------------------------------------------------------------------------------------
 
@@ -191,6 +196,8 @@ component of a Tokenizer.
 
       choosing the most probable one.
 
+.. _post-processors:
+
 PostProcessor
 ----------------------------------------------------------------------------------------------------
 
@@ -223,6 +230,8 @@ is the component doing just that.
 
       Output: ``"[CLS] I like this [SEP] but not this [SEP]"``
 
+.. _decoders:
+
 Decoders
 ----------------------------------------------------------------------------------------------------
 
diff --git a/docs/source/index.rst b/docs/source/index.rst
index d2e8ef72..2f5ba95b 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -37,60 +37,3 @@ Main features:
     :caption: API Reference
 
     api/reference
-
-Load an existing tokenizer:
-----------------------------------------------------------------------------------------------------
-
-Loading a previously saved tokenizer is extremely simple and requires a single line of code:
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
-        :language: python
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/load.test.js
-        :language: javascript
-        :start-after: START load
-        :end-before: END load
-        :dedent: 4
-
-
-Train a tokenizer:
-----------------------------------------------------------------------------------------------------
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
-        :language: python
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/train.test.js
-        :language: javascript
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
diff --git a/docs/source/pipeline.rst b/docs/source/pipeline.rst
index c9870824..6f106fd5 100644
--- a/docs/source/pipeline.rst
+++ b/docs/source/pipeline.rst
@@ -31,7 +31,7 @@ Normalization
 
 Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less
 random or "cleaner". Common operations include stripping whitespace, removing accented characters
-or lowercasing all text. If you're familiar with `unicode normalization
+or lowercasing all text. If you're familiar with `Unicode normalization
 `__, it is also a very common normalization operation applied in
 most tokenizers.
 
@@ -102,7 +102,7 @@ numbers in their individual digits:
     from tokenizers.pre_tokenizers import Digits
 
     pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
-        Whitespace(), 
+        Whitespace(),
         Digits(individual_digits=True),
     ])
     pre_tokenizer.pre_tokenize_str("Call 911!")
@@ -128,16 +128,18 @@ Once the input texts are normalized and pre-tokenized, we can apply the model on
 This is the part of the pipeline that needs training on your corpus (or that has been trained if
 you are using a pretrained tokenizer).
 
-The role of the models is to split your "words" into tokens, using the rules it has learned. It's
+The role of the model is to split your "words" into tokens, using the rules it has learned. It's
 also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the
 model.
 
 This model is passed along when intializing the :class:`~tokenizers.Tokenizer` so you already know
 how to customize this part. Currently, the 🤗 Tokenizers library supports:
 
-- :class:`~tokenizers.models.BPE` (Byte-Pair Encoding)
-- :class:`~tokenizers.models.Unigram` (for SentencePiece tokenizers)
-- :class:`~tokenizers.models.WordLevel` (for just returning the result of the pre-tokenization)
-- :class:`~tokenizers.models.WordPiece` (the classic BERT tokenizer)
+- :class:`~tokenizers.models.BPE`
+- :class:`~tokenizers.models.Unigram`
+- :class:`~tokenizers.models.WordLevel`
+- :class:`~tokenizers.models.WordPiece`
+
+For more details about each model and its behavior, you can check `here `__
 
 .. _post-processing:
@@ -210,7 +212,10 @@ And the post-processing uses the template we saw in the previous section:
     bert_tokenizer.post_processor = TemplateProcessing(
         single="[CLS] $A [SEP]",
         pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+        special_tokens=[
+            ("[CLS]", bert_tokenizer.token_to_id("[CLS]")),
+            ("[SEP]", bert_tokenizer.token_to_id("[SEP]"))
+        ],
     )
 
 We can use this tokenizer and train on it on wikitext like in the :doc:`quicktour`:
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
index 8454677a..44f59810 100644
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -24,7 +24,7 @@ with:
 Training the tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenzier. For more information
+In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information
 about the different type of tokenizers, check out this `guide
 `__ in the 🤗 Transformers
 documentation. Here, training the tokenizer means it will learn merge rules by:
@@ -84,7 +84,7 @@ to use:
     tokenizer.train(trainer, files)
 
 This should only take a few seconds to train our tokenizer on the full wikitext dataset! Once this
-is done, we need to save the model and reinstantiate it with the unkown token, or this token won't
+is done, we need to save the model and reinstantiate it with the unknown token, or this token won't
 be used. This will be simplified in a further release, to let you set the :obj:`unk_token` when
 first instantiating the model.
 
@@ -100,7 +100,7 @@ To save the tokenizer in one file that contains all its configuration and vocabu
 
     tokenizer.save("pretrained/wiki.json")
 
-and you can reload your tokenzier from that file with the :meth:`~tokenizers.Tokenizer.from_file`
+and you can reload your tokenizer from that file with the :meth:`~tokenizers.Tokenizer.from_file`
 class method:
 
 .. code-block:: python
@@ -119,7 +119,7 @@ Now that we have trained a tokenizer, we can use it on any text we want with the
 
 This applied the full pipeline of the tokenizer on the text, returning an
 :class:`~tokenizers.Encoding` object. To learn more about this pipeline, and how to apply (or
-customize) parts of it, check out :doc:`this apge `.
+customize) parts of it, check out :doc:`this page `.
 
 This :class:`~tokenizers.Encoding` object then has all the attributes you need for your deep
 learning model (or other). The :obj:`tokens` attribute contains the segmentation of your text in
@@ -138,7 +138,7 @@ tokenizer's vocabulary:
     print(output.ids)
     # [27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35]
 
-An important feature of the 🤗 Tokenizers library is that it comes with full alignmenbt tracking,
+An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking,
 meaning you can always get the part of your original sentence that corresponds to a given token.
 Those are stored in the :obj:`offsets` attribute of our :class:`~tokenizers.Encoding` object. For
 instance, let's assume we would want to find back what caused the :obj:`"[UNK]"` token to appear,
@@ -149,7 +149,7 @@ which is the token at index 9 in the list, we can just ask for the offset at the
     print(output.offsets[9])
     # (26, 27)
 
-and those are the indices that correspond to the smiler in the original sentence:
+and those are the indices that correspond to the emoji in the original sentence:
 
 .. code-block:: python
 
@@ -183,7 +183,10 @@ Here is how we can set the post-processing to give us the traditional BERT input
     tokenizer.post_processor = TemplateProcessing(
         single="[CLS] $A [SEP]",
         pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+        special_tokens=[
+            ("[CLS]", tokenizer.token_to_id("[CLS]")),
+            ("[SEP]", tokenizer.token_to_id("[SEP]"))
+        ],
     )
 
 Let's go over this snippet of code in more details. First we specify the template for single