diff --git a/docs/source/pipeline.rst b/docs/source/pipeline.rst
index 31dab2ac..c9870824 100644
--- a/docs/source/pipeline.rst
+++ b/docs/source/pipeline.rst
@@ -6,12 +6,13 @@ input text(s) go through the following pipeline:
 
 - :ref:`normalization`
 - :ref:`pre-tokenization`
-- :ref:`tokenization`
+- :ref:`model`
 - :ref:`post-processing`
 
 We'll see in detail what happens during each of those steps, as well as when you want to
 :ref:`decode <decoding>` some token ids, and how the 🤗 Tokenizers library allows you to customize
-each of those steps to your needs.
+each of those steps to your needs. If you're already familiar with those steps and want to learn by
+seeing some code, jump to :ref:`our BERT from scratch example <example>`.
 
 For the examples that require a :class:`~tokenizers.Tokenizer`, we will use the tokenizer we trained
 in the :doc:`quicktour`, which you can load with:
@@ -39,7 +40,7 @@ Each normalization operation is represented in the 🤗 Tokenizers library by a
 :class:`~tokenizers.normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
 and removing accents as an example:
 
-.. code-block::
+.. code-block:: python
 
     import tokenizers
     from tokenizers.normalizers import NFD, StripAccents
@@ -49,7 +50,7 @@ and removing accents as an example:
 You can apply that normalizer to any string with the
 :meth:`~tokenizers.normalizers.Normalizer.normalize_str` method:
 
-.. code-block::
+.. code-block:: python
 
     normalizer.normalize_str("Héllò hôw are ü?")
     # "Hello how are u?"
@@ -57,13 +58,14 @@ You can apply that normalizer to any string with the
 When building a :class:`~tokenizers.Tokenizer`, you can customize its normalizer by just changing
 the corresponding attribute:
 
-.. code-block::
+.. code-block:: python
 
     tokenizer.normalizer = normalizer
 
 Of course, if you change the way a tokenizer applies normalization, you should probably retrain it
 from scratch afterward.
 
+
 .. _pre-tokenization:
 
 Pre-Tokenization
@@ -74,20 +76,200 @@ what your tokens will be at the end of training. A good way to think of this is
 pre-tokenizer will split your text into "words" and then, your final tokens will be parts of those
 words.
 
-.. _tokenization:
+An easy way to pre-tokenize inputs is to split on spaces and punctuation, which is what the
+:class:`~tokenizers.pre_tokenizers.Whitespace` pre-tokenizer does:
 
-Tokenization
+.. code-block:: python
+
+    from tokenizers.pre_tokenizers import Whitespace
+
+    pre_tokenizer = Whitespace()
+    pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you.")
+    # [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)),
+    #  ("?", (18, 19)), ("I", (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ("fine", (24, 28)),
+    #  (",", (28, 29)), ("thank", (30, 35)), ("you", (36, 39)), (".", (39, 40))]
+
+The output is a list of tuples, with each tuple containing one word and its span in the original
+sentence (which is used to determine the final :obj:`offsets` of our :class:`~tokenizers.Encoding`).
+Note that splitting on punctuation will split contractions like :obj:`"I'm"` in this example.
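+
+As a quick illustration of how those spans end up as the :obj:`offsets` of an
+:class:`~tokenizers.Encoding`, here is a minimal sketch using the tokenizer we loaded above (the
+exact tokens depend on the vocabulary you trained):
+
+.. code-block:: python
+
+    output = tokenizer.encode("Hello! How are you?")
+    # One (start, end) span into the original text for each token of the encoding:
+    print(output.offsets)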
+
+You can combine any :class:`~tokenizers.pre_tokenizers.PreTokenizer` objects together. For
+instance, here is a pre-tokenizer that will split on whitespace, punctuation and digits, separating
+numbers into their individual digits:
+
+.. code-block:: python
+
+    import tokenizers
+    from tokenizers.pre_tokenizers import Digits, Whitespace
+
+    pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
+        Whitespace(),
+        Digits(individual_digits=True),
+    ])
+    pre_tokenizer.pre_tokenize_str("Call 911!")
+    # [("Call", (0, 4)), ("9", (5, 6)), ("1", (6, 7)), ("1", (7, 8)), ("!", (8, 9))]
+
+As we saw in the :doc:`quicktour`, you can customize the pre-tokenizer of a
+:class:`~tokenizers.Tokenizer` by just changing the corresponding attribute:
+
+.. code-block:: python
+
+    tokenizer.pre_tokenizer = pre_tokenizer
+
+Of course, if you change the pre-tokenizer, you should probably retrain your tokenizer from scratch
+afterward.
+
+
+.. _model:
+
+The Model
 ----------------------------------------------------------------------------------------------------
 
+Once the input texts are normalized and pre-tokenized, we can apply the model to the pre-tokens.
+This is the part of the pipeline that needs training on your corpus (or that has been trained if you
+are using a pretrained tokenizer).
+
+The role of the model is to split your "words" into tokens, using the rules it has learned. It's
+also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model
+(see the short example after the list below).
+
+The model is passed along when initializing the :class:`~tokenizers.Tokenizer`, so you already know
+how to customize this part. Currently, the 🤗 Tokenizers library supports:
+
+- :class:`~tokenizers.models.BPE` (Byte-Pair Encoding)
+- :class:`~tokenizers.models.Unigram` (for SentencePiece tokenizers)
+- :class:`~tokenizers.models.WordLevel` (for just returning the result of the pre-tokenization)
+- :class:`~tokenizers.models.WordPiece` (the classic BERT tokenizer)
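+
+For instance, here is a quick look at the token/ID mapping of the tokenizer we loaded above (a
+minimal sketch; the exact tokens and IDs depend on the vocabulary you trained):
+
+.. code-block:: python
+
+    output = tokenizer.encode("Hello, y'all!")
+    # The tokens and IDs below come from the model's learned vocabulary
+    # (plus any special tokens added by the post-processor):
+    print(output.tokens)
+    print(output.ids)
+
+    # The token <-> ID mapping is also exposed directly on the tokenizer:
+    print(tokenizer.token_to_id("[SEP]"))  # returns None if the token is not in the vocabulary
+    print(tokenizer.id_to_token(0))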
+
 .. _post-processing:
 
 Post-Processing
 ----------------------------------------------------------------------------------------------------
 
+Post-processing is the last step of the tokenization pipeline: it performs any additional
+transformations on the :class:`~tokenizers.Encoding` before it's returned, like adding potential
+special tokens.
+
+As we saw in the :doc:`quicktour`, we can customize the post-processor of a
+:class:`~tokenizers.Tokenizer` by setting the corresponding attribute. For instance, here is how we
+can post-process the inputs to make them suitable for the BERT model:
+
+.. code-block:: python
+
+    from tokenizers.processors import TemplateProcessing
+
+    tokenizer.post_processor = TemplateProcessing(
+        single="[CLS] $A [SEP]",
+        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
+        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+    )
+
+Note that, unlike for the pre-tokenizer or the normalizer, you don't need to retrain a tokenizer
+after changing its post-processor.
+
+.. _example:
+
+All together: a BERT tokenizer from scratch
+----------------------------------------------------------------------------------------------------
+
+Let's put all those pieces together to build a BERT tokenizer. First, BERT relies on WordPiece, so
+we instantiate a new :class:`~tokenizers.Tokenizer` with this model:
+
+.. code-block:: python
+
+    from tokenizers import Tokenizer
+    from tokenizers.models import WordPiece
+
+    bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
+
+Then we know that BERT preprocesses texts by removing accents and lowercasing. We also use a Unicode
+normalizer (NFD), so that :class:`~tokenizers.normalizers.StripAccents` can properly detect the
+accents to remove:
+
+.. code-block:: python
+
+    import tokenizers
+    from tokenizers.normalizers import Lowercase, NFD, StripAccents
+
+    bert_tokenizer.normalizer = tokenizers.normalizers.Sequence([
+        NFD(), Lowercase(), StripAccents()
+    ])
+
+The pre-tokenizer is just splitting on whitespace and punctuation:
+
+.. code-block:: python
+
+    from tokenizers.pre_tokenizers import Whitespace
+
+    bert_tokenizer.pre_tokenizer = Whitespace()
+
+And the post-processing uses the template we saw in the previous section:
+
+.. code-block:: python
+
+    from tokenizers.processors import TemplateProcessing
+
+    bert_tokenizer.post_processor = TemplateProcessing(
+        single="[CLS] $A [SEP]",
+        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
+        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+    )
+
+We can use this tokenizer and train it on wikitext like in the :doc:`quicktour`:
+
+.. code-block:: python
+
+    from tokenizers.trainers import WordPieceTrainer
+
+    trainer = WordPieceTrainer(
+        vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
+    )
+    files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
+    bert_tokenizer.train(trainer, files)
+
+    model_files = bert_tokenizer.model.save("pretrained", "bert-wiki")
+    bert_tokenizer.model = WordPiece(*model_files, unk_token="[UNK]")
+
+    bert_tokenizer.save("pretrained/bert-wiki.json")
+
 .. _decoding:
 
 Decoding
 ----------------------------------------------------------------------------------------------------
 
+On top of encoding the input texts, a :class:`~tokenizers.Tokenizer` also has an API for decoding,
+that is, converting IDs generated by your model back to text. This is done by the methods
+:meth:`~tokenizers.Tokenizer.decode` (for one predicted text) and
+:meth:`~tokenizers.Tokenizer.decode_batch` (for a batch of predictions).
+
+The `decoder` will first convert the IDs back to tokens (using the tokenizer's vocabulary) and
+remove all special tokens, then join those tokens with spaces:
+
+.. code-block:: python
+
+    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
+    print(output.ids)
+    # [27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35]
+
+    tokenizer.decode([27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35])
+    # "Hello , y ' all ! How are you ?"
+
+If you used a model that adds special characters to represent subtokens of a given "word" (like the
+:obj:`"##"` in WordPiece), you will need to customize the `decoder` to treat them properly. If we
+take our previous :obj:`bert_tokenizer` for instance, the default decoding will give:
+
+.. code-block:: python
+
+    output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
+    print(output.tokens)
+    # ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]
+
+    bert_tokenizer.decode(output.ids)
+    # "welcome to the tok ##eni ##zer ##s library ."
+
+But by changing it to a proper decoder, we get:
+
+.. code-block:: python
+
+    bert_tokenizer.decoder = tokenizers.decoders.WordPiece()
+    bert_tokenizer.decode(output.ids)
+    # "welcome to the tokenizers library."
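+
+If your model returns several predictions at once, :meth:`~tokenizers.Tokenizer.decode_batch` works
+the same way on a list of sequences of IDs. A minimal sketch, reusing the encoding from above:
+
+.. code-block:: python
+
+    # Decode several sequences at once; special tokens are skipped by default.
+    bert_tokenizer.decode_batch([output.ids, output.ids])
+    # ["welcome to the tokenizers library.", "welcome to the tokenizers library."]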