From 3591b3ca17aaf552bd725f1088d8f1ff36e8e32b Mon Sep 17 00:00:00 2001
From: Sylvain Gugger
Date: Wed, 7 Oct 2020 17:32:04 -0400
Subject: [PATCH] Typos + pipeline beginning

---
 docs/source/pipeline.rst  | 95 ++++++++++++++++++++++++++++++++++++---
 docs/source/quicktour.rst |  4 +-
 2 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/docs/source/pipeline.rst b/docs/source/pipeline.rst
index 55ff00e2..31dab2ac 100644
--- a/docs/source/pipeline.rst
+++ b/docs/source/pipeline.rst
@@ -1,10 +1,93 @@
 The tokenization pipeline
 ====================================================================================================
 
-TODO: Describe the tokenization pipeline:
+When calling :meth:`~tokenizers.Tokenizer.encode` or :meth:`~tokenizers.Tokenizer.encode_batch`, the
+input text(s) go through the following pipeline:
+
+- :ref:`normalization`
+- :ref:`pre-tokenization`
+- :ref:`tokenization`
+- :ref:`post-processing`
+
+We'll see in detail what happens during each of those steps, as well as when you want to
+:ref:`decode <decoding>` some token ids, and how the 🤗 Tokenizers library allows you to customize
+each of those steps to your needs.
+
+For the examples that require a :class:`~tokenizers.Tokenizer`, we will use the tokenizer we trained
+in the :doc:`quicktour`, which you can load with:
+
+.. code-block:: python
+
+    from tokenizers import Tokenizer
+
+    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
+
+
+.. _normalization:
+
+Normalization
+----------------------------------------------------------------------------------------------------
+
+Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less
+random or "cleaner". Common operations include stripping whitespace, removing accents or lowercasing
+all text. If you're familiar with `unicode normalization
+<https://unicode.org/reports/tr15/>`__, know that it is also a very common normalization operation
+applied in most tokenizers.
+
+Each normalization operation is represented in the 🤗 Tokenizers library by a
+:class:`~tokenizers.normalizers.Normalizer`, and you can combine several of those by using a
+:class:`~tokenizers.normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
+and removing accents as an example:
+
+.. code-block:: python
+
+    from tokenizers import normalizers
+    from tokenizers.normalizers import NFD, StripAccents
+
+    normalizer = normalizers.Sequence([NFD(), StripAccents()])
+
+You can apply that normalizer to any string with the
+:meth:`~tokenizers.normalizers.Normalizer.normalize_str` method:
+
+.. code-block:: python
+
+    normalizer.normalize_str("Héllò hôw are ü?")
+    # "Hello how are u?"
+
+When building a :class:`~tokenizers.Tokenizer`, you can customize its normalizer by just changing
+the corresponding attribute:
+
+.. code-block:: python
+
+    tokenizer.normalizer = normalizer
+
+Of course, if you change the way a tokenizer applies normalization, you should probably retrain it
+from scratch afterward.
+
+.. _pre-tokenization:
+
+Pre-Tokenization
+----------------------------------------------------------------------------------------------------
+
+Pre-tokenization is the act of splitting a text into smaller objects that give an upper bound to
+what your tokens will be at the end of training. A good way to think of this is that the
+pre-tokenizer will split your text into "words", and then your final tokens will be parts of those
+words.
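+
+Each pre-tokenization operation is represented in the 🤗 Tokenizers library by a
+:class:`~tokenizers.pre_tokenizers.PreTokenizer`. As a minimal sketch (using the
+:class:`~tokenizers.pre_tokenizers.Whitespace` pre-tokenizer, which splits on whitespace and
+punctuation, purely as an example), you can preview how a text would be split with the
+:meth:`~tokenizers.pre_tokenizers.PreTokenizer.pre_tokenize_str` method:
+
+.. code-block:: python
+
+    from tokenizers.pre_tokenizers import Whitespace
+
+    pre_tokenizer = Whitespace()
+    pre_tokenizer.pre_tokenize_str("Hello, how are you?")
+    # roughly: [("Hello", (0, 5)), (",", (5, 6)), ("how", (7, 10)),
+    #           ("are", (11, 14)), ("you", (15, 18)), ("?", (18, 19))]
+
+As with the normalizer, you can customize the pre-tokenizer of a :class:`~tokenizers.Tokenizer` by
+changing the corresponding attribute (and, as before, you should probably retrain the tokenizer
+from scratch if you do):
+
+.. code-block:: python
+
+    tokenizer.pre_tokenizer = pre_tokenizer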
+
+.. _tokenization:
+
+Tokenization
+----------------------------------------------------------------------------------------------------
+
+
+.. _post-processing:
+
+Post-Processing
+----------------------------------------------------------------------------------------------------
+
+
+.. _decoding:
+
+Decoding
+----------------------------------------------------------------------------------------------------
 
-- Normalization
-- Pre-tokenization
-- Tokenization
-- Post-processing
-- Decoding
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
index aff9d745..8454677a 100644
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -105,7 +105,7 @@ class method:
 
 .. code-block:: python
 
-    tokenizer = Tokenizer.from_file("tst-tokenizer/wiki-trained.json")
+    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
 
 Using the tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -297,5 +297,5 @@ as long as you have downloaded the file `bert-base-uncased-vocab.txt` with
 
 .. note::
 
-    Better support for pretrained tokenziers is coming in a next release, so expect this API to
+    Better support for pretrained tokenizers is coming in a next release, so expect this API to
     change soon.