From 3591b3ca17aaf552bd725f1088d8f1ff36e8e32b Mon Sep 17 00:00:00 2001
From: Sylvain Gugger
Date: Wed, 7 Oct 2020 17:32:04 -0400
Subject: [PATCH] Typos + pipeline beginning

---
 docs/source/pipeline.rst  | 95 ++++++++++++++++++++++++++++++++++++---
 docs/source/quicktour.rst |  4 +-
 2 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/docs/source/pipeline.rst b/docs/source/pipeline.rst
index 55ff00e2..31dab2ac 100644
--- a/docs/source/pipeline.rst
+++ b/docs/source/pipeline.rst
@@ -1,10 +1,93 @@
 The tokenization pipeline
 ====================================================================================================
 
-TODO: Describe the tokenization pipeline:
+When calling :meth:`~tokenizers.Tokenizer.encode` or :meth:`~tokenizers.Tokenizer.encode_batch`, the
+input text(s) go through the following pipeline:
+
+- :ref:`normalization`
+- :ref:`pre-tokenization`
+- :ref:`tokenization`
+- :ref:`post-processing`
+
+We'll see in detail what happens during each of those steps, as well as when you want to
+:ref:`decode <decoding>` some token ids, and how the 🤗 Tokenizers library allows you to customize
+each of those steps to your needs.
+
+For the examples that require a :class:`~tokenizers.Tokenizer`, we will use the tokenizer we trained
+in the :doc:`quicktour`, which you can load with:
+
+.. code-block:: python
+
+    from tokenizers import Tokenizer
+
+    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
+
+
+.. _normalization:
+
+Normalization
+----------------------------------------------------------------------------------------------------
+
+Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less
+random or "cleaner". Common operations include stripping whitespace, removing accents or lowercasing
+all text. If you're familiar with `unicode normalization
+<https://unicode.org/reports/tr15/>`__, know that it is also a very common normalization operation
+applied in most tokenizers.
+
+Each normalization operation is represented in the 🤗 Tokenizers library by a
+:class:`~tokenizers.normalizers.Normalizer`, and you can combine several of those by using a
+:class:`~tokenizers.normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
+and removing accents as an example:
+
+.. code-block:: python
+
+    from tokenizers import normalizers
+    from tokenizers.normalizers import NFD, StripAccents
+
+    normalizer = normalizers.Sequence([NFD(), StripAccents()])
+
+You can apply that normalizer to any string with the
+:meth:`~tokenizers.normalizers.Normalizer.normalize_str` method:
+
+.. code-block:: python
+
+    normalizer.normalize_str("Héllò hôw are ü?")
+    # "Hello how are u?"
+
+When building a :class:`~tokenizers.Tokenizer`, you can customize its normalizer by just changing
+the corresponding attribute:
+
+.. code-block:: python
+
+    tokenizer.normalizer = normalizer
+
+Of course, if you change the way a tokenizer applies normalization, you should probably retrain it
+from scratch afterward.
+
+.. _pre-tokenization:
+
+Pre-Tokenization
+----------------------------------------------------------------------------------------------------
+
+Pre-tokenization is the act of splitting a text into smaller objects that give an upper bound to
+what your tokens will be at the end of training. A good way to think of this is that the
+pre-tokenizer will split your text into "words", and then your final tokens will be parts of those
+words.
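+
+Each pre-tokenization operation is represented in the 🤗 Tokenizers library by a
+:class:`~tokenizers.pre_tokenizers.PreTokenizer`. As a minimal sketch (using the
+:class:`~tokenizers.pre_tokenizers.Whitespace` pre-tokenizer, which splits on whitespace and
+punctuation, purely as an example), you can preview how a text would be split with the
+:meth:`~tokenizers.pre_tokenizers.PreTokenizer.pre_tokenize_str` method:
+
+.. code-block:: python
+
+    from tokenizers.pre_tokenizers import Whitespace
+
+    pre_tokenizer = Whitespace()
+    pre_tokenizer.pre_tokenize_str("Hello, how are you?")
+    # roughly: [("Hello", (0, 5)), (",", (5, 6)), ("how", (7, 10)),
+    #           ("are", (11, 14)), ("you", (15, 18)), ("?", (18, 19))]
+
+As with the normalizer, you can customize the pre-tokenizer of a :class:`~tokenizers.Tokenizer` by
+changing the corresponding attribute (and, as before, you should probably retrain the tokenizer
+from scratch if you do):
+
+.. code-block:: python
+
+    tokenizer.pre_tokenizer = pre_tokenizer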
+
+.. _tokenization:
+
+Tokenization
+----------------------------------------------------------------------------------------------------
+
+
+.. _post-processing:
+
+Post-Processing
+----------------------------------------------------------------------------------------------------
+
+
+.. _decoding:
+
+Decoding
+----------------------------------------------------------------------------------------------------
 
-- Normalization
-- Pre-tokenization
-- Tokenization
-- Post-processing
-- Decoding
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
index aff9d745..8454677a 100644
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -105,7 +105,7 @@ class method:
 
 .. code-block:: python
 
-    tokenizer = Tokenizer.from_file("tst-tokenizer/wiki-trained.json")
+    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
 
 Using the tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -297,5 +297,5 @@ as long as you have downloaded the file `bert-base-uncased-vocab.txt` with
 
 .. note::
 
-    Better support for pretrained tokenziers is coming in a next release, so expect this API to
+    Better support for pretrained tokenizers is coming in a next release, so expect this API to
     change soon.