Components
====================================================================================================

When building a Tokenizer, you can attach various types of components to this Tokenizer in order
to customize its behavior. This page lists most provided components.
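
As a quick orientation, here is a minimal sketch (using the Python bindings) of how these
components are attached to a ``Tokenizer``; the specific components chosen below are purely
illustrative::

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.normalizers import NFKC
    from tokenizers.pre_tokenizers import Whitespace

    # The Model is the only mandatory component; the others are optional attributes
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.normalizer = NFKC()
    tokenizer.pre_tokenizer = Whitespace()
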

Normalizers
----------------------------------------------------------------------------------------------------

A ``Normalizer`` is in charge of pre-processing the input string in order to normalize it as
relevant for a given use case. Some common examples of normalization are the Unicode normalization
algorithms (NFD, NFKD, NFC & NFKC), lowercasing, etc.

The specificity of ``tokenizers`` is that we keep track of the alignment while normalizing. This
is essential to allow mapping from the generated tokens back to the input text.

The ``Normalizer`` is optional.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - NFD
     - NFD unicode normalization
     -

   * - NFKD
     - NFKD unicode normalization
     -

   * - NFC
     - NFC unicode normalization
     -

   * - NFKC
     - NFKC unicode normalization
     -

   * - Lowercase
     - Replaces all uppercase characters with their lowercase equivalent
     - Input: ``HELLO ὈΔΥΣΣΕΎΣ``

       Output: ``hello ὀδυσσεύς``

   * - Strip
     - Removes all whitespace characters on the specified sides (left, right or both) of the input
     - Input: ``" hi "``

       Output: ``"hi"``

   * - StripAccents
     - Removes all accent symbols in unicode (to be used with NFD for consistency)
     - Input: ``é``

       Output: ``e``

   * - Replace
     - Replaces a custom string or regexp with the given content
     - ``Replace("a", "e")`` will behave like this:

       Input: ``"banana"``

       Output: ``"benene"``

   * - Sequence
     - Composes multiple normalizers that will run in the provided order
     - Example::

           Sequence([Nmt(), NFKC()])
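
Normalizers can also be tried out on their own through ``normalize_str``. A minimal sketch,
assuming the Python bindings (the output shown in the comment is indicative)::

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    # Compose several normalizers; they run in the given order
    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])

    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # "hello how are u?"
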

Pre tokenizers
----------------------------------------------------------------------------------------------------

The ``PreTokenizer`` takes care of splitting the input according to a set of rules. This
pre-processing lets you ensure that the underlying ``Model`` does not build tokens across multiple
"splits". For example, if you don't want to have whitespaces inside a token, then you can have a
``PreTokenizer`` that splits on these whitespaces.

You can easily combine multiple ``PreTokenizer`` together using a ``Sequence`` (see below).
The ``PreTokenizer`` is also allowed to modify the string, just like a ``Normalizer`` does. This
is necessary to allow some complicated algorithms that require splitting before normalizing (e.g.
the ByteLevel).

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - ByteLevel
     - Splits on whitespaces while remapping all the bytes to a set of visible characters. This
       technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties:

       - Since it maps on bytes, a tokenizer using this only requires **256** characters as initial
         alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode
         characters.
       - A consequence of the previous point is that it is absolutely unnecessary to have an
         unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)
       - For non-ASCII characters, it gets completely unreadable, but it works nonetheless!

     - Input: ``"Hello my friend, how are you?"``

       Output: ``"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``

   * - Whitespace
     - Splits on word boundaries (using the following regular expression: ``\w+|[^\w\s]+``)
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there", "!"``

   * - WhitespaceSplit
     - Splits on any whitespace character
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there!"``

   * - Punctuation
     - Will isolate all punctuation characters
     - Input: ``"Hello?"``

       Output: ``"Hello", "?"``

   * - Metaspace
     - Splits on whitespaces and replaces them with a special char "▁" (U+2581)
     - Input: ``"Hello there"``

       Output: ``"Hello", "▁there"``

   * - CharDelimiterSplit
     - Splits on a given character
     - Example with ``x``:

       Input: ``"Helloxthere"``

       Output: ``"Hello", "there"``

   * - Sequence
     - Lets you compose multiple ``PreTokenizer`` that will be run in the given order
     - ``Sequence([Punctuation(), WhitespaceSplit()])``
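
Pre-tokenizers can be inspected on their own through ``pre_tokenize_str``, which returns the
pieces together with their offsets in the original string. A minimal sketch, assuming the Python
bindings::

    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Whitespace, Punctuation

    # A single pre-tokenizer
    print(Whitespace().pre_tokenize_str("Hello there!"))
    # [('Hello', (0, 5)), ('there', (6, 11)), ('!', (11, 12))]

    # Several pre-tokenizers combined with Sequence, run in the given order
    combined = pre_tokenizers.Sequence([Punctuation(), Whitespace()])
    print(combined.pre_tokenize_str("Hello, there!"))
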

Models
----------------------------------------------------------------------------------------------------

Models are the core algorithms used to actually tokenize, and therefore, they are the only mandatory
component of a Tokenizer.

.. list-table::
   :header-rows: 1

   * - Name
     - Description

   * - WordLevel
     - This is the "classic" tokenization algorithm. It lets you simply map words to IDs
       without anything fancy. This has the advantage of being really simple to use and
       understand, but it requires extremely large vocabularies for good coverage.

       *Using this* ``Model`` *requires the use of a* ``PreTokenizer``. *No choice will be made by
       this model directly, it simply maps input tokens to IDs.*

   * - BPE
     - One of the most popular subword tokenization algorithms. Byte-Pair-Encoding works by
       starting with characters and merging those that are most frequently seen together,
       thus creating new tokens. It then works iteratively to build new tokens out of the most
       frequent pairs it sees in a corpus.

       BPE is able to build words it has never seen by using multiple subword tokens, and thus
       requires smaller vocabularies, with less chance of having "unk" (unknown) tokens.

   * - WordPiece
     - This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in
       models like BERT. It uses a greedy algorithm that tries to build long words first, splitting
       into multiple tokens when entire words don't exist in the vocabulary. This is different from
       BPE, which starts from characters and builds bigger tokens when possible.

       It uses the famous ``##`` prefix to identify tokens that are part of a word (i.e. not
       starting a word).

   * - Unigram
     - Unigram is also a subword tokenization algorithm, and works by trying to identify the best
       set of subword tokens to maximize the probability for a given sentence. This is different
       from BPE in that it is not deterministic based on a set of rules applied sequentially.
       Instead, Unigram will be able to compute multiple ways of tokenizing, while choosing the
       most probable one.
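
Since the ``Model`` is the only mandatory component, it is what you pass to the ``Tokenizer``
constructor. A minimal sketch, assuming the Python bindings (the tiny in-memory corpus and the
``vocab_size`` are purely illustrative)::

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # The Model is given to the Tokenizer constructor
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Learn the BPE merges on a (toy) corpus
    trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=100)
    tokenizer.train_from_iterator(["a banana", "a bandana", "bananas and bandanas"], trainer)

    print(tokenizer.encode("banana").tokens)
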

PostProcessor
----------------------------------------------------------------------------------------------------

After the whole pipeline, we sometimes want to insert some special tokens before feeding
a tokenized string into a model, like "[CLS] My horse is amazing [SEP]". The ``PostProcessor``
is the component doing just that.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - TemplateProcessing
     - Lets you easily template the post processing, adding special tokens, and specifying
       the ``type_id`` for each sequence/special token. The template is given two strings
       representing the single sequence and the pair of sequences, as well as a set of
       special tokens to use.
     - Example, when specifying a template with these values:

       - single: ``"[CLS] $A [SEP]"``
       - pair: ``"[CLS] $A [SEP] $B [SEP]"``
       - special tokens:

         - ``"[CLS]"``
         - ``"[SEP]"``

       Input: ``("I like this", "but not this")``

       Output: ``"[CLS] I like this [SEP] but not this [SEP]"``
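
A minimal sketch of the template above, assuming the Python bindings (the ids ``1`` and ``2`` are
illustrative; in practice they would come from ``tokenizer.token_to_id(...)``)::

    from tokenizers.processors import TemplateProcessing

    post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
    )
    # Attach it to an existing tokenizer with: tokenizer.post_processor = post_processor
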

Decoders
----------------------------------------------------------------------------------------------------

The Decoder knows how to go from the IDs used by the Tokenizer, back to a readable piece of text.
Some ``Normalizer`` and ``PreTokenizer`` use special characters or identifiers that need to be
reverted, for example.

.. list-table::
   :header-rows: 1

   * - Name
     - Description

   * - ByteLevel
     - Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using
       a set of visible Unicode characters to represent each byte, so we need a Decoder to
       revert this process and get something readable again.

   * - Metaspace
     - Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier ``▁`` to
       identify whitespaces, and so this Decoder helps with decoding these.

   * - WordPiece
     - Reverts the WordPiece Model. This model uses a special identifier ``##`` for continuing
       subwords, and so this Decoder helps with decoding these.
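
A minimal sketch of attaching a decoder, assuming the Python bindings and an existing
WordPiece-based tokenizer saved at the hypothetical path ``tokenizer.json``::

    from tokenizers import Tokenizer, decoders

    tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical, WordPiece-based tokenizer
    tokenizer.decoder = decoders.WordPiece()

    ids = tokenizer.encode("unaffable").ids
    # With a WordPiece decoder, "##"-continued subwords are glued back into full words
    print(tokenizer.decode(ids))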