Components
====================================================================================================

When building a Tokenizer, you can attach various types of components to this Tokenizer in order
to customize its behavior. This page lists most provided components.
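
As a quick orientation, here is a minimal sketch (using the Python bindings) of how these
components are attached to a ``Tokenizer``; the specific components chosen below are purely
illustrative::

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.normalizers import NFKC
    from tokenizers.pre_tokenizers import Whitespace

    # The Model is the only mandatory component; the others are optional attributes
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.normalizer = NFKC()
    tokenizer.pre_tokenizer = Whitespace()
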

Normalizers
----------------------------------------------------------------------------------------------------

A ``Normalizer`` is in charge of pre-processing the input string in order to normalize it as
relevant for a given use case. Some common examples of normalization are the Unicode normalization
algorithms (NFD, NFKD, NFC & NFKC), lowercasing, etc.

The specificity of ``tokenizers`` is that we keep track of the alignment while normalizing. This
is essential to allow mapping from the generated tokens back to the input text.

The ``Normalizer`` is optional.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - NFD
     - NFD unicode normalization
     -

   * - NFKD
     - NFKD unicode normalization
     -

   * - NFC
     - NFC unicode normalization
     -

   * - NFKC
     - NFKC unicode normalization
     -

   * - Lowercase
     - Replaces all uppercase characters with their lowercase equivalent
     - Input: ``HELLO ὈΔΥΣΣΕΎΣ``

       Output: ``hello ὀδυσσεύς``

   * - Strip
     - Removes all whitespace characters on the specified sides (left, right or both) of the input
     - Input: ``" hi "``

       Output: ``"hi"``

   * - StripAccents
     - Removes all accent symbols in unicode (to be used with NFD for consistency)
     - Input: ``é``

       Output: ``e``

   * - Replace
     - Replaces a custom string or regexp with the given content
     - ``Replace("a", "e")`` will behave like this:

       Input: ``"banana"``

       Output: ``"benene"``

   * - Sequence
     - Composes multiple normalizers that will run in the provided order
     - Example::

           Sequence([Nmt(), NFKC()])
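
Normalizers can also be tried out on their own through ``normalize_str``. A minimal sketch,
assuming the Python bindings (the output shown in the comment is indicative)::

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    # Compose several normalizers; they run in the given order
    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])

    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # "hello how are u?"
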

Pre tokenizers
----------------------------------------------------------------------------------------------------

The ``PreTokenizer`` takes care of splitting the input according to a set of rules. This
pre-processing lets you ensure that the underlying ``Model`` does not build tokens across multiple
"splits". For example, if you don't want to have whitespaces inside a token, then you can have a
``PreTokenizer`` that splits on these whitespaces.

You can easily combine multiple ``PreTokenizer`` together using a ``Sequence`` (see below).
The ``PreTokenizer`` is also allowed to modify the string, just like a ``Normalizer`` does. This
is necessary to allow some complicated algorithms that require splitting before normalizing (e.g.
the ByteLevel).

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - ByteLevel
     - Splits on whitespaces while remapping all the bytes to a set of visible characters. This
       technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties:

       - Since it maps on bytes, a tokenizer using this only requires **256** characters as initial
         alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode
         characters.
       - A consequence of the previous point is that it is absolutely unnecessary to have an
         unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)
       - For non-ASCII characters, it gets completely unreadable, but it works nonetheless!

     - Input: ``"Hello my friend, how are you?"``

       Output: ``"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``

   * - Whitespace
     - Splits on word boundaries (using the following regular expression: ``\w+|[^\w\s]+``)
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there", "!"``

   * - WhitespaceSplit
     - Splits on any whitespace character
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there!"``

   * - Punctuation
     - Will isolate all punctuation characters
     - Input: ``"Hello?"``

       Output: ``"Hello", "?"``

   * - Metaspace
     - Splits on whitespaces and replaces them with a special char "▁" (U+2581)
     - Input: ``"Hello there"``

       Output: ``"Hello", "▁there"``

   * - CharDelimiterSplit
     - Splits on a given character
     - Example with ``x``:

       Input: ``"Helloxthere"``

       Output: ``"Hello", "there"``

   * - Sequence
     - Lets you compose multiple ``PreTokenizer`` that will be run in the given order
     - ``Sequence([Punctuation(), WhitespaceSplit()])``
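
Pre-tokenizers can be inspected on their own through ``pre_tokenize_str``, which returns the
pieces together with their offsets in the original string. A minimal sketch, assuming the Python
bindings::

    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Whitespace, Punctuation

    # A single pre-tokenizer
    print(Whitespace().pre_tokenize_str("Hello there!"))
    # [('Hello', (0, 5)), ('there', (6, 11)), ('!', (11, 12))]

    # Several pre-tokenizers combined with Sequence, run in the given order
    combined = pre_tokenizers.Sequence([Punctuation(), Whitespace()])
    print(combined.pre_tokenize_str("Hello, there!"))
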

Models
----------------------------------------------------------------------------------------------------

Models are the core algorithms used to actually tokenize, and therefore, they are the only mandatory
component of a Tokenizer.

.. list-table::
   :header-rows: 1

   * - Name
     - Description

   * - WordLevel
     - This is the "classic" tokenization algorithm. It lets you simply map words to IDs
       without anything fancy. This has the advantage of being really simple to use and
       understand, but it requires extremely large vocabularies for good coverage.

       *Using this* ``Model`` *requires the use of a* ``PreTokenizer``. *No choice will be made by
       this model directly, it simply maps input tokens to IDs.*

   * - BPE
     - One of the most popular subword tokenization algorithms. Byte-Pair-Encoding works by
       starting with characters and merging those that are most frequently seen together,
       thus creating new tokens. It then works iteratively to build new tokens out of the most
       frequent pairs it sees in a corpus.

       BPE is able to build words it has never seen by using multiple subword tokens, and thus
       requires smaller vocabularies, with less chance of having "unk" (unknown) tokens.

   * - WordPiece
     - This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in
       models like BERT. It uses a greedy algorithm that tries to build long words first, splitting
       into multiple tokens when entire words don't exist in the vocabulary. This is different from
       BPE, which starts from characters and builds bigger tokens when possible.

       It uses the famous ``##`` prefix to identify tokens that are part of a word (i.e. not
       starting a word).

   * - Unigram
     - Unigram is also a subword tokenization algorithm, and works by trying to identify the best
       set of subword tokens to maximize the probability for a given sentence. This is different
       from BPE in that it is not deterministic based on a set of rules applied sequentially.
       Instead, Unigram will be able to compute multiple ways of tokenizing, while choosing the
       most probable one.
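
Since the ``Model`` is the only mandatory component, it is what you pass to the ``Tokenizer``
constructor. A minimal sketch, assuming the Python bindings (the tiny in-memory corpus and the
``vocab_size`` are purely illustrative)::

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # The Model is given to the Tokenizer constructor
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Learn the BPE merges on a (toy) corpus
    trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=100)
    tokenizer.train_from_iterator(["a banana", "a bandana", "bananas and bandanas"], trainer)

    print(tokenizer.encode("banana").tokens)
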

PostProcessor
----------------------------------------------------------------------------------------------------

After the whole pipeline, we sometimes want to insert some special tokens before feeding
a tokenized string into a model, like "[CLS] My horse is amazing [SEP]". The ``PostProcessor``
is the component doing just that.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - TemplateProcessing
     - Lets you easily template the post processing, adding special tokens, and specifying
       the ``type_id`` for each sequence/special token. The template is given two strings
       representing the single sequence and the pair of sequences, as well as a set of
       special tokens to use.
     - Example, when specifying a template with these values:

       - single: ``"[CLS] $A [SEP]"``
       - pair: ``"[CLS] $A [SEP] $B [SEP]"``
       - special tokens:

         - ``"[CLS]"``
         - ``"[SEP]"``

       Input: ``("I like this", "but not this")``

       Output: ``"[CLS] I like this [SEP] but not this [SEP]"``
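
A minimal sketch of the template above, assuming the Python bindings (the ids ``1`` and ``2`` are
illustrative; in practice they would come from ``tokenizer.token_to_id(...)``)::

    from tokenizers.processors import TemplateProcessing

    post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
    )
    # Attach it to an existing tokenizer with: tokenizer.post_processor = post_processor
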

Decoders
----------------------------------------------------------------------------------------------------

The Decoder knows how to go from the IDs used by the Tokenizer, back to a readable piece of text.
Some ``Normalizer`` and ``PreTokenizer`` use special characters or identifiers that need to be
reverted, for example.

.. list-table::
   :header-rows: 1

   * - Name
     - Description

   * - ByteLevel
     - Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using
       a set of visible Unicode characters to represent each byte, so we need a Decoder to
       revert this process and get something readable again.

   * - Metaspace
     - Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier ``▁`` to
       identify whitespaces, and so this Decoder helps with decoding these.

   * - WordPiece
     - Reverts the WordPiece Model. This model uses a special identifier ``##`` for continuing
       subwords, and so this Decoder helps with decoding these.
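
A minimal sketch of attaching a decoder, assuming the Python bindings and an existing
WordPiece-based tokenizer saved at the hypothetical path ``tokenizer.json``::

    from tokenizers import Tokenizer, decoders

    tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical, WordPiece-based tokenizer
    tokenizer.decoder = decoders.WordPiece()

    ids = tokenizer.encode("unaffable").ids
    # With a WordPiece decoder, "##"-continued subwords are glued back into full words
    print(tokenizer.decode(ids))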