Doc - Quick updates and typos

This commit is contained in:
Anthony MOI
2020-10-09 11:04:12 -04:00
committed by Anthony MOI
parent 403a028275
commit 12af3f2240
4 changed files with 32 additions and 72 deletions

View File

@ -4,6 +4,7 @@ Components
When building a Tokenizer, you can attach various types of components to it in order
to customize its behavior. This page lists most of the provided components.
.. _normalizers:
Normalizers
----------------------------------------------------------------------------------------------------
@ -71,6 +72,8 @@ The ``Normalizer`` is optional.
Sequence([Nmt(), NFKC()])
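As a rough illustration (assuming the Python bindings), such a composed normalizer can be built and tried out directly; ``Nmt``, ``NFKC`` and ``Sequence`` all live in ``tokenizers.normalizers``:

.. code-block:: python

    from tokenizers.normalizers import Sequence, Nmt, NFKC

    # Compose the two normalizers; they are applied in order.
    normalizer = Sequence([Nmt(), NFKC()])

    # normalize_str lets you inspect the effect on a raw string.
    print(normalizer.normalize_str("ﬁne"))  # NFKC turns the "ﬁ" ligature into "fi", giving "fine"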
.. _pre-tokenizers:
Pre tokenizers
----------------------------------------------------------------------------------------------------
@ -144,6 +147,8 @@ the ByteLevel)
- ``Sequence([Punctuation(), WhitespaceSplit()])``
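A minimal sketch of what combining those two pre-tokenizers looks like in the Python bindings; ``pre_tokenize_str`` returns the produced pieces together with their offsets in the original string:

.. code-block:: python

    from tokenizers.pre_tokenizers import Sequence, Punctuation, WhitespaceSplit

    # Isolate punctuation first, then split on whitespace.
    pre_tokenizer = Sequence([Punctuation(), WhitespaceSplit()])

    print(pre_tokenizer.pre_tokenize_str("Hello, world!"))
    # Roughly: [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]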
.. _models:
Models
----------------------------------------------------------------------------------------------------
@ -191,6 +196,8 @@ component of a Tokenizer.
choosing the most probable one.
.. _post-processors:
PostProcessor
----------------------------------------------------------------------------------------------------
@ -223,6 +230,8 @@ is the component doing just that.
Output: ``"[CLS] I like this [SEP] but not this [SEP]"``
.. _decoders:
Decoders
----------------------------------------------------------------------------------------------------

View File

@ -37,60 +37,3 @@ Main features:
:caption: API Reference
api/reference
Load an existing tokenizer:
----------------------------------------------------------------------------------------------------
Loading a previously saved tokenizer is extremely simple and requires a single line of code:
.. only:: rust
.. literalinclude:: ../../tokenizers/tests/documentation.rs
:language: rust
:start-after: START load_tokenizer
:end-before: END load_tokenizer
:dedent: 4
.. only:: python
.. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
:language: python
:start-after: START load_tokenizer
:end-before: END load_tokenizer
:dedent: 4
.. only:: node
.. literalinclude:: ../../bindings/node/examples/load.test.js
:language: javascript
:start-after: START load
:end-before: END load
:dedent: 4
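In the Python bindings, that single line boils down to something like the following (the path is only illustrative):

.. code-block:: python

    from tokenizers import Tokenizer

    # Reload a tokenizer previously saved with tokenizer.save(...)
    tokenizer = Tokenizer.from_file("path/to/tokenizer.json")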
Train a tokenizer:
----------------------------------------------------------------------------------------------------
.. only:: rust
.. literalinclude:: ../../tokenizers/tests/documentation.rs
:language: rust
:start-after: START train_tokenizer
:end-before: END train_tokenizer
:dedent: 4
.. only:: python
.. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
:language: python
:start-after: START train_tokenizer
:end-before: END train_tokenizer
:dedent: 4
.. only:: node
.. literalinclude:: ../../bindings/node/examples/train.test.js
:language: javascript
:start-after: START train_tokenizer
:end-before: END train_tokenizer
:dedent: 4
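As a rough Python sketch of what those training snippets do (the file paths are illustrative, and the trainer options shown are just common choices):

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # An untrained BPE tokenizer with a simple whitespace pre-tokenizer.
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]

    # Same train call as the one shown in the quicktour.
    tokenizer.train(trainer, files)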

View File

@ -31,7 +31,7 @@ Normalization
Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less
random or "cleaner". Common operations include stripping whitespace, removing accented characters
or lowercasing all text. If you're familiar with `unicode normalization
or lowercasing all text. If you're familiar with `Unicode normalization
<https://unicode.org/reports/tr15>`__, it is also a very common normalization operation applied
in most tokenizers.
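As a quick example of what this looks like with the Python bindings (the exact normalizers you pick are up to you):

.. code-block:: python

    from tokenizers.normalizers import Sequence, NFD, StripAccents, Lowercase

    # Decompose accented characters, drop the accents, then lowercase.
    normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # "hello how are u?"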
@ -102,7 +102,7 @@ numbers in their individual digits:
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
Whitespace(),
Whitespace(),
Digits(individual_digits=True),
])
pre_tokenizer.pre_tokenize_str("Call 911!")
@ -128,16 +128,18 @@ Once the input texts are normalized and pre-tokenized, we can apply the model on
This is the part of the pipeline that needs training on your corpus (or that has been trained if you
are using a pretrained tokenizer).
The role of the models is to split your "words" into tokens, using the rules it has learned. It's
The role of the model is to split your "words" into tokens, using the rules it has learned. It's
also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model.
This model is passed along when initializing the :class:`~tokenizers.Tokenizer` so you already know
how to customize this part. Currently, the 🤗 Tokenizers library supports:
- :class:`~tokenizers.models.BPE` (Byte-Pair Encoding)
- :class:`~tokenizers.models.Unigram` (for SentencePiece tokenizers)
- :class:`~tokenizers.models.WordLevel` (for just returning the result of the pre-tokenization)
- :class:`~tokenizers.models.WordPiece` (the classic BERT tokenizer)
- :class:`~tokenizers.models.BPE`
- :class:`~tokenizers.models.Unigram`
- :class:`~tokenizers.models.WordLevel`
- :class:`~tokenizers.models.WordPiece`
For more details about each model and its behavior, you can check `here <components.html#models>`__
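As a small reminder of what "passed along when initializing" means in the Python bindings (shown here with an untrained BPE model, as in the quicktour):

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # The model is the central, trainable component of the Tokenizer.
    # It can later be inspected or swapped via tokenizer.model.
    tokenizer = Tokenizer(BPE())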
.. _post-processing:
@ -210,7 +212,10 @@ And the post-processing uses the template we saw in the previous section:
bert_tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
special_tokens=[
("[CLS]", bert_tokenizer.token_to_id("[CLS]")),
("[SEP]", bert_tokenizer.token_to_id("[SEP]"))
],
)
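Once this tokenizer has been trained (as done right below), a quick way to see the template at work is to encode a pair of sequences; the exact tokens depend on the learned vocabulary, but the shape is fixed by the template:

.. code-block:: python

    output = bert_tokenizer.encode("I like this", "but not this")
    print(output.tokens)
    # e.g. ['[CLS]', 'i', 'like', 'this', '[SEP]', 'but', 'not', 'this', '[SEP]']
    print(output.type_ids)
    # [0, 0, 0, 0, 0, 1, 1, 1, 1]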
We can use this tokenizer and train it on wikitext like in the :doc:`quicktour`:

View File

@ -24,7 +24,7 @@ with:
Training the tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenzier. For more information
In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information
about the different types of tokenizers, check out this `guide
<https://huggingface.co/transformers/tokenizer_summary.html>`__ in the 🤗 Transformers
documentation. Here, training the tokenizer means it will learn merge rules by:
@ -84,7 +84,7 @@ to use:
tokenizer.train(trainer, files)
This should only take a few seconds to train our tokenizer on the full wikitext dataset! Once this
is done, we need to save the model and reinstantiate it with the unkown token, or this token won't
is done, we need to save the model and reinstantiate it with the unknown token, or this token won't
be used. This will be simplified in a future release to let you set the :obj:`unk_token` when
first instantiating the model.
@ -100,7 +100,7 @@ To save the tokenizer in one file that contains all its configuration and vocabu
tokenizer.save("pretrained/wiki.json")
and you can reload your tokenzier from that file with the :meth:`~tokenizers.Tokenizer.from_file`
and you can reload your tokenizer from that file with the :meth:`~tokenizers.Tokenizer.from_file`
class method:
.. code-block:: python
@ -119,7 +119,7 @@ Now that we have trained a tokenizer, we can use it on any text we want with the
This applied the full pipeline of the tokenizer on the text, returning an
:class:`~tokenizers.Encoding` object. To learn more about this pipeline, and how to apply (or
customize) parts of it, check out :doc:`this apge <pipeline>`.
customize) parts of it, check out :doc:`this page <pipeline>`.
This :class:`~tokenizers.Encoding` object then has all the attributes you need for your deep
learning model (or other). The :obj:`tokens` attribute contains the segmentation of your text in
@ -138,7 +138,7 @@ tokenizer's vocabulary:
print(output.ids)
# [27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35]
An important feature of the 🤗 Tokenizers library is that it comes with full alignmenbt tracking,
An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking,
meaning you can always get the part of your original sentence that corresponds to a given token.
Those are stored in the :obj:`offsets` attribute of our :class:`~tokenizers.Encoding` object. For
instance, let's say we want to find out what caused the :obj:`"[UNK]"` token to appear,
@ -149,7 +149,7 @@ which is the token at index 9 in the list, we can just ask for the offset at the
print(output.offsets[9])
# (26, 27)
and those are the indices that correspond to the smiler in the original sentence:
and those are the indices that correspond to the emoji in the original sentence:
.. code-block:: python
@ -183,7 +183,10 @@ Here is how we can set the post-processing to give us the traditional BERT input
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
special_tokens=[
("[CLS]", tokenizer.token_to_id("[CLS]")),
("[SEP]", tokenizer.token_to_id("[SEP]"))
],
)
Let's go over this snippet of code in more detail. First, we specify the template for single