Mirror of https://github.com/mii443/tokenizers.git (synced 2025-08-22 16:25:30 +00:00)
Doc - Quick updates and typos
@@ -4,6 +4,7 @@ Components
 When building a Tokenizer, you can attach various types of components to this Tokenizer in order
 to customize its behavior. This page lists most provided components.

+.. _normalizers:

 Normalizers
 ----------------------------------------------------------------------------------------------------
@@ -71,6 +72,8 @@ The ``Normalizer`` is optional.
         Sequence([Nmt(), NFKC()])


+.. _pre-tokenizers:
+
 Pre tokenizers
 ----------------------------------------------------------------------------------------------------

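For context, a minimal sketch of how a normalizer chain like the one above can be attached in the Python binding (an illustration, not part of the diff; the BPE model here is only a placeholder):

    from tokenizers import Tokenizer, normalizers
    from tokenizers.models import BPE
    from tokenizers.normalizers import Nmt, NFKC

    # Placeholder model; any model works, the point is attaching the normalizer chain.
    tokenizer = Tokenizer(BPE())
    tokenizer.normalizer = normalizers.Sequence([Nmt(), NFKC()])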
@@ -144,6 +147,8 @@ the ByteLevel)
 - ``Sequence([Punctuation(), WhitespaceSplit()])``


+.. _models:
+
 Models
 ----------------------------------------------------------------------------------------------------

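A similar sketch for the pre-tokenizer sequence listed above, again in the Python binding (the sample string and the output shown in the comment are illustrative assumptions):

    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Punctuation, WhitespaceSplit

    # Isolate punctuation first, then split on whitespace.
    pre_tok = pre_tokenizers.Sequence([Punctuation(), WhitespaceSplit()])
    print(pre_tok.pre_tokenize_str("Hey, friend!"))
    # Roughly: [('Hey', (0, 3)), (',', (3, 4)), ('friend', (5, 11)), ('!', (11, 12))]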
@@ -191,6 +196,8 @@ component of a Tokenizer.
 choosing the most probable one.


+.. _post-processors:
+
 PostProcessor
 ----------------------------------------------------------------------------------------------------

@@ -223,6 +230,8 @@ is the component doing just that.
 Output: ``"[CLS] I like this [SEP] but not this [SEP]"``


+.. _decoders:
+
 Decoders
 ----------------------------------------------------------------------------------------------------

@@ -37,60 +37,3 @@ Main features:
     :caption: API Reference

     api/reference
-
-Load an existing tokenizer:
-----------------------------------------------------------------------------------------------------
-
-Loading a previously saved tokenizer is extremely simple and requires a single line of code:
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
-        :language: python
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/load.test.js
-        :language: javascript
-        :start-after: START load
-        :end-before: END load
-        :dedent: 4
-
-
-Train a tokenizer:
-----------------------------------------------------------------------------------------------------
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
-        :language: python
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/train.test.js
-        :language: javascript
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
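The removed sections pulled their code from the test suites via ``literalinclude``; in the Python binding, the load operation they describe boils down to a one-liner along these lines (a sketch, not the included test content; the path is a placeholder):

    from tokenizers import Tokenizer

    # Placeholder path: any file previously produced by Tokenizer.save() works.
    tokenizer = Tokenizer.from_file("path/to/tokenizer.json")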
@@ -31,7 +31,7 @@ Normalization

 Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less
 random or "cleaner". Common operations include stripping whitespace, removing accented characters
-or lowercasing all text. If you're familiar with `unicode normalization
+or lowercasing all text. If you're familiar with `Unicode normalization
 <https://unicode.org/reports/tr15>`__, it is also a very common normalization operation applied
 in most tokenizers.

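As an illustration of such a normalization chain in the Python binding (a sketch; the input string and the particular combination of normalizers are assumptions, not part of the diff):

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # "hello how are u?"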
@@ -102,7 +102,7 @@ numbers in their individual digits:
     from tokenizers.pre_tokenizers import Digits

     pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
         Whitespace(),
         Digits(individual_digits=True),
     ])
     pre_tokenizer.pre_tokenize_str("Call 911!")
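For reference, the snippet above as a self-contained script, with the output it should produce shown as a comment (the offsets assume the exact string "Call 911!"):

    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Whitespace, Digits

    pre_tokenizer = pre_tokenizers.Sequence([
        Whitespace(),
        Digits(individual_digits=True),
    ])
    print(pre_tokenizer.pre_tokenize_str("Call 911!"))
    # [('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]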
@@ -128,16 +128,18 @@ Once the input texts are normalized and pre-tokenized, we can apply the model on
 This is the part of the pipeline that needs training on your corpus (or that has been trained if you
 are using a pretrained tokenizer).

-The role of the models is to split your "words" into tokens, using the rules it has learned. It's
+The role of the model is to split your "words" into tokens, using the rules it has learned. It's
 also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model.

 This model is passed along when intializing the :class:`~tokenizers.Tokenizer` so you already know
 how to customize this part. Currently, the 🤗 Tokenizers library supports:

-- :class:`~tokenizers.models.BPE` (Byte-Pair Encoding)
-- :class:`~tokenizers.models.Unigram` (for SentencePiece tokenizers)
-- :class:`~tokenizers.models.WordLevel` (for just returning the result of the pre-tokenization)
-- :class:`~tokenizers.models.WordPiece` (the classic BERT tokenizer)
+- :class:`~tokenizers.models.BPE`
+- :class:`~tokenizers.models.Unigram`
+- :class:`~tokenizers.models.WordLevel`
+- :class:`~tokenizers.models.WordPiece`

+For more details about each model and its behavior, you can check `here <components.html#models>`__
+

 .. _post-processing:
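To make the list above concrete, this is roughly how a model is chosen when building a tokenizer in Python (a sketch; BPE is shown, the other models are passed the same way):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # WordPiece, Unigram and WordLevel are passed to Tokenizer() in the same fashion.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))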
@@ -210,7 +212,10 @@ And the post-processing uses the template we saw in the previous section:
     bert_tokenizer.post_processor = TemplateProcessing(
         single="[CLS] $A [SEP]",
         pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+        special_tokens=[
+            ("[CLS]", bert_tokenizer.token_to_id("[CLS]")),
+            ("[SEP]", bert_tokenizer.token_to_id("[SEP]"))
+        ],
     )

 We can use this tokenizer and train on it on wikitext like in the :doc:`quicktour`:
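With a post-processor like the one above attached to a trained BERT-style tokenizer, encoding a pair should yield something along these lines (a sketch, not from the diff; the exact tokens depend on the trained vocabulary):

    output = bert_tokenizer.encode("I like this", "but not this")
    print(output.tokens)
    # e.g. ['[CLS]', 'i', 'like', 'this', '[SEP]', 'but', 'not', 'this', '[SEP]']
    print(output.type_ids)
    # [0, 0, 0, 0, 0, 1, 1, 1, 1]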
|
@@ -24,7 +24,7 @@ with:
 Training the tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenzier. For more information
+In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information
 about the different type of tokenizers, check out this `guide
 <https://huggingface.co/transformers/tokenizer_summary.html>`__ in the 🤗 Transformers
 documentation. Here, training the tokenizer means it will learn merge rules by:
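A condensed sketch of the training setup this section builds up to (file paths and special tokens are assumptions; the trainer-then-files argument order follows the snippet shown further down in this diff):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    # Placeholder paths to the raw wikitext-103 files.
    files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["train", "test", "valid"]]
    tokenizer.train(trainer, files)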
@@ -84,7 +84,7 @@ to use:
     tokenizer.train(trainer, files)

 This should only take a few seconds to train our tokenizer on the full wikitext dataset! Once this
-is done, we need to save the model and reinstantiate it with the unkown token, or this token won't
+is done, we need to save the model and reinstantiate it with the unknown token, or this token won't
 be used. This will be simplified in a further release, to let you set the :obj:`unk_token` when
 first instantiating the model.

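A sketch of what that save-and-reinstantiate step can look like (method names, folder and prefix here are assumptions about the Python API of that era; check the API reference for the release you use):

    from tokenizers.models import BPE

    # Save the vocab/merges files, then rebuild the model with the unknown token set.
    files = tokenizer.model.save("data", "wiki")
    tokenizer.model = BPE.from_file(*files, unk_token="[UNK]")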
@@ -100,7 +100,7 @@ To save the tokenizer in one file that contains all its configuration and vocabu

     tokenizer.save("pretrained/wiki.json")

-and you can reload your tokenzier from that file with the :meth:`~tokenizers.Tokenizer.from_file`
+and you can reload your tokenizer from that file with the :meth:`~tokenizers.Tokenizer.from_file`
 class method:

 .. code-block:: python
@@ -119,7 +119,7 @@ Now that we have trained a tokenizer, we can use it on any text we want with the

 This applied the full pipeline of the tokenizer on the text, returning an
 :class:`~tokenizers.Encoding` object. To learn more about this pipeline, and how to apply (or
-customize) parts of it, check out :doc:`this apge <pipeline>`.
+customize) parts of it, check out :doc:`this page <pipeline>`.

 This :class:`~tokenizers.Encoding` object then has all the attributes you need for your deep
 learning model (or other). The :obj:`tokens` attribute contains the segmentation of your text in
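For context, the surrounding quicktour builds to a call of this shape (a sketch; the sample sentence is assumed from the quicktour, and the tokens shown depend on the trained vocabulary):

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.tokens)
    # e.g. ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']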
@@ -138,7 +138,7 @@ tokenizer's vocabulary:
     print(output.ids)
     # [27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35]

-An important feature of the 🤗 Tokenizers library is that it comes with full alignmenbt tracking,
+An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking,
 meaning you can always get the part of your original sentence that corresponds to a given token.
 Those are stored in the :obj:`offsets` attribute of our :class:`~tokenizers.Encoding` object. For
 instance, let's assume we would want to find back what caused the :obj:`"[UNK]"` token to appear,
@@ -149,7 +149,7 @@ which is the token at index 9 in the list, we can just ask for the offset at the
     print(output.offsets[9])
     # (26, 27)

-and those are the indices that correspond to the smiler in the original sentence:
+and those are the indices that correspond to the emoji in the original sentence:

 .. code-block:: python

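Using those offsets to recover the original characters is then plain Python slicing (assuming the same example sentence as above):

    sentence = "Hello, y'all! How are you 😁 ?"
    start, end = output.offsets[9]
    print(sentence[start:end])
    # "😁"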
@@ -183,7 +183,10 @@ Here is how we can set the post-processing to give us the traditional BERT input
     tokenizer.post_processor = TemplateProcessing(
         single="[CLS] $A [SEP]",
         pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+        special_tokens=[
+            ("[CLS]", tokenizer.token_to_id("[CLS]")),
+            ("[SEP]", tokenizer.token_to_id("[SEP]"))
+        ],
     )

 Let's go over this snippet of code in more details. First we specify the template for single
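After attaching this post-processor, re-encoding a sentence should wrap it with the special tokens, along these lines (a sketch; the token values are an assumption, they depend on the trained vocabulary):

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.tokens)
    # e.g. ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']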