mirror of https://github.com/mii443/tokenizers.git, synced 2025-08-22 16:25:30 +00:00
Doc - Quick updates and typos
@@ -4,6 +4,7 @@ Components
When building a Tokenizer, you can attach various types of components to this Tokenizer in order
to customize its behavior. This page lists most provided components.

.. _normalizers:

Normalizers
----------------------------------------------------------------------------------------------------

@@ -71,6 +72,8 @@ The ``Normalizer`` is optional.
Sequence([Nmt(), NFKC()])


.. _pre-tokenizers:

Pre tokenizers
----------------------------------------------------------------------------------------------------

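For context on the ``Sequence([Nmt(), NFKC()])`` line above, a minimal sketch of composing normalizers with the Python bindings (current class names assumed) might look like:

.. code-block:: python

    from tokenizers import normalizers
    from tokenizers.normalizers import Nmt, NFKC

    # Chain several normalizers; they run left to right on the raw input
    normalizer = normalizers.Sequence([Nmt(), NFKC()])
    # normalize_str lets you inspect the effect on a plain string
    print(normalizer.normalize_str("ｔｏｋｅｎｉｚｅｒｓ…"))
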
@@ -144,6 +147,8 @@ the ByteLevel)
- ``Sequence([Punctuation(), WhitespaceSplit()])``


.. _models:

Models
----------------------------------------------------------------------------------------------------

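Similarly, the ``Sequence([Punctuation(), WhitespaceSplit()])`` pre-tokenizer mentioned above could be sketched in Python as (class names assumed from the current bindings):

.. code-block:: python

    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Punctuation, WhitespaceSplit

    pre_tokenizer = pre_tokenizers.Sequence([Punctuation(), WhitespaceSplit()])
    # Returns a list of (substring, (start, end)) pairs with offsets into the input
    print(pre_tokenizer.pre_tokenize_str("Hello, how are you?"))
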
@@ -191,6 +196,8 @@ component of a Tokenizer.
choosing the most probable one.


.. _post-processors:

PostProcessor
----------------------------------------------------------------------------------------------------

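As a reminder of how the model fits in, here is a hedged sketch of attaching a model when building a ``Tokenizer`` (``WordPiece`` chosen arbitrarily; a recent version of the Python bindings is assumed):

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece

    # The model is the one mandatory component: it is passed to the constructor
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    print(tokenizer.model)
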
@@ -223,6 +230,8 @@ is the component doing just that.
Output: ``"[CLS] I like this [SEP] but not this [SEP]"``


.. _decoders:

Decoders
----------------------------------------------------------------------------------------------------

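For the Decoders section introduced above, a small illustrative sketch (assuming the ``decode`` helper exposed by the Python bindings) could be:

.. code-block:: python

    from tokenizers import decoders

    # A WordPiece decoder merges sub-tokens back, stripping the "##" continuation prefix
    decoder = decoders.WordPiece(prefix="##")
    print(decoder.decode(["I", "like", "token", "##izers"]))
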
@@ -37,60 +37,3 @@ Main features:
    :caption: API Reference

    api/reference

-Load an existing tokenizer:
-----------------------------------------------------------------------------------------------------
-
-Loading a previously saved tokenizer is extremely simple and requires a single line of code:
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
-        :language: python
-        :start-after: START load_tokenizer
-        :end-before: END load_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/load.test.js
-        :language: javascript
-        :start-after: START load
-        :end-before: END load
-        :dedent: 4
-
-
-Train a tokenizer:
-----------------------------------------------------------------------------------------------------
-
-.. only:: rust
-
-    .. literalinclude:: ../../tokenizers/tests/documentation.rs
-        :language: rust
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: python
-
-    .. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
-        :language: python
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4
-
-.. only:: node
-
-    .. literalinclude:: ../../bindings/node/examples/train.test.js
-        :language: javascript
-        :start-after: START train_tokenizer
-        :end-before: END train_tokenizer
-        :dedent: 4

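The literalinclude directives removed above pulled their snippets from the repository's test files; for reference, a hedged Python sketch of loading and training a tokenizer (file paths are placeholders, and the ``train`` argument order has changed across versions) is:

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Load a previously saved tokenizer from its JSON file
    tokenizer = Tokenizer.from_file("tokenizer.json")

    # Or build a new one and train it from raw text files
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train(files=["wiki.train.raw"], trainer=trainer)
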
@@ -31,7 +31,7 @@ Normalization

Normalization is, in a nutshell, a set of operations you apply to a raw string to make it less
random or "cleaner". Common operations include stripping whitespace, removing accented characters
-or lowercasing all text. If you're familiar with `unicode normalization
+or lowercasing all text. If you're familiar with `Unicode normalization
<https://unicode.org/reports/tr15>`__, it is also a very common normalization operation applied
in most tokenizers.

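As an illustration of such normalization, one common composition (the example string mirrors the one used elsewhere in these docs; exact output depends on the normalizers chosen) is:

.. code-block:: python

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    # Decompose accents, strip them, then lowercase
    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # expected: "hello how are u?"
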
@@ -102,7 +102,7 @@ numbers in their individual digits:
from tokenizers.pre_tokenizers import Digits

pre_tokenizer = tokenizers.pre_tokenizers.Sequence([
    Whitespace(),
    Digits(individual_digits=True),
])
pre_tokenizer.pre_tokenize_str("Call 911!")

@@ -128,16 +128,18 @@ Once the input texts are normalized and pre-tokenized, we can apply the model on
This is the part of the pipeline that needs training on your corpus (or that has been trained if you
are using a pretrained tokenizer).

-The role of the models is to split your "words" into tokens, using the rules it has learned. It's
+The role of the model is to split your "words" into tokens, using the rules it has learned. It's
also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model.

This model is passed along when initializing the :class:`~tokenizers.Tokenizer` so you already know
how to customize this part. Currently, the 🤗 Tokenizers library supports:

-- :class:`~tokenizers.models.BPE` (Byte-Pair Encoding)
-- :class:`~tokenizers.models.Unigram` (for SentencePiece tokenizers)
-- :class:`~tokenizers.models.WordLevel` (for just returning the result of the pre-tokenization)
-- :class:`~tokenizers.models.WordPiece` (the classic BERT tokenizer)
+- :class:`~tokenizers.models.BPE`
+- :class:`~tokenizers.models.Unigram`
+- :class:`~tokenizers.models.WordLevel`
+- :class:`~tokenizers.models.WordPiece`

+For more details about each model and its behavior, you can check `here <components.html#models>`__


.. _post-processing:

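For reference, the four model classes listed above can be instantiated roughly as follows (constructor arguments are simplified, and a recent release is assumed where vocabularies are optional at construction time):

.. code-block:: python

    from tokenizers.models import BPE, Unigram, WordLevel, WordPiece

    # Each model starts empty here and would normally be trained or loaded from files
    bpe = BPE(unk_token="[UNK]")
    unigram = Unigram()
    wordlevel = WordLevel(unk_token="[UNK]")
    wordpiece = WordPiece(unk_token="[UNK]")
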
@@ -210,7 +212,10 @@ And the post-processing uses the template we saw in the previous section:
    bert_tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-       special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+       special_tokens=[
+           ("[CLS]", bert_tokenizer.token_to_id("[CLS]")),
+           ("[SEP]", bert_tokenizer.token_to_id("[SEP]"))
+       ],
    )

We can use this tokenizer and train it on wikitext like in the :doc:`quicktour`:

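The point of the change above is to look the ids up instead of hard-coding them; a quick sanity check in the same spirit:

.. code-block:: python

    # The actual ids depend on the trained vocabulary, hence token_to_id
    print(bert_tokenizer.token_to_id("[CLS]"), bert_tokenizer.token_to_id("[SEP]"))
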
@@ -24,7 +24,7 @@ with:
Training the tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenzier. For more information
+In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information
about the different types of tokenizers, check out this `guide
<https://huggingface.co/transformers/tokenizer_summary.html>`__ in the 🤗 Transformers
documentation. Here, training the tokenizer means it will learn merge rules by:

@@ -84,7 +84,7 @@ to use:
    tokenizer.train(trainer, files)

This should only take a few seconds to train our tokenizer on the full wikitext dataset! Once this
-is done, we need to save the model and reinstantiate it with the unkown token, or this token won't
+is done, we need to save the model and reinstantiate it with the unknown token, or this token won't
be used. This will be simplified in a further release, to let you set the :obj:`unk_token` when
first instantiating the model.

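One way the save-and-reinstantiate step described above was typically written (folder and file prefix are placeholders; ``BPE.from_file`` is assumed to be available in the version at hand):

.. code-block:: python

    from tokenizers.models import BPE

    # Save only the model files (vocab + merges), then reload them with the unk token set
    files = tokenizer.model.save("pretrained", "wiki")
    tokenizer.model = BPE.from_file(*files, unk_token="[UNK]")
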
@@ -100,7 +100,7 @@ To save the tokenizer in one file that contains all its configuration and vocabu

    tokenizer.save("pretrained/wiki.json")

-and you can reload your tokenzier from that file with the :meth:`~tokenizers.Tokenizer.from_file`
+and you can reload your tokenizer from that file with the :meth:`~tokenizers.Tokenizer.from_file`
class method:

.. code-block:: python

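In other words, saving and reloading round-trips through a single JSON file; a minimal sketch:

.. code-block:: python

    from tokenizers import Tokenizer

    tokenizer.save("pretrained/wiki.json")
    tokenizer = Tokenizer.from_file("pretrained/wiki.json")
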
@@ -119,7 +119,7 @@ Now that we have trained a tokenizer, we can use it on any text we want with the

This applied the full pipeline of the tokenizer on the text, returning an
:class:`~tokenizers.Encoding` object. To learn more about this pipeline, and how to apply (or
-customize) parts of it, check out :doc:`this apge <pipeline>`.
+customize) parts of it, check out :doc:`this page <pipeline>`.

This :class:`~tokenizers.Encoding` object then has all the attributes you need for your deep
learning model (or other). The :obj:`tokens` attribute contains the segmentation of your text in

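For completeness, the surrounding quicktour obtains that :class:`~tokenizers.Encoding` with a call along these lines (the sentence is the one used in the quicktour; exact tokens and ids depend on the trained vocabulary):

.. code-block:: python

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.tokens)  # tokens produced by the full pipeline
    print(output.ids)     # their ids in the tokenizer's vocabulary
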
@@ -138,7 +138,7 @@ tokenizer's vocabulary:
    print(output.ids)
    # [27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35]

-An important feature of the 🤗 Tokenizers library is that it comes with full alignmenbt tracking,
+An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking,
meaning you can always get the part of your original sentence that corresponds to a given token.
Those are stored in the :obj:`offsets` attribute of our :class:`~tokenizers.Encoding` object. For
instance, let's assume we would want to find back what caused the :obj:`"[UNK]"` token to appear,

@@ -149,7 +149,7 @@ which is the token at index 9 in the list, we can just ask for the offset at the
    print(output.offsets[9])
    # (26, 27)

-and those are the indices that correspond to the smiler in the original sentence:
+and those are the indices that correspond to the emoji in the original sentence:

.. code-block:: python

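The code block referenced above slices the original sentence with those offsets; a sketch of that step (sentence taken from the quicktour):

.. code-block:: python

    sentence = "Hello, y'all! How are you 😁 ?"
    start, end = output.offsets[9]
    print(sentence[start:end])  # should give back the character behind the "[UNK]" token
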
@@ -183,7 +183,10 @@ Here is how we can set the post-processing to give us the traditional BERT input
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
-       special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
+       special_tokens=[
+           ("[CLS]", tokenizer.token_to_id("[CLS]")),
+           ("[SEP]", tokenizer.token_to_id("[SEP]"))
+       ],
    )

Let's go over this snippet of code in more detail. First we specify the template for single

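After setting this post-processor, encoding a pair of sentences should show the special tokens and type ids in action; a hedged check (sentences from the quicktour, exact tokens depend on the vocabulary):

.. code-block:: python

    output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
    print(output.tokens)    # starts with "[CLS]", each segment ending with "[SEP]"
    print(output.type_ids)  # 0 for the first sentence, 1 for the second
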