# Quicktour
Let's have a quick look at the 🤗 Tokenizers library features. The
library provides an implementation of today's most used tokenizers that
is both easy to use and blazing fast.

## Build a tokenizer from scratch

To illustrate how fast the 🤗 Tokenizers library is, let's train a new
tokenizer on [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
(516M of text) in just a few seconds. First things first, you will need
to download this dataset and unzip it with:

``` bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```

### Training the tokenizer

In this tour, we will build and train a Byte-Pair Encoding (BPE)
tokenizer. For more information about the different types of tokenizers,
check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
the 🤗 Transformers documentation. Here, training the tokenizer means it
will learn merge rules by:

- Starting with all the characters present in the training corpus as
  tokens.
- Identifying the most common pair of tokens and merging it into one token.
- Repeating until the vocabulary (i.e., the number of tokens) has reached
  the size we want.

The main API of the library is the `Tokenizer` class; here is how
we instantiate one with a BPE model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_tokenizer",
"end-before": "END quicktour_init_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
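If you are following along in Python, the included snippet amounts to roughly the following sketch (the file referenced above is authoritative):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# A Tokenizer wrapping a BPE model; "[UNK]" is the token used for
# anything that falls outside the learned vocabulary.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```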
To train our tokenizer on the wikitext files, we will need to
instantiate a *trainer*, in this case a `BpeTrainer`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_trainer",
"end-before": "END quicktour_init_trainer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
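In Python, the trainer setup looks roughly like this (the special-token list is the one discussed below):

```python
from tokenizers.trainers import BpeTrainer

# The special tokens are listed in the order we want their IDs assigned.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
```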
We can set training arguments like `vocab_size` or `min_frequency` (here
left at their default values of 30,000 and 0), but the most important
part is to give the `special_tokens` we plan to use later on (they are
not used at all during training) so that they get inserted in the
vocabulary.

<Tip>

The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
`"[CLS]"` will get the ID 1 and so forth.

</Tip>

We could train our tokenizer right now, but it wouldn't be optimal.
Without a pre-tokenizer that splits our inputs into words, we might get
tokens that overlap several words: for instance, we could get an
`"it is"` token since those two words often appear next to each other.
Using a pre-tokenizer ensures no token is bigger than a word returned by
the pre-tokenizer. Here we want to train a subword BPE tokenizer, and we
will use the simplest pre-tokenizer possible by splitting on whitespace.

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_pretok",
"end-before": "END quicktour_init_pretok",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
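A minimal Python sketch of this step:

```python
from tokenizers.pre_tokenizers import Whitespace

# Split the raw text on whitespace (and punctuation) before the model sees it.
tokenizer.pre_tokenizer = Whitespace()
```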
Now, we can just call the `Tokenizer.train` method with any list of files we want to use:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_train",
"end-before": "END quicktour_train",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
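Roughly, in Python — the file paths here are illustrative and assume you unzipped the dataset into `data/`:

```python
# Train on the three wikitext-103 splits; adjust the paths to wherever you unzipped the data.
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
```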
This should only take a few seconds to train our tokenizer on the full
wikitext dataset! To save the tokenizer in one file that contains all
its configuration and vocabulary, just use the `Tokenizer.save` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_save",
"end-before": "END quicktour_save",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
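For example, in Python (the output path is just an illustration):

```python
# A single JSON file holds the whole tokenizer: model, vocabulary and configuration.
tokenizer.save("data/tokenizer-wiki.json")
```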
and you can reload your tokenizer from that file with the
`Tokenizer.from_file` classmethod:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 12}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_reload_tokenizer",
"end-before": "END quicktour_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
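Reloading is then a one-liner in Python (same illustrative path as above):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
```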
### Using the tokenizer

Now that we have trained a tokenizer, we can use it on any text we want
with the `Tokenizer.encode` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode",
"end-before": "END quicktour_encode",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
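In Python, this looks something like the following; the sample sentence is the one the rest of this tour refers to:

```python
# The emoji is unlikely to be in the trained vocabulary, so it should come back as "[UNK]".
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
```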
This applies the full pipeline of the tokenizer to the text, returning
an `Encoding` object. To learn more about this pipeline, and how to
apply (or customize) parts of it, check out [this page](pipeline).

This `Encoding` object has all the attributes you need for your deep
learning model (or anything else). The `tokens` attribute contains the
segmentation of your text into tokens:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_tokens",
"end-before": "END quicktour_print_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
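For instance, in Python (the exact token list depends on the training run):

```python
print(output.tokens)
# Something like: ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
```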
Similarly, the `ids` attribute will contain the index of each of those
tokens in the tokenizer's vocabulary:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_ids",
"end-before": "END quicktour_print_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
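Similarly, in Python:

```python
print(output.ids)
# One integer per token above; "[UNK]" maps to 0 since it was the first special token.
```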
An important feature of the 🤗 Tokenizers library is that it comes with
full alignment tracking, meaning you can always get the part of your
original sentence that corresponds to a given token. Those offsets are
stored in the `offsets` attribute of our `Encoding` object. For
instance, let's say we want to find out what caused the `"[UNK]"` token
to appear. Since it is the token at index 9 in the list, we can just ask
for the offset at that index:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_offsets",
"end-before": "END quicktour_print_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
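In Python, something like:

```python
# Offsets are (start, end) character positions into the original input.
print(output.offsets[9])
```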
and those are the indices that correspond to the emoji in the original
sentence:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_use_offsets",
"end-before": "END quicktour_use_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
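A short Python sketch, assuming the sample sentence used above:

```python
sentence = "Hello, y'all! How are you 😁 ?"
start, end = output.offsets[9]
print(sentence[start:end])  # should print the emoji
```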
### Post-processing

We might want our tokenizer to automatically add special tokens, like
`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
`TemplateProcessing` is the most commonly used: you just have to specify
a template for the processing of single sentences and pairs of
sentences, along with the special tokens and their IDs.

When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` at positions
1 and 2 of our list of special tokens, so these should be their IDs. To
double-check, we can use the `Tokenizer.token_to_id` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_check_sep",
"end-before": "END quicktour_check_sep",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
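In Python, for instance:

```python
print(tokenizer.token_to_id("[SEP]"))  # expected: 2
```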
Here is how we can set the post-processing to give us the traditional
BERT inputs:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_template_processing",
"end-before": "END quicktour_init_template_processing",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
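In Python, the post-processor setup looks roughly like this (IDs 1 and 2 are the ones we just checked):

```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```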
Let's go over this snippet of code in more detail. First we specify the
template for single sentences: those should have the form
`"[CLS] $A [SEP]"` where `$A` represents our sentence.

Then, we specify the template for sentence pairs, which should have the
form `"[CLS] $A [SEP] $B:1 [SEP]:1"` where `$A` represents the first
sentence and `$B` the second one. The `:1` markers added in the template
represent the type IDs we want for each part of our input: the type ID
defaults to 0 for everything (which is why we don't have `$A:0`), and
here we set it to 1 for the tokens of the second sentence and the last
`"[SEP]"` token.

Lastly, we specify the special tokens we used and their IDs in our
tokenizer's vocabulary.

To check that this worked properly, let's try to encode the same
sentence as before:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens",
"end-before": "END quicktour_print_special_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
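For example, in Python, the same sentence should now come back wrapped in the special tokens (the exact subword tokens depend on training):

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# e.g. ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']
```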
To check the results on a pair of sentences, we just pass the two
sentences to `Tokenizer.encode`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens_pair",
"end-before": "END quicktour_print_special_tokens_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
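Roughly, in Python (the second argument is the second sentence of the pair):

```python
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)
# e.g. ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']
```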
You can then check that the type IDs attributed to each token are
correct with:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_type_ids",
"end-before": "END quicktour_print_type_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
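For instance, in Python (the exact length depends on the tokenization of the pair above):

```python
print(output.type_ids)
# 0 for the first sentence and its surrounding special tokens, 1 for the rest,
# e.g. [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```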
If you save your tokenizer with `Tokenizer.save`, the post-processor
will be saved along with it.

### Encoding multiple sentences in a batch

To get the full speed of the 🤗 Tokenizers library, it's best to process
your texts in batches, using the `Tokenizer.encode_batch` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch",
"end-before": "END quicktour_encode_batch",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
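In Python, for example:

```python
# One Encoding is returned per input text.
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
```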
The output is then a list of `Encoding` objects like the ones we saw
before. You can process together as many texts as you like, as long as
they fit in memory.

To process a batch of sentence pairs, pass two lists to the
`Tokenizer.encode_batch` method: the list of sentences A and the list of
sentences B:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch_pair",
"end-before": "END quicktour_encode_batch_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
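With the Python binding, each element of the batch is an A/B pair, for example:

```python
output = tokenizer.encode_batch(
    [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
)
```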
When encoding multiple sentences, you can automatically pad the outputs
to the longest sentence present by using `Tokenizer.enable_padding`,
passing the `pad_token` and its ID (we can double-check the ID of the
padding token with `Tokenizer.token_to_id`, like before):

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_enable_padding",
"end-before": "END quicktour_enable_padding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
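A minimal Python sketch of this step:

```python
# "[PAD]" was the fourth special token we registered, so its ID should be 3.
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
```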
We can set the `direction` of the padding (defaults to the right) or a
given `length` if we want to pad every sample to that specific number
(here we leave it unset to pad to the size of the longest text).

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_batch_tokens",
"end-before": "END quicktour_print_batch_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
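For example, in Python, the shorter sentence of the batch should now end with the padding token:

```python
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# e.g. ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]', '[PAD]']
```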
In this case, the attention mask generated by the tokenizer takes the
padding into account:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_attention_mask",
"end-before": "END quicktour_print_attention_mask",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
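And in Python, for instance:

```python
print(output[1].attention_mask)
# e.g. [1, 1, 1, 1, 1, 1, 1, 0]  (0 marks the padding position)
```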
## Pretrained

<tokenizerslangcontent>
<python>
### Using a pretrained tokenizer

You can load any tokenizer from the Hugging Face Hub as long as a
`tokenizer.json` file is available in the repository.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```

### Importing a pretrained tokenizer from legacy vocabulary files

You can also import a pretrained tokenizer directly into the library, as
long as you have its vocabulary file. For instance, here is how to
import the classic pretrained BERT tokenizer:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
```

as long as you have downloaded the file `bert-base-uncased-vocab.txt` with

```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
```
</python>
</tokenizerslangcontent>