# The tokenization pipeline

When calling `Tokenizer.encode` or `Tokenizer.encode_batch`, the input
text(s) go through the following pipeline:

- `normalization`
- `pre-tokenization`
- `model`
- `post-processing`

We'll see in detail what happens during each of those steps, as well as
when you want to [decode](#decoding) some token ids, and how the
🤗 Tokenizers library allows you to customize each of those steps to
your needs. If you're already familiar with those steps and want to
learn by seeing some code, jump to
[our BERT from scratch example](#all-together-a-bert-tokenizer-from-scratch).

For the examples that require a `Tokenizer`, we will use the tokenizer
we trained in the `quicktour`, which you can load with:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_reload_tokenizer",
"end-before": "END pipeline_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
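
If the included snippet is not rendered in your build, the Python version boils down to loading the saved tokenizer file. A minimal sketch, where the path `data/tokenizer-wiki.json` is only an assumed example of wherever you saved the tokenizer in the quicktour:

```python
from tokenizers import Tokenizer

# Load the tokenizer trained in the quicktour; adjust the path to wherever
# you saved it (this exact filename is an assumption for illustration).
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
```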

## Normalization

Normalization is, in a nutshell, a set of operations you apply to a raw
string to make it less random or "cleaner". Common operations include
stripping whitespace, removing accented characters or lowercasing all
text. If you're familiar with [Unicode
normalization](https://unicode.org/reports/tr15), it is also a very
common normalization operation applied in most tokenizers.

Each normalization operation is represented in the 🤗 Tokenizers library
by a `Normalizer`, and you can combine several of those by using a
`normalizers.Sequence`. Here is a normalizer applying NFD Unicode
normalization and removing accents as an example:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_normalizer",
"end-before": "END pipeline_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

You can manually test that normalizer by applying it to any string:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_normalizer",
"end-before": "END pipeline_test_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
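
In Python, building that normalizer and testing it manually is a few lines; the input string below is just an arbitrary example:

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

# NFD decomposes accented characters, StripAccents then drops the accent marks.
normalizer = normalizers.Sequence([NFD(), StripAccents()])

print(normalizer.normalize_str("Héllò hôw are ü?"))
# "Hello how are u?"
```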

When building a `Tokenizer`, you can customize its normalizer by just
changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_normalizer",
"end-before": "END pipeline_replace_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
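
In the Python bindings this is a plain attribute assignment; a minimal sketch, where the tokenizer path is again only an assumed example:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.normalizers import NFD, StripAccents

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")  # assumed path
# Swapping the normalizer is just setting the attribute.
tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
```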

Of course, if you change the way a tokenizer applies normalization, you
should probably retrain it from scratch afterward.

## Pre-Tokenization

Pre-tokenization is the act of splitting a text into smaller objects
that give an upper bound to what your tokens will be at the end of
training. A good way to think of this is that the pre-tokenizer will
split your text into "words", and then your final tokens will be parts
of those words.

An easy way to pre-tokenize inputs is to split on spaces and
punctuation, which is done by the `pre_tokenizers.Whitespace`
pre-tokenizer:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_pre_tokenizer",
"end-before": "END pipeline_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
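
In Python you can try the pre-tokenizer directly with `pre_tokenize_str`; the sentence below is an arbitrary example, and the values in the comment show roughly what the output format looks like:

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
# Each tuple is (word, (start, end)) with offsets into the original string.
# Note how the contraction "I'm" gets split on the apostrophe:
print(pre_tokenizer.pre_tokenize_str("Hello, I'm fine."))
# [('Hello', (0, 5)), (',', (5, 6)), ('I', (7, 8)), ("'", (8, 9)),
#  ('m', (9, 10)), ('fine', (11, 15)), ('.', (15, 16))]
```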

The output is a list of tuples, with each tuple containing one word and
its span in the original sentence (which is used to determine the final
`offsets` of our `Encoding`). Note that splitting on punctuation will
split contractions like `"I'm"` in this example.

You can combine any `PreTokenizer`s together. For instance, here is a
pre-tokenizer that will split on space, punctuation and digits,
separating numbers into their individual digits:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_combine_pre_tokenizer",
"end-before": "END pipeline_combine_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
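
A rough Python sketch of such a combination, chaining `Whitespace` with `Digits` (the input sentence and the commented output are illustrative):

```python
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits, Whitespace

# Whitespace splits on spaces and punctuation, then Digits splits every
# number into its individual digits (individual_digits=True).
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])

print(pre_tokenizer.pre_tokenize_str("Call 911!"))
# [('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]
```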

As we saw in the `quicktour`, you can customize the pre-tokenizer of a
`Tokenizer` by just changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_pre_tokenizer",
"end-before": "END pipeline_replace_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Of course, if you change the pre-tokenizer, you should probably retrain
your tokenizer from scratch afterward.
## Model

Once the input texts are normalized and pre-tokenized, the `Tokenizer`
applies the model on the pre-tokens. This is the part of the pipeline
that needs training on your corpus (or that has been trained if you are
using a pretrained tokenizer).

The role of the model is to split your "words" into tokens, using the
rules it has learned. It's also responsible for mapping those tokens to
their corresponding IDs in the vocabulary of the model.

This model is passed along when initializing the `Tokenizer`, so you
already know how to customize this part. Currently, the 🤗 Tokenizers
library supports:

- `models.BPE`
- `models.Unigram`
- `models.WordLevel`
- `models.WordPiece`

For more details about each model and its behavior, you can check
[here](components.html#models).
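
For instance, picking a model is just the constructor argument of `Tokenizer`. A minimal Python sketch (the `"[UNK]"` value is the usual convention, not something imposed by the library):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# The model is chosen when the Tokenizer is created; here a BPE model with an
# unknown token (swap in Unigram, WordLevel or WordPiece as needed).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```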

## Post-Processing

Post-processing is the last step of the tokenization pipeline: it
performs any additional transformation on the `Encoding` before it's
returned, like adding potential special tokens.

As we saw in the quick tour, we can customize the post-processor of a
`Tokenizer` by setting the corresponding attribute. For instance, here
is how we can post-process to make the inputs suitable for the BERT
model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_processor",
"end-before": "END pipeline_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
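
In Python this is typically a `TemplateProcessing` post-processor. The sketch below assumes `"[CLS]"` and `"[SEP]"` have IDs 1 and 2 in your vocabulary; for your own tokenizer, look them up with `tokenizer.token_to_id`:

```python
from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")  # assumed path from the quicktour

# $A is the first sequence and $B the second; ":1" sets the type id of those pieces to 1.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```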

Note that contrary to the pre-tokenizer or the normalizer, you don't
need to retrain a tokenizer after changing its post-processor.

## All together: a BERT tokenizer from scratch

Let's put all those pieces together to build a BERT tokenizer. First,
BERT relies on WordPiece, so we instantiate a new `Tokenizer` with this
model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Then we know that BERT preprocesses texts by removing accents and
lowercasing. We also use a Unicode normalizer:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

The pre-tokenizer is just splitting on whitespace and punctuation:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

And the post-processing uses the template we saw in the previous
section:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can use this tokenizer and train it on wikitext like in the
`quicktour` (a consolidated Python sketch of all these steps follows the
snippet below):

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
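
Here is that consolidated sketch in Python, putting the model, normalizer, pre-tokenizer, post-processor and trainer together. The corpus file names and the vocabulary size are assumptions standing in for your own data:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

# Model: BERT uses WordPiece.
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization: NFD Unicode normalization, lowercasing, accent stripping.
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Pre-tokenization: split on whitespace and punctuation.
bert_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Post-processing: the BERT template; [CLS]=1 and [SEP]=2 assuming they are
# registered in that order by the trainer below.
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

# Training on a corpus (paths and vocab_size are illustrative assumptions).
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("bert-wiki.json")
```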

## Decoding

On top of encoding the input texts, a `Tokenizer` also has an API for
decoding, that is, converting IDs generated by your model back into
text. This is done by the methods `Tokenizer.decode` (for one predicted
text) and `Tokenizer.decode_batch` (for a batch of predictions).

The decoder will first convert the IDs back to tokens (using the
tokenizer's vocabulary) and remove all special tokens, then join those
tokens with spaces:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_decoding",
"end-before": "END pipeline_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
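
A quick Python illustration (the tokenizer path and the sentence are arbitrary examples):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")  # assumed path from the quicktour

# Encode a sentence, then turn the resulting IDs back into text.
output = tokenizer.encode("Hello, y'all! How are you?")
print(tokenizer.decode(output.ids))

# decode skips special tokens by default; pass skip_special_tokens=False to keep them.
print(tokenizer.decode(output.ids, skip_special_tokens=False))
```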

If you used a model that added special characters to represent subtokens
of a given "word" (like the `"##"` in WordPiece), you will need to
customize the decoder to treat them properly. If we take our previous
`bert_tokenizer` for instance, the default decoding will give:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

But by changing it to a proper decoder, we get:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
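
In Python, switching to a WordPiece-aware decoder is a one-line sketch, reusing the `bert_tokenizer` built in the consolidated example above:

```python
from tokenizers import decoders

# The WordPiece decoder merges "##"-prefixed sub-tokens back into whole words.
bert_tokenizer.decoder = decoders.WordPiece()
```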