Init new docs
docs/source-doc-builder/_toctree.yml
@@ -0,0 +1,40 @@
- sections:
  - local: index
    title: 🤗 Tokenizers
  - local: quicktour
    title: Quicktour
  - local: installation
    title: Installation
  - local: pipeline
    title: The tokenization pipeline
  - local: components
    title: Components
  - local: training_from_memory
    title: Training from memory
  title: Getting started
- sections:
  - local: api/input-sequences
    title: Input Sequences
  - local: api/encode-inputs
    title: Encode Inputs
  - local: api/tokenizer
    title: Tokenizer
  - local: api/encoding
    title: Encoding
  - local: api/added-tokens
    title: Added Tokens
  - local: api/models
    title: Models
  - local: api/normalizers
    title: Normalizers
  - local: api/pre-tokenizers
    title: Pre-tokenizers
  - local: api/post-processors
    title: Post-processors
  - local: api/trainers
    title: Trainers
  - local: api/decoders
    title: Decoders
  - local: api/visualizer
    title: Visualizer
  title: API
docs/source-doc-builder/api/added-tokens.mdx
@@ -0,0 +1,15 @@
# Added Tokens

<tokenizerslangcontent>
<python>
## AddedToken[[tokenizers.AddedToken]]

[[autodoc]] tokenizers.AddedToken
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/decoders.mdx
@@ -0,0 +1,31 @@
# Decoders

<tokenizerslangcontent>
<python>
## BPEDecoder[[tokenizers.decoders.BPEDecoder]]

[[autodoc]] tokenizers.decoders.BPEDecoder

## ByteLevel[[tokenizers.decoders.ByteLevel]]

[[autodoc]] tokenizers.decoders.ByteLevel

## CTC[[tokenizers.decoders.CTC]]

[[autodoc]] tokenizers.decoders.CTC

## Metaspace[[tokenizers.decoders.Metaspace]]

[[autodoc]] tokenizers.decoders.Metaspace

## WordPiece[[tokenizers.decoders.WordPiece]]

[[autodoc]] tokenizers.decoders.WordPiece
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/encode-inputs.mdx
@@ -0,0 +1,48 @@
# Encode Inputs

<tokenizerslangcontent>
<python>
These types represent all the different kinds of input that a [`~tokenizers.Tokenizer`] accepts
when using [`~tokenizers.Tokenizer.encode_batch`].

## TextEncodeInput[[[[tokenizers.TextEncodeInput]]]]

<code>tokenizers.TextEncodeInput</code>

Represents a textual input for encoding. Can be either:
- A single sequence: [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence)
- A pair of sequences:
  - A Tuple of [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence)
  - Or a List of [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence) of size 2

alias of `Union[str, Tuple[str, str], List[str]]`.

## PreTokenizedEncodeInput[[[[tokenizers.PreTokenizedEncodeInput]]]]

<code>tokenizers.PreTokenizedEncodeInput</code>

Represents a pre-tokenized input for encoding. Can be either:
- A single sequence: [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence)
- A pair of sequences:
  - A Tuple of [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence)
  - Or a List of [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence) of size 2

alias of `Union[List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]`.

## EncodeInput[[[[tokenizers.EncodeInput]]]]

<code>tokenizers.EncodeInput</code>

Represents all the possible types of input for encoding. Can be:
- When `is_pretokenized=False`: [TextEncodeInput](#tokenizers.TextEncodeInput)
- When `is_pretokenized=True`: [PreTokenizedEncodeInput](#tokenizers.PreTokenizedEncodeInput)

alias of `Union[str, Tuple[str, str], List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]`.
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
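
To make these aliases concrete, here is a minimal Python sketch (an illustration only; the `tokenizer.json` path is a placeholder) of the shapes `encode_batch` accepts:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# TextEncodeInput: each item is either a single sequence or a pair of sequences.
tokenizer.encode_batch(["Hello there!", "How are you?"])   # List[str]
tokenizer.encode_batch([("Question?", "Answer.")])         # a pair as Tuple[str, str]

# PreTokenizedEncodeInput: the same shapes, but each sequence is already
# split into "words", so is_pretokenized=True must be set.
tokenizer.encode_batch(
    [["Hello", "there", "!"], ["How", "are", "you", "?"]],
    is_pretokenized=True,
)
```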
docs/source-doc-builder/api/encoding.mdx
@@ -0,0 +1,15 @@
# Encoding

<tokenizerslangcontent>
<python>
## Encoding[[tokenizers.Encoding]]

[[autodoc]] tokenizers.Encoding
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/input-sequences.mdx
@@ -0,0 +1,41 @@
# Input Sequences

<tokenizerslangcontent>
<python>
These types represent all the different kinds of sequence that can be used as input of a Tokenizer.
Globally, any sequence can be either a string or a list of strings, according to the operating
mode of the tokenizer: `raw text` vs `pre-tokenized`.

## TextInputSequence[[tokenizers.TextInputSequence]]

<code>tokenizers.TextInputSequence</code>

A `str` that represents an input sequence

## PreTokenizedInputSequence[[tokenizers.PreTokenizedInputSequence]]

<code>tokenizers.PreTokenizedInputSequence</code>

A pre-tokenized input sequence. Can be one of:
- A `List` of `str`
- A `Tuple` of `str`

alias of `Union[List[str], Tuple[str]]`.

## InputSequence[[tokenizers.InputSequence]]

<code>tokenizers.InputSequence</code>

Represents all the possible types of input sequences for encoding. Can be:
- When `is_pretokenized=False`: [TextInputSequence](#tokenizers.TextInputSequence)
- When `is_pretokenized=True`: [PreTokenizedInputSequence](#tokenizers.PreTokenizedInputSequence)

alias of `Union[str, List[str], Tuple[str]]`.
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
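
As a minimal illustration (the tokenizer file below is a placeholder), the same sentence can be passed either as raw text or in pre-tokenized form:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# TextInputSequence: a plain string, the tokenizer runs its own pre-tokenization.
encoding = tokenizer.encode("Hello, y'all!")

# PreTokenizedInputSequence: the "words" were already split by your own pipeline.
encoding = tokenizer.encode(["Hello", ",", "y'all", "!"], is_pretokenized=True)

print(encoding.tokens)
```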
docs/source-doc-builder/api/models.mdx
@@ -0,0 +1,31 @@
# Models

<tokenizerslangcontent>
<python>
## BPE[[tokenizers.models.BPE]]

[[autodoc]] tokenizers.models.BPE

## Model[[tokenizers.models.Model]]

[[autodoc]] tokenizers.models.Model

## Unigram[[tokenizers.models.Unigram]]

[[autodoc]] tokenizers.models.Unigram

## WordLevel[[tokenizers.models.WordLevel]]

[[autodoc]] tokenizers.models.WordLevel

## WordPiece[[tokenizers.models.WordPiece]]

[[autodoc]] tokenizers.models.WordPiece
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/normalizers.mdx
@@ -0,0 +1,63 @@
# Normalizers

<tokenizerslangcontent>
<python>
## BertNormalizer[[tokenizers.normalizers.BertNormalizer]]

[[autodoc]] tokenizers.normalizers.BertNormalizer

## Lowercase[[tokenizers.normalizers.Lowercase]]

[[autodoc]] tokenizers.normalizers.Lowercase

## NFC[[tokenizers.normalizers.NFC]]

[[autodoc]] tokenizers.normalizers.NFC

## NFD[[tokenizers.normalizers.NFD]]

[[autodoc]] tokenizers.normalizers.NFD

## NFKC[[tokenizers.normalizers.NFKC]]

[[autodoc]] tokenizers.normalizers.NFKC

## NFKD[[tokenizers.normalizers.NFKD]]

[[autodoc]] tokenizers.normalizers.NFKD

## Nmt[[tokenizers.normalizers.Nmt]]

[[autodoc]] tokenizers.normalizers.Nmt

## Normalizer[[tokenizers.normalizers.Normalizer]]

[[autodoc]] tokenizers.normalizers.Normalizer

## Precompiled[[tokenizers.normalizers.Precompiled]]

[[autodoc]] tokenizers.normalizers.Precompiled

## Replace[[tokenizers.normalizers.Replace]]

[[autodoc]] tokenizers.normalizers.Replace

## Sequence[[tokenizers.normalizers.Sequence]]

[[autodoc]] tokenizers.normalizers.Sequence

## Strip[[tokenizers.normalizers.Strip]]

[[autodoc]] tokenizers.normalizers.Strip

## StripAccents[[tokenizers.normalizers.StripAccents]]

[[autodoc]] tokenizers.normalizers.StripAccents
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/post-processors.mdx
@@ -0,0 +1,27 @@
# Post-processors

<tokenizerslangcontent>
<python>
## BertProcessing[[tokenizers.processors.BertProcessing]]

[[autodoc]] tokenizers.processors.BertProcessing

## ByteLevel[[tokenizers.processors.ByteLevel]]

[[autodoc]] tokenizers.processors.ByteLevel

## RobertaProcessing[[tokenizers.processors.RobertaProcessing]]

[[autodoc]] tokenizers.processors.RobertaProcessing

## TemplateProcessing[[tokenizers.processors.TemplateProcessing]]

[[autodoc]] tokenizers.processors.TemplateProcessing
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/pre-tokenizers.mdx
@@ -0,0 +1,59 @@
# Pre-tokenizers

<tokenizerslangcontent>
<python>
## BertPreTokenizer[[tokenizers.pre_tokenizers.BertPreTokenizer]]

[[autodoc]] tokenizers.pre_tokenizers.BertPreTokenizer

## ByteLevel[[tokenizers.pre_tokenizers.ByteLevel]]

[[autodoc]] tokenizers.pre_tokenizers.ByteLevel

## CharDelimiterSplit[[tokenizers.pre_tokenizers.CharDelimiterSplit]]

[[autodoc]] tokenizers.pre_tokenizers.CharDelimiterSplit

## Digits[[tokenizers.pre_tokenizers.Digits]]

[[autodoc]] tokenizers.pre_tokenizers.Digits

## Metaspace[[tokenizers.pre_tokenizers.Metaspace]]

[[autodoc]] tokenizers.pre_tokenizers.Metaspace

## PreTokenizer[[tokenizers.pre_tokenizers.PreTokenizer]]

[[autodoc]] tokenizers.pre_tokenizers.PreTokenizer

## Punctuation[[tokenizers.pre_tokenizers.Punctuation]]

[[autodoc]] tokenizers.pre_tokenizers.Punctuation

## Sequence[[tokenizers.pre_tokenizers.Sequence]]

[[autodoc]] tokenizers.pre_tokenizers.Sequence

## Split[[tokenizers.pre_tokenizers.Split]]

[[autodoc]] tokenizers.pre_tokenizers.Split

## UnicodeScripts[[tokenizers.pre_tokenizers.UnicodeScripts]]

[[autodoc]] tokenizers.pre_tokenizers.UnicodeScripts

## Whitespace[[tokenizers.pre_tokenizers.Whitespace]]

[[autodoc]] tokenizers.pre_tokenizers.Whitespace

## WhitespaceSplit[[tokenizers.pre_tokenizers.WhitespaceSplit]]

[[autodoc]] tokenizers.pre_tokenizers.WhitespaceSplit
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/tokenizer.mdx
@@ -0,0 +1,15 @@
# Tokenizer

<tokenizerslangcontent>
<python>
## Tokenizer[[tokenizers.Tokenizer]]

[[autodoc]] tokenizers.Tokenizer
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/trainers.mdx
@@ -0,0 +1,27 @@
# Trainers

<tokenizerslangcontent>
<python>
## BpeTrainer[[tokenizers.trainers.BpeTrainer]]

[[autodoc]] tokenizers.trainers.BpeTrainer

## UnigramTrainer[[tokenizers.trainers.UnigramTrainer]]

[[autodoc]] tokenizers.trainers.UnigramTrainer

## WordLevelTrainer[[tokenizers.trainers.WordLevelTrainer]]

[[autodoc]] tokenizers.trainers.WordLevelTrainer

## WordPieceTrainer[[tokenizers.trainers.WordPieceTrainer]]

[[autodoc]] tokenizers.trainers.WordPieceTrainer
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
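
As a minimal sketch of how a trainer is used (the tiny in-memory corpus and the vocabulary size are stand-ins):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a small BPE tokenizer and train it from an in-memory corpus (stand-in data).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
corpus = ["A first sentence.", "Another sentence to learn merges from."]
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("A first sentence.").tokens)
```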
docs/source-doc-builder/api/visualizer.mdx
@@ -0,0 +1,20 @@
# Visualizer

<tokenizerslangcontent>
<python>
## Annotation[[tokenizers.tools.Annotation]]

[[autodoc]] tokenizers.tools.Annotation

## EncodingVisualizer[[tokenizers.tools.EncodingVisualizer]]

[[autodoc]] tokenizers.tools.EncodingVisualizer
	- __call__
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
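
A minimal usage sketch, assuming a trained tokenizer saved as `tokenizer.json` (a placeholder path) and a Jupyter notebook to render the HTML output:

```python
from tokenizers import Tokenizer
from tokenizers.tools import Annotation, EncodingVisualizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Highlight a span of the raw text with a label while inspecting the tokenization.
annotations = [Annotation(start=0, end=5, label="greeting")]

visualizer = EncodingVisualizer(tokenizer)
visualizer("Hello, how are you?", annotations=annotations)  # renders HTML in a notebook
```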
docs/source-doc-builder/components.mdx
@@ -0,0 +1,152 @@
# Components

When building a Tokenizer, you can attach various types of components to
this Tokenizer in order to customize its behavior. This page lists most
provided components.

## Normalizers

A `Normalizer` is in charge of pre-processing the input string in order
to normalize it as relevant for a given use case. Some common examples
of normalization are the Unicode normalization algorithms (NFD, NFKD,
NFC & NFKC), lowercasing, etc. The specificity of `tokenizers` is that
we keep track of the alignment while normalizing. This is essential to
allow mapping from the generated tokens back to the input text.

The `Normalizer` is optional.

<tokenizerslangcontent>
<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence([NFKC(), Lowercase()])` |
</python>
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence::new(vec![NFKC, Lowercase])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>cleanText</li> <li>handleChineseChars</li> <li>stripAccents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | |
</node>
</tokenizerslangcontent>
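
To make the table above concrete, here is a minimal Python sketch composing a few of these normalizers (illustrative only):

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

# Normalizers in a Sequence run in the order they are given.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

print(normalizer.normalize_str("Héllo hôw are ü?"))
# "hello how are u?"
```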

## Pre-tokenizers

The `PreTokenizer` takes care of splitting the input according to a set
of rules. This pre-processing lets you ensure that the underlying
`Model` does not build tokens across multiple "splits". For example, if
you don't want to have whitespace inside a token, then you can have a
`PreTokenizer` that splits on whitespace.

You can easily combine multiple `PreTokenizer`s together using a
`Sequence` (see below). The `PreTokenizer` is also allowed to modify the
string, just like a `Normalizer` does. This is necessary to allow some
complicated algorithms that require splitting before normalizing (e.g.
the ByteLevel).

<tokenizerslangcontent>
<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>merged_with_previous</li><li>merged_with_next</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | `Sequence([Punctuation(), WhitespaceSplit()])` |
</python>
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>Removed</li><li>Isolated</li><li>MergedWithPrevious</li><li>MergedWithNext</li><li>Contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | `Sequence::new(vec![Punctuation, WhitespaceSplit])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>mergedWithPrevious</li><li>mergedWithNext</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | |
</node>
</tokenizerslangcontent>
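
As an illustrative Python sketch, `pre_tokenize_str` shows the resulting splits and their offsets when composing two of the pre-tokenizers above:

```python
from tokenizers.pre_tokenizers import Digits, Sequence, Whitespace

# Split on word boundaries first, then isolate each digit individually.
pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=True)])

print(pre_tokenizer.pre_tokenize_str("Call 911!"))
# [('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]
```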

## Models

Models are the core algorithms used to actually tokenize, and therefore,
they are the only mandatory component of a Tokenizer.

| Name | Description |
| :--- | :--- |
| WordLevel | This is the “classic” tokenization algorithm. It lets you simply map words to IDs without anything fancy. This has the advantage of being really simple to use and understand, but it requires extremely large vocabularies for a good coverage. Using this `Model` requires the use of a `PreTokenizer`. No choice will be made by this model directly, it simply maps input tokens to IDs. |
| BPE | One of the most popular subword tokenization algorithms. Byte-Pair-Encoding works by starting with characters, while merging those that are the most frequently seen together, thus creating new tokens. It then works iteratively to build new tokens out of the most frequent pairs it sees in a corpus. BPE is able to build words it has never seen by using multiple subword tokens, and thus requires smaller vocabularies, with fewer chances of having “unk” (unknown) tokens. |
| WordPiece | This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It uses a greedy algorithm that tries to build long words first, splitting into multiple tokens when entire words don’t exist in the vocabulary. This is different from BPE, which starts from characters and builds tokens as big as possible. It uses the famous `##` prefix to identify tokens that are part of a word (i.e. not starting a word). |
| Unigram | Unigram is also a subword tokenization algorithm, and works by trying to identify the best set of subword tokens to maximize the probability for a given sentence. This is different from BPE in the way that this is not deterministic based on a set of rules applied sequentially. Instead Unigram will be able to compute multiple ways of tokenizing, while choosing the most probable one. |
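
A minimal Python sketch of attaching a model to a `Tokenizer` (the `[UNK]` token here is a conventional choice, not a requirement):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece

# The model is the only mandatory component: it is passed when creating the Tokenizer.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Swapping in another algorithm only changes this one line, e.g. for BPE:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```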

## Post-Processors

After the whole pipeline, we sometimes want to insert some special
tokens before feeding a tokenized string into a model, like "[CLS] My
horse is amazing [SEP]". The `PostProcessor` is the component doing
just that.

| Name | Description | Example |
| :--- | :--- | :--- |
| TemplateProcessing | Lets you easily template the post processing, adding special tokens and specifying the `type_id` for each sequence/special token. The template is given two strings representing the single sequence and the pair of sequences, as well as a set of special tokens to use. | Example, when specifying a template with these values:<br> <ul> <li> single: `"[CLS] $A [SEP]"` </li> <li> pair: `"[CLS] $A [SEP] $B [SEP]"` </li> <li> special tokens: <ul> <li>`"[CLS]"`</li> <li>`"[SEP]"`</li> </ul> </li> </ul> <br> Input: `("I like this", "but not this")` <br> Output: `"[CLS] I like this [SEP] but not this [SEP]"` |
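
A minimal Python sketch of the template described in the example column (the special-token IDs 1 and 2 are placeholders for the real vocabulary IDs):

```python
from tokenizers.processors import TemplateProcessing

post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # (token, id) pairs; the ids are placeholders
)

# Attach it to an existing tokenizer:
# tokenizer.post_processor = post_processor
```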

## Decoders

The Decoder knows how to go from the IDs used by the Tokenizer, back to
a readable piece of text. Some `Normalizer` and `PreTokenizer` use
special characters or identifiers that need to be reverted, for example.

| Name | Description |
| :--- | :--- |
| ByteLevel | Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using a set of visible Unicode characters to represent each byte, so we need a Decoder to revert this process and get something readable again. |
| Metaspace | Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier `▁` to identify whitespaces, and so this Decoder helps with decoding these. |
| WordPiece | Reverts the WordPiece Model. This model uses a special identifier `##` for continuing subwords, and so this Decoder helps with decoding these. |
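
As an illustrative Python sketch (the file name and IDs are placeholders), a decoder is attached to the tokenizer and used when calling `decode`:

```python
from tokenizers import Tokenizer, decoders

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Revert the WordPiece "##" continuation markers when turning IDs back into text.
tokenizer.decoder = decoders.WordPiece()

print(tokenizer.decode([1, 15, 27, 2]))  # placeholder IDs
```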
docs/source-doc-builder/index.mdx
@@ -0,0 +1,19 @@
<!-- DISABLE-FRONTMATTER-SECTIONS -->

# Tokenizers

Fast State-of-the-art tokenizers, optimized for both research and
production

[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an
implementation of today's most used tokenizers, with a focus on
performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).

# Main features:

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
- Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.
docs/source-doc-builder/installation.mdx
@@ -0,0 +1,89 @@
# Installation

<tokenizerslangcontent>
<python>
🤗 Tokenizers is tested on Python 3.5+.

You should install 🤗 Tokenizers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
unfamiliar with Python virtual environments, check out the [user
guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
Create a virtual environment with the version of Python you're going to
use and activate it.

## Installation with pip

🤗 Tokenizers can be installed using pip as follows:

```bash
pip install tokenizers
```

## Installation from sources

To use this method, you need to have the Rust language installed. You
can follow [the official
guide](https://www.rust-lang.org/learn/get-started) for more
information.

If you are using a Unix-based OS, the installation should be as simple
as running:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Or you can easily update an existing installation with the following command:

```bash
rustup update
```

Once Rust is installed, we can start retrieving the sources for 🤗
Tokenizers:

```bash
git clone https://github.com/huggingface/tokenizers
```

Then we go into the python bindings folder:

```bash
cd tokenizers/bindings/python
```

At this point you should have your virtual environment already
activated. In order to compile 🤗 Tokenizers, you need to install the
Python package `setuptools_rust`:

```bash
pip install setuptools_rust
```

Then you can have 🤗 Tokenizers compiled and installed in your virtual
environment with the following command:

```bash
python setup.py install
```
</python>
<rust>
## Crates.io

🤗 Tokenizers is available on [crates.io](https://crates.io/crates/tokenizers).

You just need to add it to your `Cargo.toml`:

```toml
tokenizers = "0.10"
```
</rust>
<node>
## Installation with npm

You can simply install 🤗 Tokenizers with npm using:

```bash
npm install tokenizers
```
</node>
</tokenizerslangcontent>
docs/source-doc-builder/pipeline.mdx
@@ -0,0 +1,623 @@
# The tokenization pipeline

When calling `Tokenizer.encode` or
`Tokenizer.encode_batch`, the input
text(s) go through the following pipeline:

- `normalization`
- `pre-tokenization`
- `model`
- `post-processing`

We'll see in detail what happens during each of those steps,
as well as when you want to decode some token ids, and how the 🤗 Tokenizers library allows you
to customize each of those steps to your needs. If you're already
familiar with those steps and want to learn by seeing some code, jump to
the BERT from scratch example at the end of this page.

For the examples that require a `Tokenizer`, we will use the tokenizer we trained in the
`quicktour`, which you can load with:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_reload_tokenizer",
"end-before": "END pipeline_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

## Normalization

Normalization is, in a nutshell, a set of operations you apply to a raw
string to make it less random or "cleaner". Common operations include
stripping whitespace, removing accented characters or lowercasing all
text. If you're familiar with [Unicode
normalization](https://unicode.org/reports/tr15), it is also a very
common normalization operation applied in most tokenizers.

Each normalization operation is represented in the 🤗 Tokenizers library
by a `Normalizer`, and you can combine
several of those by using a `normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
and removing accents as an example:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_normalizer",
"end-before": "END pipeline_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

You can manually test that normalizer by applying it to any string:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_normalizer",
"end-before": "END pipeline_test_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
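
In sketch form (a paraphrase for convenience, not the verbatim included test file), this normalizer and the manual test look roughly like:

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

# NFD unicode normalization followed by accent stripping, as described above.
normalizer = normalizers.Sequence([NFD(), StripAccents()])

# Manually testing it on a raw string:
print(normalizer.normalize_str("Héllo hôw are ü?"))
# "Hello how are u?"
```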

When building a `Tokenizer`, you can
customize its normalizer by just changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_normalizer",
"end-before": "END pipeline_replace_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Of course, if you change the way a tokenizer applies normalization, you
should probably retrain it from scratch afterward.

## Pre-Tokenization

Pre-tokenization is the act of splitting a text into smaller objects
that give an upper bound to what your tokens will be at the end of
training. A good way to think of this is that the pre-tokenizer will
split your text into "words" and then, your final tokens will be parts
of those words.

An easy way to pre-tokenize inputs is to split on spaces and
punctuation, which is done by the
`pre_tokenizers.Whitespace`
pre-tokenizer:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_pre_tokenizer",
"end-before": "END pipeline_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
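
In sketch form, this is roughly what the included snippet does:

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you."))
# [('Hello', (0, 5)), ('!', (5, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you', (15, 18)),
#  ('?', (18, 19)), ('I', (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ('fine', (24, 28)),
#  (',', (28, 29)), ('thank', (30, 35)), ('you', (36, 39)), ('.', (39, 40))]
```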

The output is a list of tuples, with each tuple containing one word and
its span in the original sentence (which is used to determine the final
`offsets` of our `Encoding`). Note that splitting on
punctuation will split contractions like `"I'm"` in this example.

You can combine any `PreTokenizer`s together. For instance, here is a pre-tokenizer that will
split on space, punctuation and digits, separating numbers into their
individual digits:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_combine_pre_tokenizer",
"end-before": "END pipeline_combine_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

As we saw in the `quicktour`, you can
customize the pre-tokenizer of a `Tokenizer` by just changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_pre_tokenizer",
"end-before": "END pipeline_replace_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Of course, if you change the pre-tokenizer, you should probably
retrain your tokenizer from scratch afterward.

## Model

Once the input texts are normalized and pre-tokenized, the
`Tokenizer` applies the model on the
pre-tokens. This is the part of the pipeline that needs training on your
corpus (or that has been trained if you are using a pretrained
tokenizer).

The role of the model is to split your "words" into tokens, using the
rules it has learned. It's also responsible for mapping those tokens to
their corresponding IDs in the vocabulary of the model.

This model is passed along when initializing the
`Tokenizer` so you already know how to
customize this part. Currently, the 🤗 Tokenizers library supports:

- `models.BPE`
- `models.Unigram`
- `models.WordLevel`
- `models.WordPiece`

For more details about each model and its behavior, you can check
[here](components.html#models)

## Post-Processing

Post-processing is the last step of the tokenization pipeline, to
perform any additional transformation to the
`Encoding` before it's returned, like
adding potential special tokens.

As we saw in the quick tour, we can customize the post processor of a
`Tokenizer` by setting the
corresponding attribute. For instance, here is how we can post-process
to make the inputs suitable for the BERT model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_processor",
"end-before": "END pipeline_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
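
As an added illustration of the effect (assuming `tokenizer` is the one configured above, with the BERT-style template attached), encoding a pair now places the special tokens where the template says:

```python
# `tokenizer` is assumed to be set up with the BERT-style TemplateProcessing above.
output = tokenizer.encode("Hello, y'all!", "How are you?")
print(output.tokens)
# Roughly: ['[CLS]', ..., '[SEP]', ..., '[SEP]'] — the exact pieces depend on the trained vocabulary.
```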
|
||||||
|
|
||||||
|
Note that contrarily to the pre-tokenizer or the normalizer, you don't
|
||||||
|
need to retrain a tokenizer after changing its post-processor.
|
||||||
|
|
||||||
|
## All together: a BERT tokenizer from scratch
|
||||||
|
|
||||||
|
Let's put all those pieces together to build a BERT tokenizer. First,
|
||||||
|
BERT relies on WordPiece, so we instantiate a new
|
||||||
|
`Tokenizer` with this model:
|
||||||
|
|
||||||
|
<tokenizerslangcontent>
|
||||||
|
<python>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
|
||||||
|
"language": "python",
|
||||||
|
"start-after": "START bert_setup_tokenizer",
|
||||||
|
"end-before": "END bert_setup_tokenizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</python>
|
||||||
|
<rust>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../tokenizers/tests/documentation.rs",
|
||||||
|
"language": "rust",
|
||||||
|
"start-after": "START bert_setup_tokenizer",
|
||||||
|
"end-before": "END bert_setup_tokenizer",
|
||||||
|
"dedent": 4}
|
||||||
|
</literalinclude>
|
||||||
|
</rust>
|
||||||
|
<node>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
|
||||||
|
"language": "js",
|
||||||
|
"start-after": "START bert_setup_tokenizer",
|
||||||
|
"end-before": "END bert_setup_tokenizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</node>
|
||||||
|
</tokenizerslangcontent>
|
||||||
|
|
||||||
|
Then we know that BERT preprocesses texts by removing accents and
|
||||||
|
lowercasing. We also use a unicode normalizer:
|
||||||
|
|
||||||
|
<tokenizerslangcontent>
|
||||||
|
<python>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
|
||||||
|
"language": "python",
|
||||||
|
"start-after": "START bert_setup_normalizer",
|
||||||
|
"end-before": "END bert_setup_normalizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</python>
|
||||||
|
<rust>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../tokenizers/tests/documentation.rs",
|
||||||
|
"language": "rust",
|
||||||
|
"start-after": "START bert_setup_normalizer",
|
||||||
|
"end-before": "END bert_setup_normalizer",
|
||||||
|
"dedent": 4}
|
||||||
|
</literalinclude>
|
||||||
|
</rust>
|
||||||
|
<node>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
|
||||||
|
"language": "js",
|
||||||
|
"start-after": "START bert_setup_normalizer",
|
||||||
|
"end-before": "END bert_setup_normalizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</node>
|
||||||
|
</tokenizerslangcontent>

The pre-tokenizer is just splitting on whitespace and punctuation:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
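
In Python this is a one-liner; `Whitespace` splits on whitespace and punctuation, which matches the behaviour described here:

```python
from tokenizers.pre_tokenizers import Whitespace

bert_tokenizer.pre_tokenizer = Whitespace()
```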

And the post-processing uses the template we saw in the previous
section:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can use this tokenizer and train it on wikitext like in the
`quicktour`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
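
A hedged Python sketch of the training step; the trainer arguments and file paths below are illustrative assumptions, not values taken from this excerpt:

```python
from tokenizers.trainers import WordPieceTrainer

# Special tokens must be declared at training time so they end up in the vocabulary.
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("data/bert-wiki.json")
```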

## Decoding

On top of encoding the input texts, a `Tokenizer` also has an API for decoding, that is, converting IDs
generated by your model back to a text. This is done by the methods
`Tokenizer.decode` (for one predicted text) and `Tokenizer.decode_batch` (for a batch of predictions).

The `decoder` will first convert the IDs back to tokens
(using the tokenizer's vocabulary) and remove all special tokens, then
join those tokens with spaces:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_decoding",
"end-before": "END pipeline_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

If you used a model that added special characters to represent subtokens
of a given "word" (like the `"##"` in
WordPiece), you will need to customize the `decoder` to treat
them properly. If we take our previous `bert_tokenizer` for instance, the
default decoding will give:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

But by changing it to a proper decoder, we get:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
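
A minimal Python sketch of that change, assuming the `bert_tokenizer` built above (the sample sentence and printed result are illustrative):

```python
from tokenizers import decoders

# The WordPiece decoder merges "##"-prefixed subwords back into full words.
bert_tokenizer.decoder = decoders.WordPiece()

output = bert_tokenizer.encode("Welcome to the tokenizers library.")
print(bert_tokenizer.decode(output.ids))
# e.g. "welcome to the tokenizers library."
```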
838
docs/source-doc-builder/quicktour.mdx
Normal file
@ -0,0 +1,838 @@
# Quicktour

Let's have a quick look at the 🤗 Tokenizers library features. The
library provides an implementation of today's most used tokenizers that
is both easy to use and blazing fast.

## Build a tokenizer from scratch

To illustrate how fast the 🤗 Tokenizers library is, let's train a new
tokenizer on [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
(516M of text) in just a few seconds. First things first, you will need
to download this dataset and unzip it with:

``` bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```

### Training the tokenizer

In this tour, we will build and train a Byte-Pair Encoding (BPE)
tokenizer. For more information about the different types of tokenizers,
check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
the 🤗 Transformers documentation. Here, training the tokenizer means it
will learn merge rules by:

- Starting with all the characters present in the training corpus as
tokens.
- Identifying the most common pair of tokens and merging it into one token.
- Repeating until the vocabulary (i.e., the number of tokens) has reached
the size we want.

The main API of the library is the `Tokenizer` class; here is how
we instantiate one with a BPE model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_tokenizer",
"end-before": "END quicktour_init_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
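
In Python, the included snippet corresponds to something like this sketch (the `unk_token` is the conventional choice, assumed here):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```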

To train our tokenizer on the wikitext files, we will need to
instantiate a `trainer`, in this case a
`BpeTrainer`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_trainer",
"end-before": "END quicktour_init_trainer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can set the training arguments like `vocab_size` or `min_frequency` (here
left at their default values of 30,000 and 0), but the most important
part is to give the `special_tokens` we
plan to use later on (they are not used at all during training) so that
they get inserted in the vocabulary.

<Tip>

The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
`"[CLS]"` will get the ID 1 and so forth.

</Tip>
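
A sketch of what such a trainer can look like in Python; the exact token list mirrors the order discussed in the tip above and is an assumption rather than something shown in this excerpt:

```python
from tokenizers.trainers import BpeTrainer

# "[UNK]" gets ID 0, "[CLS]" gets ID 1, and so on.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
```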

We could train our tokenizer right now, but it wouldn't be optimal.
Without a pre-tokenizer that will split our inputs into words, we might
get tokens that overlap several words: for instance we could get an
`"it is"` token since those two words
often appear next to each other. Using a pre-tokenizer will ensure no
token is bigger than a word returned by the pre-tokenizer. Here we want
to train a subword BPE tokenizer, and we will use the easiest
pre-tokenizer possible by splitting on whitespace.

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_pretok",
"end-before": "END quicktour_init_pretok",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Now, we can just call the `Tokenizer.train` method with any list of files we want to use:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_train",
"end-before": "END quicktour_train",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

This should only take a few seconds to train our tokenizer on the full
wikitext dataset! To save the tokenizer in one file that contains all
its configuration and vocabulary, just use the
`Tokenizer.save` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_save",
"end-before": "END quicktour_save",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

and you can reload your tokenizer from that file with the
`Tokenizer.from_file`
classmethod:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 12}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_reload_tokenizer",
"end-before": "END quicktour_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
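
Putting these steps together in Python might look like the following sketch, assuming the `tokenizer` and `trainer` from the snippets above (file names are assumptions):

```python
from tokenizers import Tokenizer

# Train on the unzipped wikitext-103 files, then persist everything to a single JSON file.
files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
tokenizer.save("tokenizer-wiki.json")

# Later, reload the exact same tokenizer (model, normalizer, pre-tokenizer, ...) in one call.
tokenizer = Tokenizer.from_file("tokenizer-wiki.json")
```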

### Using the tokenizer

Now that we have trained a tokenizer, we can use it on any text we want
with the `Tokenizer.encode` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode",
"end-before": "END quicktour_encode",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

This applies the full pipeline of the tokenizer to the text, returning
an `Encoding` object. To learn more
about this pipeline, and how to apply (or customize) parts of it, check out [this page](pipeline.html).

This `Encoding` object then has all the
attributes you need for your deep learning model (or other). The
`tokens` attribute contains the
segmentation of your text into tokens:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_tokens",
"end-before": "END quicktour_print_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Similarly, the `ids` attribute will
contain the index of each of those tokens in the tokenizer's
vocabulary:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_ids",
"end-before": "END quicktour_print_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

An important feature of the 🤗 Tokenizers library is that it comes with
full alignment tracking, meaning you can always get the part of your
original sentence that corresponds to a given token. Those are stored in
the `offsets` attribute of our
`Encoding` object. For instance, let's
assume we want to find what caused the
`"[UNK]"` token to appear, which is the
token at index 9 in the list; we can just ask for the offset at that
index:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_offsets",
"end-before": "END quicktour_print_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

and those are the indices that correspond to the emoji in the original
sentence:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_use_offsets",
"end-before": "END quicktour_use_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
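
Here is a compact Python sketch of this whole subsection; the sample sentence and the printed tokens are illustrative assumptions rather than outputs taken from this excerpt:

```python
sentence = "Hello, y'all! How are you 😁 ?"
output = tokenizer.encode(sentence)

print(output.tokens)  # e.g. ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
print(output.ids)     # the matching indices in the tokenizer's vocabulary

# Offsets map each token back to a (start, end) span in the original string.
start, end = output.offsets[9]
print(sentence[start:end])  # the emoji that became "[UNK]"
```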

### Post-processing

We might want our tokenizer to automatically add special tokens, like
`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
`TemplateProcessing` is the most
commonly used; you just have to specify a template for the processing of
single sentences and pairs of sentences, along with the special tokens
and their IDs.

When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1
and 2 of our list of special tokens, so this should be their IDs. To
double-check, we can use the `Tokenizer.token_to_id` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_check_sep",
"end-before": "END quicktour_check_sep",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Here is how we can set the post-processing to give us the traditional
BERT inputs:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_template_processing",
"end-before": "END quicktour_init_template_processing",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Let's go over this snippet of code in more detail. First we specify
the template for single sentences: those should have the form
`"[CLS] $A [SEP]"` where
`$A` represents our sentence.

Then, we specify the template for sentence pairs, which should have the
form `"[CLS] $A [SEP] $B [SEP]"` where
`$A` represents the first sentence and
`$B` the second one. The
`:1` added in the template represents the `type IDs` we want for each part of our input: it defaults
to 0 for everything (which is why we don't have
`$A:0`) and here we set it to 1 for the
tokens of the second sentence and the last `"[SEP]"` token.

Lastly, we specify the special tokens we used and their IDs in our
tokenizer's vocabulary.
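
As a Python sketch of the snippet being described (the special-token IDs are looked up with `token_to_id` rather than hard-coded):

```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```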

To check that this worked properly, let's try to encode the same
sentence as before:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens",
"end-before": "END quicktour_print_special_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

To check the results on a pair of sentences, we just pass the two
sentences to `Tokenizer.encode`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens_pair",
"end-before": "END quicktour_print_special_tokens_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

You can then check that the type IDs attributed to each token are correct with:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_type_ids",
"end-before": "END quicktour_print_type_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along with it.

### Encoding multiple sentences in a batch

To get the full speed of the 🤗 Tokenizers library, it's best to
process your texts in batches, using the
`Tokenizer.encode_batch` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch",
"end-before": "END quicktour_encode_batch",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

The output is then a list of `Encoding`
objects like the ones we saw before. You can process together as many
texts as you like, as long as they fit in memory.

To process a batch of sentence pairs, pass two lists to the
`Tokenizer.encode_batch` method: the
list of sentences A and the list of sentences B:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch_pair",
"end-before": "END quicktour_encode_batch_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
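
A sketch of both calls in Python; the sentences are made up, and grouping each A sentence with its B sentence as a pair is one way the batched input can be laid out:

```python
# A plain batch of single sentences...
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])

# ...and a batch of sentence pairs, each element grouping a sentence A with its sentence B.
output = tokenizer.encode_batch(
    [("Hello, y'all!", "How are you 😁 ?"), ("Hello to you too!", "I'm fine, thank you!")]
)
```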

When encoding multiple sentences, you can automatically pad the outputs
to the longest sentence present by using
`Tokenizer.enable_padding`, with the
`pad_token` and its ID (which we can
double-check with
`Tokenizer.token_to_id` like before):

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_enable_padding",
"end-before": "END quicktour_enable_padding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can set the `direction` of the padding
(defaults to the right) or a given `length` if we want to pad every sample to that specific number (here
we leave it unset to pad to the size of the longest text).

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_batch_tokens",
"end-before": "END quicktour_print_batch_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

In this case, the `attention mask` generated by the
tokenizer takes the padding into account:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_attention_mask",
"end-before": "END quicktour_print_attention_mask",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
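
A hedged sketch of the padding workflow in Python; `"[PAD]"` and its ID are assumptions that should be double-checked with `token_to_id` as suggested above, and the printed values are illustrative:

```python
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# e.g. ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]', '[PAD]']
print(output[1].attention_mask)
# e.g. [1, 1, 1, 1, 1, 1, 1, 0] -- the 0 marks the padding token
```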

## Pretrained

<tokenizerslangcontent>
<python>
### Using a pretrained tokenizer

You can load any tokenizer from the Hugging Face Hub as long as a
`tokenizer.json` file is available in the repository.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```

### Importing a pretrained tokenizer from legacy vocabulary files

You can also import a pretrained tokenizer directly, as long as you
have its vocabulary file. For instance, here is how to import the
classic pretrained BERT tokenizer:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
```

as long as you have downloaded the file `bert-base-uncased-vocab.txt` with

```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
```
</python>
</tokenizerslangcontent>
116
docs/source-doc-builder/training_from_memory.mdx
Normal file
@ -0,0 +1,116 @@
# Training from memory

In the [Quicktour](quicktour.html), we saw how to build and train a
tokenizer using text files, but we can actually use any Python Iterator.
In this section we'll see a few different ways of training our
tokenizer.

For all the examples listed below, we'll use the same [`~tokenizers.Tokenizer`] and
[`~tokenizers.trainers.Trainer`], built as
follows:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START init_tokenizer_trainer",
"end-before": "END init_tokenizer_trainer",
"dedent": 8}
</literalinclude>

This tokenizer is based on the [`~tokenizers.models.Unigram`] model. It
takes care of normalizing the input using the NFKC Unicode normalization
method, and uses a [`~tokenizers.pre_tokenizers.ByteLevel`] pre-tokenizer with the corresponding decoder.
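
Since the included snippet isn't visible in this excerpt, here is a rough Python sketch of such a setup; the trainer arguments and special tokens are assumptions:

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.UnigramTrainer(
    vocab_size=20000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<PAD>", "<BOS>", "<EOS>"],
)
```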

For more information on the components used here, you can check
[here](components.html).

## The most basic way

As you probably guessed already, the easiest way to train our tokenizer
is by using a `List`:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_basic",
"end-before": "END train_basic",
"dedent": 8}
</literalinclude>

Easy, right? You can use anything working as an iterator here, be it a
`List`, a `Tuple`, or a `np.Array`. Anything
works as long as it provides strings.
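
For instance, a sketch of training straight from an in-memory list of strings (the sentences are just placeholders):

```python
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
]
tokenizer.train_from_iterator(data, trainer=trainer)
```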

## Using the 🤗 Datasets library

An awesome way to access one of the many datasets that exist out there
is by using the 🤗 Datasets library. For more information about it, you
should check [the official documentation
here](https://huggingface.co/docs/datasets/).

Let's start by loading our dataset:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START load_dataset",
"end-before": "END load_dataset",
"dedent": 8}
</literalinclude>

The next step is to build an iterator over this dataset. The easiest way
to do this is probably by using a generator:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START def_batch_iterator",
"end-before": "END def_batch_iterator",
"dedent": 8}
</literalinclude>

As you can see here, for improved efficiency we can actually provide a
batch of examples to train on, instead of iterating over them one by
one. By doing so, we can expect performance very similar to what we
got while training directly from files.

With our iterator ready, we just need to launch the training. In order
to improve the look of our progress bars, we can specify the total
length of the dataset:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_datasets",
"end-before": "END train_datasets",
"dedent": 8}
</literalinclude>
And that's it!
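
Put together, the workflow sketched in this subsection might look like the following (the dataset name, text column, and batch size are assumptions):

```python
import datasets

dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train+test+validation")

def batch_iterator(batch_size=1000):
    # Yield lists of raw texts instead of single examples for better throughput.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
```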

## Using gzip files

Since gzip files in Python can be used as iterators, it is extremely
simple to train on such files:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START single_gzip",
"end-before": "END single_gzip",
"dedent": 8}
</literalinclude>

Now if we wanted to train from multiple gzip files, it wouldn't be much
harder:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START multi_gzip",
"end-before": "END multi_gzip",
"dedent": 8}
</literalinclude>
And voilà!
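
For reference, a sketch of what those two snippets can look like (the file names are made up):

```python
import gzip

# Single gzip file: the open file object is itself an iterator over lines.
with gzip.open("data/my-file.0.gz", "rt") as f:
    tokenizer.train_from_iterator(f, trainer=trainer)

# Several gzip files: chain them into one generator of lines.
def gzip_lines(paths):
    for path in paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                yield line

files = ["data/my-file.0.gz", "data/my-file.1.gz", "data/my-file.2.gz"]
tokenizer.train_from_iterator(gzip_lines(files), trainer=trainer)
```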