Init new docs

This commit is contained in:
Mishig Davaadorj
2022-04-18 09:37:14 +02:00
parent 66c9af26f6
commit 6eda286ab1
19 changed files with 2269 additions and 0 deletions

View File

@ -0,0 +1,40 @@
- sections:
- local: index
title: 🤗 Tokenizers
- local: quicktour
title: Quicktour
- local: installation
title: Installation
- local: pipeline
title: The tokenization pipeline
- local: components
title: Components
- local: training_from_memory
title: Training from memory
title: Getting started
- sections:
- local: api/input-sequences
title: Input Sequences
- local: api/encode-inputs
title: Encode Inputs
- local: api/tokenizer
title: Tokenizer
- local: api/encoding
title: Encoding
- local: api/added-tokens
title: Added Tokens
- local: api/models
title: Models
- local: api/normalizers
title: Normalizers
- local: api/pre-tokenizers
title: Pre-tokenizers
- local: api/post-processors
title: Post-processors
- local: api/trainers
title: Trainers
- local: api/decoders
title: Decoders
- local: api/visualizer
title: Visualizer
title: API

View File

@ -0,0 +1,15 @@
# Added Tokens
<tokenizerslangcontent>
<python>
## AddedToken[[tokenizers.AddedToken]]
[[autodoc]] tokenizers.AddedToken
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,31 @@
# Decoders
<tokenizerslangcontent>
<python>
## BPEDecoder[[tokenizers.decoders.BPEDecoder]]
[[autodoc]] tokenizers.decoders.BPEDecoder
## ByteLevel[[tokenizers.decoders.ByteLevel]]
[[autodoc]] tokenizers.decoders.ByteLevel
## CTC[[tokenizers.decoders.CTC]]
[[autodoc]] tokenizers.decoders.CTC
## Metaspace[[tokenizers.decoders.Metaspace]]
[[autodoc]] tokenizers.decoders.Metaspace
## WordPiece[[tokenizers.decoders.WordPiece]]
[[autodoc]] tokenizers.decoders.WordPiece
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,48 @@
# Encode Inputs
<tokenizerslangcontent>
<python>
These types represent all the different kinds of input that a [`~tokenizers.Tokenizer`] accepts
when using [`~tokenizers.Tokenizer.encode_batch`].
## TextEncodeInput[[[[tokenizers.TextEncodeInput]]]]
<code>tokenizers.TextEncodeInput</code>
Represents a textual input for encoding. Can be either:
- A single sequence: [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence)
- A pair of sequences:
- A Tuple of [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence)
- Or a List of [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence) of size 2
alias of `Union[str, Tuple[str, str], List[str]]`.
## PreTokenizedEncodeInput[[[[tokenizers.PreTokenizedEncodeInput]]]]
<code>tokenizers.PreTokenizedEncodeInput</code>
Represents a pre-tokenized input for encoding. Can be either:
- A single sequence: [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence)
- A pair of sequences:
- A Tuple of [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence)
- Or a List of [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence) of size 2
alias of `Union[List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]`.
## EncodeInput[[[[tokenizers.EncodeInput]]]]
<code>tokenizers.EncodeInput</code>
Represents all the possible types of input for encoding. Can be:
- When `is_pretokenized=False`: [TextEncodeInput](#tokenizers.TextEncodeInput)
- When `is_pretokenized=True`: [PreTokenizedEncodeInput](#tokenizers.PreTokenizedEncodeInput)
alias of `Union[str, Tuple[str, str], List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]`.
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,15 @@
# Encoding
<tokenizerslangcontent>
<python>
## Encoding[[tokenizers.Encoding]]
[[autodoc]] tokenizers.Encoding
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,41 @@
# Input Sequences
<tokenizerslangcontent>
<python>
These types represent all the different kinds of sequence that can be used as input of a Tokenizer.
Globally, any sequence can be either a string or a list of strings, according to the operating
mode of the tokenizer: `raw text` vs `pre-tokenized`.
## TextInputSequence[[tokenizers.TextInputSequence]]
<code>tokenizers.TextInputSequence</code>
A `str` that represents an input sequence
## PreTokenizedInputSequence[[tokenizers.PreTokenizedInputSequence]]
<code>tokenizers.PreTokenizedInputSequence</code>
A pre-tokenized input sequence. Can be one of:
- A `List` of `str`
- A `Tuple` of `str`
alias of `Union[List[str], Tuple[str]]`.
## InputSequence[[tokenizers.InputSequence]]
<code>tokenizers.InputSequence</code>
Represents all the possible types of input sequences for encoding. Can be:
- When `is_pretokenized=False`: [TextInputSequence](#tokenizers.TextInputSequence)
- When `is_pretokenized=True`: [PreTokenizedInputSequence](#tokenizers.PreTokenizedInputSequence)
alias of `Union[str, List[str], Tuple[str]]`.
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,31 @@
# Models
<tokenizerslangcontent>
<python>
## BPE[[tokenizers.models.BPE]]
[[autodoc]] tokenizers.models.BPE
## Model[[tokenizers.models.Model]]
[[autodoc]] tokenizers.models.Model
## Unigram[[tokenizers.models.Unigram]]
[[autodoc]] tokenizers.models.Unigram
## WordLevel[[tokenizers.models.WordLevel]]
[[autodoc]] tokenizers.models.WordLevel
## WordPiece[[tokenizers.models.WordPiece]]
[[autodoc]] tokenizers.models.WordPiece
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,63 @@
# Normalizers
<tokenizerslangcontent>
<python>
## BertNormalizer[[tokenizers.normalizers.BertNormalizer]]
[[autodoc]] tokenizers.normalizers.BertNormalizer
## Lowercase[[tokenizers.normalizers.Lowercase]]
[[autodoc]] tokenizers.normalizers.Lowercase
## NFC[[tokenizers.normalizers.NFC]]
[[autodoc]] tokenizers.normalizers.NFC
## NFD[[tokenizers.normalizers.NFD]]
[[autodoc]] tokenizers.normalizers.NFD
## NFKC[[tokenizers.normalizers.NFKC]]
[[autodoc]] tokenizers.normalizers.NFKC
## NFKD[[tokenizers.normalizers.NFKD]]
[[autodoc]] tokenizers.normalizers.NFKD
## Nmt[[tokenizers.normalizers.Nmt]]
[[autodoc]] tokenizers.normalizers.Nmt
## Normalizer[[tokenizers.normalizers.Normalizer]]
[[autodoc]] tokenizers.normalizers.Normalizer
## Precompiled[[tokenizers.normalizers.Precompiled]]
[[autodoc]] tokenizers.normalizers.Precompiled
## Replace[[tokenizers.normalizers.Replace]]
[[autodoc]] tokenizers.normalizers.Replace
## Sequence[[tokenizers.normalizers.Sequence]]
[[autodoc]] tokenizers.normalizers.Sequence
## Strip[[tokenizers.normalizers.Strip]]
[[autodoc]] tokenizers.normalizers.Strip
## StripAccents[[tokenizers.normalizers.StripAccents]]
[[autodoc]] tokenizers.normalizers.StripAccents
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,27 @@
# Post-processors
<tokenizerslangcontent>
<python>
## BertProcessing[[tokenizers.processors.BertProcessing]]
[[autodoc]] tokenizers.processors.BertProcessing
## ByteLevel[[tokenizers.processors.ByteLevel]]
[[autodoc]] tokenizers.processors.ByteLevel
## RobertaProcessing[[tokenizers.processors.RobertaProcessing]]
[[autodoc]] tokenizers.processors.RobertaProcessing
## TemplateProcessing[[tokenizers.processors.TemplateProcessing]]
[[autodoc]] tokenizers.processors.TemplateProcessing
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,59 @@
# Pre-tokenizers
<tokenizerslangcontent>
<python>
## BertPreTokenizer[[tokenizers.pre_tokenizers.BertPreTokenizer]]
[[autodoc]] tokenizers.pre_tokenizers.BertPreTokenizer
## ByteLevel[[tokenizers.pre_tokenizers.ByteLevel]]
[[autodoc]] tokenizers.pre_tokenizers.ByteLevel
## CharDelimiterSplit[[tokenizers.pre_tokenizers.CharDelimiterSplit]]
[[autodoc]] tokenizers.pre_tokenizers.CharDelimiterSplit
## Digits[[tokenizers.pre_tokenizers.Digits]]
[[autodoc]] tokenizers.pre_tokenizers.Digits
## Metaspace[[tokenizers.pre_tokenizers.Metaspace]]
[[autodoc]] tokenizers.pre_tokenizers.Metaspace
## PreTokenizer[[tokenizers.pre_tokenizers.PreTokenizer]]
[[autodoc]] tokenizers.pre_tokenizers.PreTokenizer
## Punctuation[[tokenizers.pre_tokenizers.Punctuation]]
[[autodoc]] tokenizers.pre_tokenizers.Punctuation
## Sequence[[tokenizers.pre_tokenizers.Sequence]]
[[autodoc]] tokenizers.pre_tokenizers.Sequence
## Split[[tokenizers.pre_tokenizers.Split]]
[[autodoc]] tokenizers.pre_tokenizers.Split
## UnicodeScripts[[tokenizers.pre_tokenizers.UnicodeScripts]]
[[autodoc]] tokenizers.pre_tokenizers.UnicodeScripts
## Whitespace[[tokenizers.pre_tokenizers.Whitespace]]
[[autodoc]] tokenizers.pre_tokenizers.Whitespace
## WhitespaceSplit[[tokenizers.pre_tokenizers.WhitespaceSplit]]
[[autodoc]] tokenizers.pre_tokenizers.WhitespaceSplit
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,15 @@
# Tokenizer
<tokenizerslangcontent>
<python>
## Tokenizer[[tokenizers.Tokenizer]]
[[autodoc]] tokenizers.Tokenizer
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,27 @@
# Trainers
<tokenizerslangcontent>
<python>
## BpeTrainer[[tokenizers.trainers.BpeTrainer]]
[[autodoc]] tokenizers.trainers.BpeTrainer
## UnigramTrainer[[tokenizers.trainers.UnigramTrainer]]
[[autodoc]] tokenizers.trainers.UnigramTrainer
## WordLevelTrainer[[tokenizers.trainers.WordLevelTrainer]]
[[autodoc]] tokenizers.trainers.WordLevelTrainer
## WordPieceTrainer[[tokenizers.trainers.WordPieceTrainer]]
[[autodoc]] tokenizers.trainers.WordPieceTrainer
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,20 @@
# Visualizer
<tokenizerslangcontent>
<python>
## Annotation[[tokenizers.tools.Annotation]]
[[autodoc]] tokenizers.tools.Annotation
## EncodingVisualizer[[tokenizers.tools.EncodingVisualizer]]
[[autodoc]] tokenizers.tools.EncodingVisualizer
- __call__
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,152 @@
# Components
When building a Tokenizer, you can attach various types of components to
this Tokenizer in order to customize its behavior. This page lists most
provided components.
## Normalizers
A `Normalizer` is in charge of pre-processing the input string in order
to normalize it as relevant for a given use case. Some common examples
of normalization are the Unicode normalization algorithms (NFD, NFKD,
NFC & NFKC), lowercasing etc... The specificity of `tokenizers` is that
we keep track of the alignment while normalizing. This is essential to
allow mapping from the generated tokens back to the input text.
The `Normalizer` is optional.
<tokenizerslangcontent>
<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase to lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello`ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `"`hi`"` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Ouput: `e` |
| Replace | Replaces a custom string or regexp and changes it with given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Ouput: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence([NFKC(), Lowercase()])` |
</python>
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase to lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello`ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `"`hi`"` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Ouput: `e` |
| Replace | Replaces a custom string or regexp and changes it with given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Ouput: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence::new(vec![NFKC, Lowercase])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase to lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello`ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `"`hi`"` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Ouput: `e` |
| Replace | Replaces a custom string or regexp and changes it with given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Ouput: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>cleanText</li> <li>handleChineseChars</li> <li>stripAccents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | |
</node>
</tokenizerslangcontent>
## Pre-tokenizers
The `PreTokenizer` takes care of splitting the input according to a set
of rules. This pre-processing lets you ensure that the underlying
`Model` does not build tokens across multiple "splits". For example if
you don't want to have whitespaces inside a token, then you can have a
`PreTokenizer` that splits on these whitespaces.
You can easily combine multiple `PreTokenizer` together using a
`Sequence` (see below). The `PreTokenizer` is also allowed to modify the
string, just like a `Normalizer` does. This is necessary to allow some
complicated algorithms that require to split before normalizing (e.g.
the ByteLevel)
<tokenizerslangcontent>
<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique as been introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non ascii characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Ouput: `"Hello", "Ġmy", Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+&#124;[^\w\s]+` | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Ouput: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Ouput: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Ouput: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on provided pattern and according to provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>merged_with_previous</li><li>merged_with_next</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | `Sequence([Punctuation(), WhitespaceSplit()])` |
</python>
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique as been introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non ascii characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Ouput: `"Hello", "Ġmy", Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+&#124;[^\w\s]+` | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Ouput: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Ouput: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Ouput: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on provided pattern and according to provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>Removed</li><li>Isolated</li><li>MergedWithPrevious</li><li>MergedWithNext</li><li>Contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | `Sequence::new(vec![Punctuation, WhitespaceSplit])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique as been introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non ascii characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Ouput: `"Hello", "Ġmy", Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+&#124;[^\w\s]+` | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Ouput: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Ouput: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Ouput: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on provided pattern and according to provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>mergedWithPrevious</li><li>mergedWithNext</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | |
</node>
</tokenizerslangcontent>
## Models
Models are the core algorithms used to actually tokenize, and therefore,
they are the only mandatory component of a Tokenizer.
| Name | Description |
| :--- | :--- |
| WordLevel | This is the “classic” tokenization algorithm. It lets you simply map words to IDs without anything fancy. This has the advantage of being really simple to use and understand, but it requires extremely large vocabularies for a good coverage. Using this `Model` requires the use of a `PreTokenizer`. No choice will be made by this model directly, it simply maps input tokens to IDs. |
| BPE | One of the most popular subword tokenization algorithm. The Byte-Pair-Encoding works by starting with characters, while merging those that are the most frequently seen together, thus creating new tokens. It then works iteratively to build new tokens out of the most frequent pairs it sees in a corpus. BPE is able to build words it has never seen by using multiple subword tokens, and thus requires smaller vocabularies, with less chances of having “unk” (unknown) tokens. |
| WordPiece | This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It uses a greedy algorithm, that tries to build long words first, splitting in multiple tokens when entire words dont exist in the vocabulary. This is different from BPE that starts from characters, building bigger tokens as possible. It uses the famous `##` prefix to identify tokens that are part of a word (ie not starting a word). |
| Unigram | Unigram is also a subword tokenization algorithm, and works by trying to identify the best set of subword tokens to maximize the probability for a given sentence. This is different from BPE in the way that this is not deterministic based on a set of rules applied sequentially. Instead Unigram will be able to compute multiple ways of tokenizing, while choosing the most probable one. |
## Post-Processors
After the whole pipeline, we sometimes want to insert some special
tokens before feed a tokenized string into a model like "[CLS] My
horse is amazing [SEP]". The `PostProcessor` is the component doing
just that.
| Name | Description | Example |
| :--- | :--- | :--- |
| TemplateProcessing | Lets you easily template the post processing, adding special tokens, and specifying the `type_id` for each sequence/special token. The template is given two strings representing the single sequence and the pair of sequences, as well as a set of special tokens to use. | Example, when specifying a template with these values:<br> <ul> <li> single: `"[CLS] $A [SEP]"` </li> <li> pair: `"[CLS] $A [SEP] $B [SEP]"` </li> <li> special tokens: <ul> <li>`"[CLS]"`</li> <li>`"[SEP]"`</li> </ul> </li> </ul> <br> Input: `("I like this", "but not this")` <br> Output: `"[CLS] I like this [SEP] but not this [SEP]"` |
## Decoders
The Decoder knows how to go from the IDs used by the Tokenizer, back to
a readable piece of text. Some `Normalizer` and `PreTokenizer` use
special characters or identifiers that need to be reverted for example.
| Name | Description |
| :--- | :--- |
| ByteLevel | Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using a set of visible Unicode characters to represent each byte, so we need a Decoder to revert this process and get something readable again. |
| Metaspace | Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifer `▁` to identify whitespaces, and so this Decoder helps with decoding these. |
| WordPiece | Reverts the WordPiece Model. This model uses a special identifier `##` for continuing subwords, and so this Decoder helps with decoding these. |

View File

@ -0,0 +1,19 @@
<!-- DISABLE-FRONTMATTER-SECTIONS -->
# Tokenizers
Fast State-of-the-art tokenizers, optimized for both research and
production
[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an
implementation of today's most used tokenizers, with a focus on
performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).
# Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
- Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.

View File

@ -0,0 +1,89 @@
# Installation
<tokenizerslangcontent>
<python>
🤗 Tokenizers is tested on Python 3.5+.
You should install 🤗 Tokenizers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
unfamiliar with Python virtual environments, check out the [user
guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
Create a virtual environment with the version of Python you're going to
use and activate it.
## Installation with pip
🤗 Tokenizers can be installed using pip as follows:
```bash
pip install tokenizers
```
## Installation from sources
To use this method, you need to have the Rust language installed. You
can follow [the official
guide](https://www.rust-lang.org/learn/get-started) for more
information.
If you are using a unix based OS, the installation should be as simple
as running:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
Or you can easiy update it with the following command:
```bash
rustup update
```
Once rust is installed, we can start retrieving the sources for 🤗
Tokenizers:
```bash
git clone https://github.com/huggingface/tokenizers
```
Then we go into the python bindings folder:
```bash
cd tokenizers/bindings/python
```
At this point you should have your [virtual environment]() already
activated. In order to compile 🤗 Tokenizers, you need to install the
Python package `setuptools_rust`:
```bash
pip install setuptools_rust
```
Then you can have 🤗 Tokenizers compiled and installed in your virtual
environment with the following command:
```bash
python setup.py install
```
</python>
<rust>
## Crates.io
🤗 Tokenizers is available on [crates.io](https://crates.io/crates/tokenizers).
You just need to add it to your `Cargo.toml`:
```bash
tokenizers = "0.10"
```
</rust>
<node>
## Installation with npm
You can simply install 🤗 Tokenizers with npm using:
```bash
npm install tokenizers
```
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,623 @@
# The tokenization pipeline
When calling `Tokenizer.encode` or
`Tokenizer.encode_batch`, the input
text(s) go through the following pipeline:
- `normalization`
- `pre-tokenization`
- `model`
- `post-processing`
We'll see in details what happens during each of those steps in detail,
as well as when you want to `decode <decoding>` some token ids, and how the 🤗 Tokenizers library allows you
to customize each of those steps to your needs. If you're already
familiar with those steps and want to learn by seeing some code, jump to
`our BERT from scratch example <example>`.
For the examples that require a `Tokenizer` we will use the tokenizer we trained in the
`quicktour`, which you can load with:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_reload_tokenizer",
"end-before": "END pipeline_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
## Normalization
Normalization is, in a nutshell, a set of operations you apply to a raw
string to make it less random or "cleaner". Common operations include
stripping whitespace, removing accented characters or lowercasing all
text. If you're familiar with [Unicode
normalization](https://unicode.org/reports/tr15), it is also a very
common normalization operation applied in most tokenizers.
Each normalization operation is represented in the 🤗 Tokenizers library
by a `Normalizer`, and you can combine
several of those by using a `normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
and removing accents as an example:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_normalizer",
"end-before": "END pipeline_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
You can manually test that normalizer by applying it to any string:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_normalizer",
"end-before": "END pipeline_test_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
When building a `Tokenizer`, you can
customize its normalizer by just changing the corresponding attribute:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_normalizer",
"end-before": "END pipeline_replace_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Of course, if you change the way a tokenizer applies normalization, you
should probably retrain it from scratch afterward.
## Pre-Tokenization
Pre-tokenization is the act of splitting a text into smaller objects
that give an upper bound to what your tokens will be at the end of
training. A good way to think of this is that the pre-tokenizer will
split your text into "words" and then, your final tokens will be parts
of those words.
An easy way to pre-tokenize inputs is to split on spaces and
punctuations, which is done by the
`pre_tokenizers.Whitespace`
pre-tokenizer:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_pre_tokenizer",
"end-before": "END pipeline_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
The output is a list of tuples, with each tuple containing one word and
its span in the original sentence (which is used to determine the final
`offsets` of our `Encoding`). Note that splitting on
punctuation will split contractions like `"I'm"` in this example.
You can combine together any `PreTokenizer` together. For instance, here is a pre-tokenizer that will
split on space, punctuation and digits, separating numbers in their
individual digits:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_combine_pre_tokenizer",
"end-before": "END pipeline_combine_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
As we saw in the `quicktour`, you can
customize the pre-tokenizer of a `Tokenizer` by just changing the corresponding attribute:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_pre_tokenizer",
"end-before": "END pipeline_replace_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Of course, if you change the way the pre-tokenizer, you should probably
retrain your tokenizer from scratch afterward.
## Model
Once the input texts are normalized and pre-tokenized, the
`Tokenizer` applies the model on the
pre-tokens. This is the part of the pipeline that needs training on your
corpus (or that has been trained if you are using a pretrained
tokenizer).
The role of the model is to split your "words" into tokens, using the
rules it has learned. It's also responsible for mapping those tokens to
their corresponding IDs in the vocabulary of the model.
This model is passed along when intializing the
`Tokenizer` so you already know how to
customize this part. Currently, the 🤗 Tokenizers library supports:
- `models.BPE`
- `models.Unigram`
- `models.WordLevel`
- `models.WordPiece`
For more details about each model and its behavior, you can check
[here](components.html#models)
## Post-Processing
Post-processing is the last step of the tokenization pipeline, to
perform any additional transformation to the
`Encoding` before it's returned, like
adding potential special tokens.
As we saw in the quick tour, we can customize the post processor of a
`Tokenizer` by setting the
corresponding attribute. For instance, here is how we can post-process
to make the inputs suitable for the BERT model:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_processor",
"end-before": "END pipeline_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Note that contrarily to the pre-tokenizer or the normalizer, you don't
need to retrain a tokenizer after changing its post-processor.
## All together: a BERT tokenizer from scratch
Let's put all those pieces together to build a BERT tokenizer. First,
BERT relies on WordPiece, so we instantiate a new
`Tokenizer` with this model:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_tokenizer",
"end-before": "END bert_setup_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Then we know that BERT preprocesses texts by removing accents and
lowercasing. We also use a unicode normalizer:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_normalizer",
"end-before": "END bert_setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
The pre-tokenizer is just splitting on whitespace and punctuation:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
And the post-processing uses the template we saw in the previous
section:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
We can use this tokenizer and train on it on wikitext like in the
`quicktour`:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
## Decoding
On top of encoding the input texts, a `Tokenizer` also has an API for decoding, that is converting IDs
generated by your model back to a text. This is done by the methods
`Tokenizer.decode` (for one predicted text) and `Tokenizer.decode_batch` (for a batch of predictions).
The [decoder]{.title-ref} will first convert the IDs back to tokens
(using the tokenizer's vocabulary) and remove all special tokens, then
join those tokens with spaces:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_decoding",
"end-before": "END pipeline_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
If you used a model that added special characters to represent subtokens
of a given "word" (like the `"##"` in
WordPiece) you will need to customize the [decoder]{.title-ref} to treat
them properly. If we take our previous `bert_tokenizer` for instance the
default decoing will give:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
But by changing it to a proper decoder, we get:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

View File

@ -0,0 +1,838 @@
# Quicktour
Let's have a quick look at the 🤗 Tokenizers library features. The
library provides an implementation of today's most used tokenizers that
is both easy to use and blazing fast.
## Build a tokenizer from scratch
To illustrate how fast the 🤗 Tokenizers library is, let's train a new
tokenizer on [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
(516M of text) in just a few seconds. First things first, you will need
to download this dataset and unzip it with:
``` bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
### Training the tokenizer
In this tour, we will build and train a Byte-Pair Encoding (BPE)
tokenizer. For more information about the different type of tokenizers,
check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
the 🤗 Transformers documentation. Here, training the tokenizer means it
will learn merge rules by:
- Start with all the characters present in the training corpus as
tokens.
- Identify the most common pair of tokens and merge it into one token.
- Repeat until the vocabulary (e.g., the number of tokens) has reached
the size we want.
The main API of the library is the `class` `Tokenizer`, here is how
we instantiate one with a BPE model:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_tokenizer",
"end-before": "END quicktour_init_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
To train our tokenizer on the wikitext files, we will need to
instantiate a [trainer]{.title-ref}, in this case a
`BpeTrainer`
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_trainer",
"end-before": "END quicktour_init_trainer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
We can set the training arguments like `vocab_size` or `min_frequency` (here
left at their default values of 30,000 and 0) but the most important
part is to give the `special_tokens` we
plan to use later on (they are not used at all during training) so that
they get inserted in the vocabulary.
<Tip>
The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
`"[CLS]"` will get the ID 1 and so forth.
</Tip>
We could train our tokenizer right now, but it wouldn't be optimal.
Without a pre-tokenizer that will split our inputs into words, we might
get tokens that overlap several words: for instance we could get an
`"it is"` token since those two words
often appear next to each other. Using a pre-tokenizer will ensure no
token is bigger than a word returned by the pre-tokenizer. Here we want
to train a subword BPE tokenizer, and we will use the easiest
pre-tokenizer possible by splitting on whitespace.
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_pretok",
"end-before": "END quicktour_init_pretok",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Now, we can just call the `Tokenizer.train` method with any list of files we want to use:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_train",
"end-before": "END quicktour_train",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
This should only take a few seconds to train our tokenizer on the full
wikitext dataset! To save the tokenizer in one file that contains all
its configuration and vocabulary, just use the
`Tokenizer.save` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_save",
"end-before": "END quicktour_save",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
and you can reload your tokenizer from that file with the
`Tokenizer.from_file`
`classmethod`:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 12}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_reload_tokenizer",
"end-before": "END quicktour_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
### Using the tokenizer
Now that we have trained a tokenizer, we can use it on any text we want
with the `Tokenizer.encode` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode",
"end-before": "END quicktour_encode",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
This applied the full pipeline of the tokenizer on the text, returning
an `Encoding` object. To learn more
about this pipeline, and how to apply (or customize) parts of it, check out `this page <pipeline>`.
This `Encoding` object then has all the
attributes you need for your deep learning model (or other). The
`tokens` attribute contains the
segmentation of your text in tokens:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_tokens",
"end-before": "END quicktour_print_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Similarly, the `ids` attribute will
contain the index of each of those tokens in the tokenizer's
vocabulary:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_ids",
"end-before": "END quicktour_print_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
An important feature of the 🤗 Tokenizers library is that it comes with
full alignment tracking, meaning you can always get the part of your
original sentence that corresponds to a given token. Those are stored in
the `offsets` attribute of our
`Encoding` object. For instance, let's
assume we would want to find back what caused the
`"[UNK]"` token to appear, which is the
token at index 9 in the list, we can just ask for the offset at the
index:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_offsets",
"end-before": "END quicktour_print_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
and those are the indices that correspond to the emoji in the original
sentence:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_use_offsets",
"end-before": "END quicktour_use_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
### Post-processing
We might want our tokenizer to automatically add special tokens, like
`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
`TemplateProcessing` is the most
commonly used, you just have to specify a template for the processing of
single sentences and pairs of sentences, along with the special tokens
and their IDs.
When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1
and 2 of our list of special tokens, so this should be their IDs. To
double-check, we can use the `Tokenizer.token_to_id` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_check_sep",
"end-before": "END quicktour_check_sep",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Here is how we can set the post-processing to give us the traditional
BERT inputs:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_template_processing",
"end-before": "END quicktour_init_template_processing",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Let's go over this snippet of code in more details. First we specify
the template for single sentences: those should have the form
`"[CLS] $A [SEP]"` where
`$A` represents our sentence.
Then, we specify the template for sentence pairs, which should have the
form `"[CLS] $A [SEP] $B [SEP]"` where
`$A` represents the first sentence and
`$B` the second one. The
`:1` added in the template represent the `type IDs` we want for each part of our input: it defaults
to 0 for everything (which is why we don't have
`$A:0`) and here we set it to 1 for the
tokens of the second sentence and the last `"[SEP]"` token.
Lastly, we specify the special tokens we used and their IDs in our
tokenizer's vocabulary.
To check out this worked properly, let's try to encode the same
sentence as before:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens",
"end-before": "END quicktour_print_special_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
To check the results on a pair of sentences, we just pass the two
sentences to `Tokenizer.encode`:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens_pair",
"end-before": "END quicktour_print_special_tokens_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
You can then check the type IDs attributed to each token is correct with
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_type_ids",
"end-before": "END quicktour_print_type_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along.
### Encoding multiple sentences in a batch
To get the full speed of the 🤗 Tokenizers library, it's best to
process your texts by batches by using the
`Tokenizer.encode_batch` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch",
"end-before": "END quicktour_encode_batch",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
The output is then a list of `Encoding`
objects like the ones we saw before. You can process together as many
texts as you like, as long as it fits in memory.
To process a batch of sentences pairs, pass two lists to the
`Tokenizer.encode_batch` method: the
list of sentences A and the list of sentences B:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch_pair",
"end-before": "END quicktour_encode_batch_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
When encoding multiple sentences, you can automatically pad the outputs
to the longest sentence present by using
`Tokenizer.enable_padding`, with the
`pad_token` and its ID (which we can
double-check the id for the padding token with
`Tokenizer.token_to_id` like before):
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_enable_padding",
"end-before": "END quicktour_enable_padding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
We can set the `direction` of the padding
(defaults to the right) or a given `length` if we want to pad every sample to that specific number (here
we leave it unset to pad to the size of the longest text).
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_batch_tokens",
"end-before": "END quicktour_print_batch_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
In this case, the `attention mask` generated by the
tokenizer takes the padding into account:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_attention_mask",
"end-before": "END quicktour_print_attention_mask",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
## Pretrained
<tokenizerslangcontent>
<python>
### Using a pretrained tokenizer
You can load any tokenizer from the Hugging Face Hub as long as a
`tokenizer.json` file is available in the repository.
```python
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```
### Importing a pretrained tokenizer from legacy vocabulary files
You can also import a pretrained tokenizer directly in, as long as you
have its vocabulary file. For instance, here is how to import the
classic pretrained BERT tokenizer:
```python
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
```
as long as you have downloaded the file `bert-base-uncased-vocab.txt` with
```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
```
</python>
</tokenizerslangcontent>

View File

@ -0,0 +1,116 @@
# Training from memory
In the [Quicktour](quicktour.html), we saw how to build and train a
tokenizer using text files, but we can actually use any Python Iterator.
In this section we'll see a few different ways of training our
tokenizer.
For all the examples listed below, we'll use the same [`~tokenizers.Tokenizer`] and
[`~tokenizers.trainers.Trainer`], built as
following:
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START init_tokenizer_trainer",
"end-before": "END init_tokenizer_trainer",
"dedent": 8}
</literalinclude>
This tokenizer is based on the [`~tokenizers.models.Unigram`] model. It
takes care of normalizing the input using the NFKC Unicode normalization
method, and uses a [`~tokenizers.pre_tokenizers.ByteLevel`] pre-tokenizer with the corresponding decoder.
For more information on the components used here, you can check
[here](components.html)
## The most basic way
As you probably guessed already, the easiest way to train our tokenizer
is by using a `List`{.interpreted-text role="obj"}:
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_basic",
"end-before": "END train_basic",
"dedent": 8}
</literalinclude>
Easy, right? You can use anything working as an iterator here, be it a
`List`{.interpreted-text role="obj"}, `Tuple`{.interpreted-text
role="obj"}, or a `np.Array`{.interpreted-text role="obj"}. Anything
works as long as it provides strings.
## Using the 🤗 Datasets library
An awesome way to access one of the many datasets that exist out there
is by using the 🤗 Datasets library. For more information about it, you
should check [the official documentation
here](https://huggingface.co/docs/datasets/).
Let's start by loading our dataset:
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START load_dataset",
"end-before": "END load_dataset",
"dedent": 8}
</literalinclude>
The next step is to build an iterator over this dataset. The easiest way
to do this is probably by using a generator:
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START def_batch_iterator",
"end-before": "END def_batch_iterator",
"dedent": 8}
</literalinclude>
As you can see here, for improved efficiency we can actually provide a
batch of examples used to train, instead of iterating over them one by
one. By doing so, we can expect performances very similar to those we
got while training directly from files.
With our iterator ready, we just need to launch the training. In order
to improve the look of our progress bars, we can specify the total
length of the dataset:
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_datasets",
"end-before": "END train_datasets",
"dedent": 8}
</literalinclude>
And that's it!
## Using gzip files
Since gzip files in Python can be used as iterators, it is extremely
simple to train on such files:
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START single_gzip",
"end-before": "END single_gzip",
"dedent": 8}
</literalinclude>
Now if we wanted to train from multiple gzip files, it wouldn't be much
harder:
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START multi_gzip",
"end-before": "END multi_gzip",
"dedent": 8}
</literalinclude>
And voilà!