Init new docs
docs/source-doc-builder/_toctree.yml
@@ -0,0 +1,40 @@
- sections:
  - local: index
    title: 🤗 Tokenizers
  - local: quicktour
    title: Quicktour
  - local: installation
    title: Installation
  - local: pipeline
    title: The tokenization pipeline
  - local: components
    title: Components
  - local: training_from_memory
    title: Training from memory
  title: Getting started
- sections:
  - local: api/input-sequences
    title: Input Sequences
  - local: api/encode-inputs
    title: Encode Inputs
  - local: api/tokenizer
    title: Tokenizer
  - local: api/encoding
    title: Encoding
  - local: api/added-tokens
    title: Added Tokens
  - local: api/models
    title: Models
  - local: api/normalizers
    title: Normalizers
  - local: api/pre-tokenizers
    title: Pre-tokenizers
  - local: api/post-processors
    title: Post-processors
  - local: api/trainers
    title: Trainers
  - local: api/decoders
    title: Decoders
  - local: api/visualizer
    title: Visualizer
  title: API
docs/source-doc-builder/api/added-tokens.mdx
@@ -0,0 +1,15 @@
# Added Tokens

<tokenizerslangcontent>
<python>
## AddedToken[[tokenizers.AddedToken]]

[[autodoc]] tokenizers.AddedToken
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/decoders.mdx
@@ -0,0 +1,31 @@
# Decoders

<tokenizerslangcontent>
<python>
## BPEDecoder[[tokenizers.decoders.BPEDecoder]]

[[autodoc]] tokenizers.decoders.BPEDecoder

## ByteLevel[[tokenizers.decoders.ByteLevel]]

[[autodoc]] tokenizers.decoders.ByteLevel

## CTC[[tokenizers.decoders.CTC]]

[[autodoc]] tokenizers.decoders.CTC

## Metaspace[[tokenizers.decoders.Metaspace]]

[[autodoc]] tokenizers.decoders.Metaspace

## WordPiece[[tokenizers.decoders.WordPiece]]

[[autodoc]] tokenizers.decoders.WordPiece
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/encode-inputs.mdx
@@ -0,0 +1,48 @@
# Encode Inputs

<tokenizerslangcontent>
<python>
These types represent all the different kinds of input that a [`~tokenizers.Tokenizer`] accepts
when using [`~tokenizers.Tokenizer.encode_batch`].

## TextEncodeInput[[[[tokenizers.TextEncodeInput]]]]

<code>tokenizers.TextEncodeInput</code>

Represents a textual input for encoding. Can be either:
- A single sequence: [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence)
- A pair of sequences:
  - A Tuple of [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence)
  - Or a List of [TextInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.TextInputSequence) of size 2

alias of `Union[str, Tuple[str, str], List[str]]`.

## PreTokenizedEncodeInput[[[[tokenizers.PreTokenizedEncodeInput]]]]

<code>tokenizers.PreTokenizedEncodeInput</code>

Represents a pre-tokenized input for encoding. Can be either:
- A single sequence: [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence)
- A pair of sequences:
  - A Tuple of [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence)
  - Or a List of [PreTokenizedInputSequence](/docs/tokenizers/api/input-sequences#tokenizers.PreTokenizedInputSequence) of size 2

alias of `Union[List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]`.

## EncodeInput[[[[tokenizers.EncodeInput]]]]

<code>tokenizers.EncodeInput</code>

Represents all the possible types of input for encoding. Can be:
- When `is_pretokenized=False`: [TextEncodeInput](#tokenizers.TextEncodeInput)
- When `is_pretokenized=True`: [PreTokenizedEncodeInput](#tokenizers.PreTokenizedEncodeInput)

alias of `Union[str, Tuple[str, str], List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]`.
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
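
To make these aliases concrete, here is a minimal Python sketch (an illustration only; the `tokenizer.json` path is a placeholder) of the shapes `encode_batch` accepts:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# TextEncodeInput: each item is either a single sequence or a pair of sequences.
tokenizer.encode_batch(["Hello there!", "How are you?"])   # List[str]
tokenizer.encode_batch([("Question?", "Answer.")])         # a pair as Tuple[str, str]

# PreTokenizedEncodeInput: the same shapes, but each sequence is already
# split into "words", so is_pretokenized=True must be set.
tokenizer.encode_batch(
    [["Hello", "there", "!"], ["How", "are", "you", "?"]],
    is_pretokenized=True,
)
```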
docs/source-doc-builder/api/encoding.mdx
@@ -0,0 +1,15 @@
# Encoding

<tokenizerslangcontent>
<python>
## Encoding[[tokenizers.Encoding]]

[[autodoc]] tokenizers.Encoding
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/input-sequences.mdx
@@ -0,0 +1,41 @@
# Input Sequences

<tokenizerslangcontent>
<python>
These types represent all the different kinds of sequence that can be used as input of a Tokenizer.
Globally, any sequence can be either a string or a list of strings, according to the operating
mode of the tokenizer: `raw text` vs `pre-tokenized`.

## TextInputSequence[[tokenizers.TextInputSequence]]

<code>tokenizers.TextInputSequence</code>

A `str` that represents an input sequence

## PreTokenizedInputSequence[[tokenizers.PreTokenizedInputSequence]]

<code>tokenizers.PreTokenizedInputSequence</code>

A pre-tokenized input sequence. Can be one of:
- A `List` of `str`
- A `Tuple` of `str`

alias of `Union[List[str], Tuple[str]]`.

## InputSequence[[tokenizers.InputSequence]]

<code>tokenizers.InputSequence</code>

Represents all the possible types of input sequences for encoding. Can be:
- When `is_pretokenized=False`: [TextInputSequence](#tokenizers.TextInputSequence)
- When `is_pretokenized=True`: [PreTokenizedInputSequence](#tokenizers.PreTokenizedInputSequence)

alias of `Union[str, List[str], Tuple[str]]`.
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
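
As a minimal illustration (the tokenizer file below is a placeholder), the same sentence can be passed either as raw text or in pre-tokenized form:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# TextInputSequence: a plain string, the tokenizer runs its own pre-tokenization.
encoding = tokenizer.encode("Hello, y'all!")

# PreTokenizedInputSequence: the "words" were already split by your own pipeline.
encoding = tokenizer.encode(["Hello", ",", "y'all", "!"], is_pretokenized=True)

print(encoding.tokens)
```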
docs/source-doc-builder/api/models.mdx
@@ -0,0 +1,31 @@
# Models

<tokenizerslangcontent>
<python>
## BPE[[tokenizers.models.BPE]]

[[autodoc]] tokenizers.models.BPE

## Model[[tokenizers.models.Model]]

[[autodoc]] tokenizers.models.Model

## Unigram[[tokenizers.models.Unigram]]

[[autodoc]] tokenizers.models.Unigram

## WordLevel[[tokenizers.models.WordLevel]]

[[autodoc]] tokenizers.models.WordLevel

## WordPiece[[tokenizers.models.WordPiece]]

[[autodoc]] tokenizers.models.WordPiece
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/normalizers.mdx
@@ -0,0 +1,63 @@
# Normalizers

<tokenizerslangcontent>
<python>
## BertNormalizer[[tokenizers.normalizers.BertNormalizer]]

[[autodoc]] tokenizers.normalizers.BertNormalizer

## Lowercase[[tokenizers.normalizers.Lowercase]]

[[autodoc]] tokenizers.normalizers.Lowercase

## NFC[[tokenizers.normalizers.NFC]]

[[autodoc]] tokenizers.normalizers.NFC

## NFD[[tokenizers.normalizers.NFD]]

[[autodoc]] tokenizers.normalizers.NFD

## NFKC[[tokenizers.normalizers.NFKC]]

[[autodoc]] tokenizers.normalizers.NFKC

## NFKD[[tokenizers.normalizers.NFKD]]

[[autodoc]] tokenizers.normalizers.NFKD

## Nmt[[tokenizers.normalizers.Nmt]]

[[autodoc]] tokenizers.normalizers.Nmt

## Normalizer[[tokenizers.normalizers.Normalizer]]

[[autodoc]] tokenizers.normalizers.Normalizer

## Precompiled[[tokenizers.normalizers.Precompiled]]

[[autodoc]] tokenizers.normalizers.Precompiled

## Replace[[tokenizers.normalizers.Replace]]

[[autodoc]] tokenizers.normalizers.Replace

## Sequence[[tokenizers.normalizers.Sequence]]

[[autodoc]] tokenizers.normalizers.Sequence

## Strip[[tokenizers.normalizers.Strip]]

[[autodoc]] tokenizers.normalizers.Strip

## StripAccents[[tokenizers.normalizers.StripAccents]]

[[autodoc]] tokenizers.normalizers.StripAccents
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/post-processors.mdx
@@ -0,0 +1,27 @@
# Post-processors

<tokenizerslangcontent>
<python>
## BertProcessing[[tokenizers.processors.BertProcessing]]

[[autodoc]] tokenizers.processors.BertProcessing

## ByteLevel[[tokenizers.processors.ByteLevel]]

[[autodoc]] tokenizers.processors.ByteLevel

## RobertaProcessing[[tokenizers.processors.RobertaProcessing]]

[[autodoc]] tokenizers.processors.RobertaProcessing

## TemplateProcessing[[tokenizers.processors.TemplateProcessing]]

[[autodoc]] tokenizers.processors.TemplateProcessing
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/pre-tokenizers.mdx
@@ -0,0 +1,59 @@
# Pre-tokenizers

<tokenizerslangcontent>
<python>
## BertPreTokenizer[[tokenizers.pre_tokenizers.BertPreTokenizer]]

[[autodoc]] tokenizers.pre_tokenizers.BertPreTokenizer

## ByteLevel[[tokenizers.pre_tokenizers.ByteLevel]]

[[autodoc]] tokenizers.pre_tokenizers.ByteLevel

## CharDelimiterSplit[[tokenizers.pre_tokenizers.CharDelimiterSplit]]

[[autodoc]] tokenizers.pre_tokenizers.CharDelimiterSplit

## Digits[[tokenizers.pre_tokenizers.Digits]]

[[autodoc]] tokenizers.pre_tokenizers.Digits

## Metaspace[[tokenizers.pre_tokenizers.Metaspace]]

[[autodoc]] tokenizers.pre_tokenizers.Metaspace

## PreTokenizer[[tokenizers.pre_tokenizers.PreTokenizer]]

[[autodoc]] tokenizers.pre_tokenizers.PreTokenizer

## Punctuation[[tokenizers.pre_tokenizers.Punctuation]]

[[autodoc]] tokenizers.pre_tokenizers.Punctuation

## Sequence[[tokenizers.pre_tokenizers.Sequence]]

[[autodoc]] tokenizers.pre_tokenizers.Sequence

## Split[[tokenizers.pre_tokenizers.Split]]

[[autodoc]] tokenizers.pre_tokenizers.Split

## UnicodeScripts[[tokenizers.pre_tokenizers.UnicodeScripts]]

[[autodoc]] tokenizers.pre_tokenizers.UnicodeScripts

## Whitespace[[tokenizers.pre_tokenizers.Whitespace]]

[[autodoc]] tokenizers.pre_tokenizers.Whitespace

## WhitespaceSplit[[tokenizers.pre_tokenizers.WhitespaceSplit]]

[[autodoc]] tokenizers.pre_tokenizers.WhitespaceSplit
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/tokenizer.mdx
@@ -0,0 +1,15 @@
# Tokenizer

<tokenizerslangcontent>
<python>
## Tokenizer[[tokenizers.Tokenizer]]

[[autodoc]] tokenizers.Tokenizer
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
docs/source-doc-builder/api/trainers.mdx
@@ -0,0 +1,27 @@
# Trainers

<tokenizerslangcontent>
<python>
## BpeTrainer[[tokenizers.trainers.BpeTrainer]]

[[autodoc]] tokenizers.trainers.BpeTrainer

## UnigramTrainer[[tokenizers.trainers.UnigramTrainer]]

[[autodoc]] tokenizers.trainers.UnigramTrainer

## WordLevelTrainer[[tokenizers.trainers.WordLevelTrainer]]

[[autodoc]] tokenizers.trainers.WordLevelTrainer

## WordPieceTrainer[[tokenizers.trainers.WordPieceTrainer]]

[[autodoc]] tokenizers.trainers.WordPieceTrainer
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
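
As a minimal sketch of how a trainer is used (the tiny in-memory corpus and the vocabulary size are stand-ins):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a small BPE tokenizer and train it from an in-memory corpus (stand-in data).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
corpus = ["A first sentence.", "Another sentence to learn merges from."]
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("A first sentence.").tokens)
```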
docs/source-doc-builder/api/visualizer.mdx
@@ -0,0 +1,20 @@
# Visualizer

<tokenizerslangcontent>
<python>
## Annotation[[tokenizers.tools.Annotation]]

[[autodoc]] tokenizers.tools.Annotation

## EncodingVisualizer[[tokenizers.tools.EncodingVisualizer]]

[[autodoc]] tokenizers.tools.EncodingVisualizer
	- __call__
</python>
<rust>
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
</rust>
<node>
The node API has not been documented yet.
</node>
</tokenizerslangcontent>
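
A minimal usage sketch, assuming a trained tokenizer saved as `tokenizer.json` (a placeholder path) and a Jupyter notebook to render the HTML output:

```python
from tokenizers import Tokenizer
from tokenizers.tools import Annotation, EncodingVisualizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Highlight a span of the raw text with a label while inspecting the tokenization.
annotations = [Annotation(start=0, end=5, label="greeting")]

visualizer = EncodingVisualizer(tokenizer)
visualizer("Hello, how are you?", annotations=annotations)  # renders HTML in a notebook
```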
docs/source-doc-builder/components.mdx
@@ -0,0 +1,152 @@
# Components

When building a Tokenizer, you can attach various types of components to
this Tokenizer in order to customize its behavior. This page lists most
provided components.

## Normalizers

A `Normalizer` is in charge of pre-processing the input string in order
to normalize it as relevant for a given use case. Some common examples
of normalization are the Unicode normalization algorithms (NFD, NFKD,
NFC & NFKC), lowercasing, etc. The specificity of `tokenizers` is that
we keep track of the alignment while normalizing. This is essential to
allow mapping from the generated tokens back to the input text.

The `Normalizer` is optional.

<tokenizerslangcontent>
<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence([NFKC(), Lowercase()])` |
</python>
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>clean_text</li> <li>handle_chinese_chars</li> <li>strip_accents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | `Sequence::new(vec![NFKC, Lowercase])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| NFD | NFD unicode normalization | |
| NFKD | NFKD unicode normalization | |
| NFC | NFC unicode normalization | |
| NFKC | NFKC unicode normalization | |
| Lowercase | Replaces all uppercase with lowercase | Input: `HELLO ὈΔΥΣΣΕΎΣ` <br> Output: `hello ὀδυσσεύς` |
| Strip | Removes all whitespace characters on the specified sides (left, right or both) of the input | Input: `" hi "` <br> Output: `"hi"` |
| StripAccents | Removes all accent symbols in unicode (to be used with NFD for consistency) | Input: `é` <br> Output: `e` |
| Replace | Replaces a custom string or regexp with the given content | `Replace("a", "e")` will behave like this: <br> Input: `"banana"` <br> Output: `"benene"` |
| BertNormalizer | Provides an implementation of the Normalizer used in the original BERT. Options that can be set are: <ul> <li>cleanText</li> <li>handleChineseChars</li> <li>stripAccents</li> <li>lowercase</li> </ul> | |
| Sequence | Composes multiple normalizers that will run in the provided order | |
</node>
</tokenizerslangcontent>
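
To make the table above concrete, here is a minimal Python sketch composing a few of these normalizers (illustrative only):

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

# Normalizers in a Sequence run in the order they are given.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

print(normalizer.normalize_str("Héllo hôw are ü?"))
# "hello how are u?"
```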

## Pre-tokenizers

The `PreTokenizer` takes care of splitting the input according to a set
of rules. This pre-processing lets you ensure that the underlying
`Model` does not build tokens across multiple "splits". For example, if
you don't want to have whitespace inside a token, then you can have a
`PreTokenizer` that splits on whitespace.

You can easily combine multiple `PreTokenizer`s together using a
`Sequence` (see below). The `PreTokenizer` is also allowed to modify the
string, just like a `Normalizer` does. This is necessary to allow some
complicated algorithms that require splitting before normalizing (e.g.
the ByteLevel).

<tokenizerslangcontent>
<python>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>merged_with_previous</li><li>merged_with_next</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | `Sequence([Punctuation(), WhitespaceSplit()])` |
</python>
<rust>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>Removed</li><li>Isolated</li><li>MergedWithPrevious</li><li>MergedWithNext</li><li>Contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | `Sequence::new(vec![Punctuation, WhitespaceSplit])` |
</rust>
<node>
| Name | Description | Example |
| :--- | :--- | :--- |
| ByteLevel | Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties: <ul> <li>Since it maps on bytes, a tokenizer using this only requires **256** characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.</li> <li>A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)</li> <li>For non-ASCII characters, it gets completely unreadable, but it works nonetheless!</li> </ul> | Input: `"Hello my friend, how are you?"` <br> Output: `"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"` |
| Whitespace | Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`) | Input: `"Hello there!"` <br> Output: `"Hello", "there", "!"` |
| WhitespaceSplit | Splits on any whitespace character | Input: `"Hello there!"` <br> Output: `"Hello", "there!"` |
| Punctuation | Will isolate all punctuation characters | Input: `"Hello?"` <br> Output: `"Hello", "?"` |
| Metaspace | Splits on whitespaces and replaces them with a special char “▁” (U+2581) | Input: `"Hello there"` <br> Output: `"Hello", "▁there"` |
| CharDelimiterSplit | Splits on a given character | Example with `x`: <br> Input: `"Helloxthere"` <br> Output: `"Hello", "there"` |
| Digits | Splits the numbers from any other characters. | Input: `"Hello123there"` <br> Output: ``"Hello", "123", "there"`` |
| Split | Versatile pre-tokenizer that splits on the provided pattern and according to the provided behavior. The pattern can be inverted if necessary. <ul> <li>pattern should be either a custom string or regexp.</li> <li>behavior should be one of: <ul><li>removed</li><li>isolated</li><li>mergedWithPrevious</li><li>mergedWithNext</li><li>contiguous</li></ul></li> <li>invert should be a boolean flag.</li> </ul> | Example with pattern = ` `, behavior = `"isolated"`, invert = `False`: <br> Input: `"Hello, how are you?"` <br> Output: `"Hello,", " ", "how", " ", "are", " ", "you?"` |
| Sequence | Lets you compose multiple `PreTokenizer` that will be run in the given order | |
</node>
</tokenizerslangcontent>
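
As an illustrative Python sketch, `pre_tokenize_str` shows the resulting splits and their offsets when composing two of the pre-tokenizers above:

```python
from tokenizers.pre_tokenizers import Digits, Sequence, Whitespace

# Split on word boundaries first, then isolate each digit individually.
pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=True)])

print(pre_tokenizer.pre_tokenize_str("Call 911!"))
# [('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8)), ('!', (8, 9))]
```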

## Models

Models are the core algorithms used to actually tokenize, and therefore,
they are the only mandatory component of a Tokenizer.

| Name | Description |
| :--- | :--- |
| WordLevel | This is the “classic” tokenization algorithm. It lets you simply map words to IDs without anything fancy. This has the advantage of being really simple to use and understand, but it requires extremely large vocabularies for a good coverage. Using this `Model` requires the use of a `PreTokenizer`. No choice will be made by this model directly, it simply maps input tokens to IDs. |
| BPE | One of the most popular subword tokenization algorithms. Byte-Pair-Encoding works by starting with characters, while merging those that are the most frequently seen together, thus creating new tokens. It then works iteratively to build new tokens out of the most frequent pairs it sees in a corpus. BPE is able to build words it has never seen by using multiple subword tokens, and thus requires smaller vocabularies, with fewer chances of having “unk” (unknown) tokens. |
| WordPiece | This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It uses a greedy algorithm that tries to build long words first, splitting into multiple tokens when entire words don’t exist in the vocabulary. This is different from BPE, which starts from characters and builds tokens as big as possible. It uses the famous `##` prefix to identify tokens that are part of a word (i.e. not starting a word). |
| Unigram | Unigram is also a subword tokenization algorithm, and works by trying to identify the best set of subword tokens to maximize the probability for a given sentence. This is different from BPE in the way that this is not deterministic based on a set of rules applied sequentially. Instead Unigram will be able to compute multiple ways of tokenizing, while choosing the most probable one. |
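
A minimal Python sketch of attaching a model to a `Tokenizer` (the `[UNK]` token here is a conventional choice, not a requirement):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece

# The model is the only mandatory component: it is passed when creating the Tokenizer.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Swapping in another algorithm only changes this one line, e.g. for BPE:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```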

## Post-Processors

After the whole pipeline, we sometimes want to insert some special
tokens before feeding a tokenized string into a model, like "[CLS] My
horse is amazing [SEP]". The `PostProcessor` is the component doing
just that.

| Name | Description | Example |
| :--- | :--- | :--- |
| TemplateProcessing | Lets you easily template the post processing, adding special tokens and specifying the `type_id` for each sequence/special token. The template is given two strings representing the single sequence and the pair of sequences, as well as a set of special tokens to use. | Example, when specifying a template with these values:<br> <ul> <li> single: `"[CLS] $A [SEP]"` </li> <li> pair: `"[CLS] $A [SEP] $B [SEP]"` </li> <li> special tokens: <ul> <li>`"[CLS]"`</li> <li>`"[SEP]"`</li> </ul> </li> </ul> <br> Input: `("I like this", "but not this")` <br> Output: `"[CLS] I like this [SEP] but not this [SEP]"` |
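
A minimal Python sketch of the template described in the example column (the special-token IDs 1 and 2 are placeholders for the real vocabulary IDs):

```python
from tokenizers.processors import TemplateProcessing

post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # (token, id) pairs; the ids are placeholders
)

# Attach it to an existing tokenizer:
# tokenizer.post_processor = post_processor
```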

## Decoders

The Decoder knows how to go from the IDs used by the Tokenizer, back to
a readable piece of text. Some `Normalizer` and `PreTokenizer` use
special characters or identifiers that need to be reverted, for example.

| Name | Description |
| :--- | :--- |
| ByteLevel | Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using a set of visible Unicode characters to represent each byte, so we need a Decoder to revert this process and get something readable again. |
| Metaspace | Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier `▁` to identify whitespaces, and so this Decoder helps with decoding these. |
| WordPiece | Reverts the WordPiece Model. This model uses a special identifier `##` for continuing subwords, and so this Decoder helps with decoding these. |
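
As an illustrative Python sketch (the file name and IDs are placeholders), a decoder is attached to the tokenizer and used when calling `decode`:

```python
from tokenizers import Tokenizer, decoders

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Revert the WordPiece "##" continuation markers when turning IDs back into text.
tokenizer.decoder = decoders.WordPiece()

print(tokenizer.decode([1, 15, 27, 2]))  # placeholder IDs
```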
docs/source-doc-builder/index.mdx
@@ -0,0 +1,19 @@
<!-- DISABLE-FRONTMATTER-SECTIONS -->

# Tokenizers

Fast State-of-the-art tokenizers, optimized for both research and
production

[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an
implementation of today's most used tokenizers, with a focus on
performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).

# Main features:

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
- Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.
docs/source-doc-builder/installation.mdx
@@ -0,0 +1,89 @@
# Installation

<tokenizerslangcontent>
<python>
🤗 Tokenizers is tested on Python 3.5+.

You should install 🤗 Tokenizers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
unfamiliar with Python virtual environments, check out the [user
guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
Create a virtual environment with the version of Python you're going to
use and activate it.

## Installation with pip

🤗 Tokenizers can be installed using pip as follows:

```bash
pip install tokenizers
```

## Installation from sources

To use this method, you need to have the Rust language installed. You
can follow [the official
guide](https://www.rust-lang.org/learn/get-started) for more
information.

If you are using a Unix-based OS, the installation should be as simple
as running:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Or you can easily update an existing installation with the following command:

```bash
rustup update
```

Once Rust is installed, we can start retrieving the sources for 🤗
Tokenizers:

```bash
git clone https://github.com/huggingface/tokenizers
```

Then we go into the python bindings folder:

```bash
cd tokenizers/bindings/python
```

At this point you should have your virtual environment already
activated. In order to compile 🤗 Tokenizers, you need to install the
Python package `setuptools_rust`:

```bash
pip install setuptools_rust
```

Then you can have 🤗 Tokenizers compiled and installed in your virtual
environment with the following command:

```bash
python setup.py install
```
</python>
<rust>
## Crates.io

🤗 Tokenizers is available on [crates.io](https://crates.io/crates/tokenizers).

You just need to add it to your `Cargo.toml`:

```toml
tokenizers = "0.10"
```
</rust>
<node>
## Installation with npm

You can simply install 🤗 Tokenizers with npm using:

```bash
npm install tokenizers
```
</node>
</tokenizerslangcontent>
docs/source-doc-builder/pipeline.mdx
@@ -0,0 +1,623 @@
# The tokenization pipeline

When calling `Tokenizer.encode` or
`Tokenizer.encode_batch`, the input
text(s) go through the following pipeline:

- `normalization`
- `pre-tokenization`
- `model`
- `post-processing`

We'll see in detail what happens during each of those steps,
as well as when you want to decode some token ids, and how the 🤗 Tokenizers library allows you
to customize each of those steps to your needs. If you're already
familiar with those steps and want to learn by seeing some code, jump to
the BERT from scratch example at the end of this page.

For the examples that require a `Tokenizer`, we will use the tokenizer we trained in the
`quicktour`, which you can load with:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_reload_tokenizer",
"end-before": "END pipeline_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

## Normalization

Normalization is, in a nutshell, a set of operations you apply to a raw
string to make it less random or "cleaner". Common operations include
stripping whitespace, removing accented characters or lowercasing all
text. If you're familiar with [Unicode
normalization](https://unicode.org/reports/tr15), it is also a very
common normalization operation applied in most tokenizers.

Each normalization operation is represented in the 🤗 Tokenizers library
by a `Normalizer`, and you can combine
several of those by using a `normalizers.Sequence`. Here is a normalizer applying NFD Unicode normalization
and removing accents as an example:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_normalizer",
"end-before": "END pipeline_setup_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_normalizer",
"end-before": "END setup_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

You can manually test that normalizer by applying it to any string:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_normalizer",
"end-before": "END pipeline_test_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_normalizer",
"end-before": "END test_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
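
In sketch form (a paraphrase for convenience, not the verbatim included test file), this normalizer and the manual test look roughly like:

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

# NFD unicode normalization followed by accent stripping, as described above.
normalizer = normalizers.Sequence([NFD(), StripAccents()])

# Manually testing it on a raw string:
print(normalizer.normalize_str("Héllo hôw are ü?"))
# "Hello how are u?"
```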

When building a `Tokenizer`, you can
customize its normalizer by just changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_normalizer",
"end-before": "END pipeline_replace_normalizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_normalizer",
"end-before": "END replace_normalizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Of course, if you change the way a tokenizer applies normalization, you
should probably retrain it from scratch afterward.

## Pre-Tokenization

Pre-tokenization is the act of splitting a text into smaller objects
that give an upper bound to what your tokens will be at the end of
training. A good way to think of this is that the pre-tokenizer will
split your text into "words" and then, your final tokens will be parts
of those words.

An easy way to pre-tokenize inputs is to split on spaces and
punctuation, which is done by the
`pre_tokenizers.Whitespace`
pre-tokenizer:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_pre_tokenizer",
"end-before": "END pipeline_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_pre_tokenizer",
"end-before": "END setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
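
In sketch form, this is roughly what the included snippet does:

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you."))
# [('Hello', (0, 5)), ('!', (5, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you', (15, 18)),
#  ('?', (18, 19)), ('I', (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ('fine', (24, 28)),
#  (',', (28, 29)), ('thank', (30, 35)), ('you', (36, 39)), ('.', (39, 40))]
```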

The output is a list of tuples, with each tuple containing one word and
its span in the original sentence (which is used to determine the final
`offsets` of our `Encoding`). Note that splitting on
punctuation will split contractions like `"I'm"` in this example.

You can combine any `PreTokenizer`s together. For instance, here is a pre-tokenizer that will
split on space, punctuation and digits, separating numbers into their
individual digits:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_combine_pre_tokenizer",
"end-before": "END pipeline_combine_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START combine_pre_tokenizer",
"end-before": "END combine_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

As we saw in the `quicktour`, you can
customize the pre-tokenizer of a `Tokenizer` by just changing the corresponding attribute:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_replace_pre_tokenizer",
"end-before": "END pipeline_replace_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START replace_pre_tokenizer",
"end-before": "END replace_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Of course, if you change the pre-tokenizer, you should probably
retrain your tokenizer from scratch afterward.

## Model

Once the input texts are normalized and pre-tokenized, the
`Tokenizer` applies the model on the
pre-tokens. This is the part of the pipeline that needs training on your
corpus (or that has been trained if you are using a pretrained
tokenizer).

The role of the model is to split your "words" into tokens, using the
rules it has learned. It's also responsible for mapping those tokens to
their corresponding IDs in the vocabulary of the model.

This model is passed along when initializing the
`Tokenizer` so you already know how to
customize this part. Currently, the 🤗 Tokenizers library supports:

- `models.BPE`
- `models.Unigram`
- `models.WordLevel`
- `models.WordPiece`

For more details about each model and its behavior, you can check
[here](components.html#models)

## Post-Processing

Post-processing is the last step of the tokenization pipeline, to
perform any additional transformation to the
`Encoding` before it's returned, like
adding potential special tokens.

As we saw in the quick tour, we can customize the post processor of a
`Tokenizer` by setting the
corresponding attribute. For instance, here is how we can post-process
to make the inputs suitable for the BERT model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_setup_processor",
"end-before": "END pipeline_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START setup_processor",
"end-before": "END setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
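
As an added illustration of the effect (assuming `tokenizer` is the one configured above, with the BERT-style template attached), encoding a pair now places the special tokens where the template says:

```python
# `tokenizer` is assumed to be set up with the BERT-style TemplateProcessing above.
output = tokenizer.encode("Hello, y'all!", "How are you?")
print(output.tokens)
# Roughly: ['[CLS]', ..., '[SEP]', ..., '[SEP]'] — the exact pieces depend on the trained vocabulary.
```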
|
||||||
|
|
||||||
|
Note that contrarily to the pre-tokenizer or the normalizer, you don't
|
||||||
|
need to retrain a tokenizer after changing its post-processor.
|
||||||
|
|
||||||
|
## All together: a BERT tokenizer from scratch
|
||||||
|
|
||||||
|
Let's put all those pieces together to build a BERT tokenizer. First,
|
||||||
|
BERT relies on WordPiece, so we instantiate a new
|
||||||
|
`Tokenizer` with this model:
|
||||||
|
|
||||||
|
<tokenizerslangcontent>
|
||||||
|
<python>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
|
||||||
|
"language": "python",
|
||||||
|
"start-after": "START bert_setup_tokenizer",
|
||||||
|
"end-before": "END bert_setup_tokenizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</python>
|
||||||
|
<rust>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../tokenizers/tests/documentation.rs",
|
||||||
|
"language": "rust",
|
||||||
|
"start-after": "START bert_setup_tokenizer",
|
||||||
|
"end-before": "END bert_setup_tokenizer",
|
||||||
|
"dedent": 4}
|
||||||
|
</literalinclude>
|
||||||
|
</rust>
|
||||||
|
<node>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
|
||||||
|
"language": "js",
|
||||||
|
"start-after": "START bert_setup_tokenizer",
|
||||||
|
"end-before": "END bert_setup_tokenizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</node>
|
||||||
|
</tokenizerslangcontent>
|
||||||
|
|
||||||
|
Then we know that BERT preprocesses texts by removing accents and
|
||||||
|
lowercasing. We also use a unicode normalizer:
|
||||||
|
|
||||||
|
<tokenizerslangcontent>
|
||||||
|
<python>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
|
||||||
|
"language": "python",
|
||||||
|
"start-after": "START bert_setup_normalizer",
|
||||||
|
"end-before": "END bert_setup_normalizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</python>
|
||||||
|
<rust>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../tokenizers/tests/documentation.rs",
|
||||||
|
"language": "rust",
|
||||||
|
"start-after": "START bert_setup_normalizer",
|
||||||
|
"end-before": "END bert_setup_normalizer",
|
||||||
|
"dedent": 4}
|
||||||
|
</literalinclude>
|
||||||
|
</rust>
|
||||||
|
<node>
|
||||||
|
<literalinclude>
|
||||||
|
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
|
||||||
|
"language": "js",
|
||||||
|
"start-after": "START bert_setup_normalizer",
|
||||||
|
"end-before": "END bert_setup_normalizer",
|
||||||
|
"dedent": 8}
|
||||||
|
</literalinclude>
|
||||||
|
</node>
|
||||||
|
</tokenizerslangcontent>

The pre-tokenizer is just splitting on whitespace and punctuation:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_pre_tokenizer",
"end-before": "END bert_setup_pre_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
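
In Python this is a one-liner; `Whitespace` splits on whitespace and punctuation, which matches the behaviour described here:

```python
from tokenizers.pre_tokenizers import Whitespace

bert_tokenizer.pre_tokenizer = Whitespace()
```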

And the post-processing uses the template we saw in the previous
section:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_setup_processor",
"end-before": "END bert_setup_processor",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can use this tokenizer and train it on wikitext like in the
`quicktour`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_train_tokenizer",
"end-before": "END bert_train_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
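
A hedged Python sketch of the training step; the trainer arguments and file paths below are illustrative assumptions, not values taken from this excerpt:

```python
from tokenizers.trainers import WordPieceTrainer

# Special tokens must be declared at training time so they end up in the vocabulary.
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("data/bert-wiki.json")
```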

## Decoding

On top of encoding the input texts, a `Tokenizer` also has an API for decoding, that is, converting IDs
generated by your model back to a text. This is done by the methods
`Tokenizer.decode` (for one predicted text) and `Tokenizer.decode_batch` (for a batch of predictions).

The `decoder` will first convert the IDs back to tokens
(using the tokenizer's vocabulary) and remove all special tokens, then
join those tokens with spaces:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START pipeline_test_decoding",
"end-before": "END pipeline_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START test_decoding",
"end-before": "END test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

If you used a model that added special characters to represent subtokens
of a given "word" (like the `"##"` in
WordPiece), you will need to customize the `decoder` to treat
them properly. If we take our previous `bert_tokenizer` for instance, the
default decoding will give:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_test_decoding",
"end-before": "END bert_test_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

But by changing it to a proper decoder, we get:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_pipeline.py",
"language": "python",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/pipeline.test.ts",
"language": "js",
"start-after": "START bert_proper_decoding",
"end-before": "END bert_proper_decoding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
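
A minimal Python sketch of that change, assuming the `bert_tokenizer` built above (the sample sentence and printed result are illustrative):

```python
from tokenizers import decoders

# The WordPiece decoder merges "##"-prefixed subwords back into full words.
bert_tokenizer.decoder = decoders.WordPiece()

output = bert_tokenizer.encode("Welcome to the tokenizers library.")
print(bert_tokenizer.decode(output.ids))
# e.g. "welcome to the tokenizers library."
```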
838
docs/source-doc-builder/quicktour.mdx
Normal file
@ -0,0 +1,838 @@
# Quicktour

Let's have a quick look at the 🤗 Tokenizers library features. The
library provides an implementation of today's most used tokenizers that
is both easy to use and blazing fast.

## Build a tokenizer from scratch

To illustrate how fast the 🤗 Tokenizers library is, let's train a new
tokenizer on [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
(516M of text) in just a few seconds. First things first, you will need
to download this dataset and unzip it with:

``` bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```

### Training the tokenizer

In this tour, we will build and train a Byte-Pair Encoding (BPE)
tokenizer. For more information about the different types of tokenizers,
check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
the 🤗 Transformers documentation. Here, training the tokenizer means it
will learn merge rules by:

- Starting with all the characters present in the training corpus as
tokens.
- Identifying the most common pair of tokens and merging it into one token.
- Repeating until the vocabulary (i.e., the number of tokens) has reached
the size we want.

The main API of the library is the `Tokenizer` class; here is how
we instantiate one with a BPE model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_tokenizer",
"end-before": "END quicktour_init_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
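
In Python, the included snippet corresponds to something like this sketch (the `unk_token` is the conventional choice, assumed here):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```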

To train our tokenizer on the wikitext files, we will need to
instantiate a `trainer`, in this case a
`BpeTrainer`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_trainer",
"end-before": "END quicktour_init_trainer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can set the training arguments like `vocab_size` or `min_frequency` (here
left at their default values of 30,000 and 0), but the most important
part is to give the `special_tokens` we
plan to use later on (they are not used at all during training) so that
they get inserted in the vocabulary.

<Tip>

The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
`"[CLS]"` will get the ID 1 and so forth.

</Tip>
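
A sketch of what such a trainer can look like in Python; the exact token list mirrors the order discussed in the tip above and is an assumption rather than something shown in this excerpt:

```python
from tokenizers.trainers import BpeTrainer

# "[UNK]" gets ID 0, "[CLS]" gets ID 1, and so on.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
```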

We could train our tokenizer right now, but it wouldn't be optimal.
Without a pre-tokenizer that will split our inputs into words, we might
get tokens that overlap several words: for instance we could get an
`"it is"` token since those two words
often appear next to each other. Using a pre-tokenizer will ensure no
token is bigger than a word returned by the pre-tokenizer. Here we want
to train a subword BPE tokenizer, and we will use the easiest
pre-tokenizer possible by splitting on whitespace.

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_pretok",
"end-before": "END quicktour_init_pretok",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Now, we can just call the `Tokenizer.train` method with any list of files we want to use:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_train",
"end-before": "END quicktour_train",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

This should only take a few seconds to train our tokenizer on the full
wikitext dataset! To save the tokenizer in one file that contains all
its configuration and vocabulary, just use the
`Tokenizer.save` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_save",
"end-before": "END quicktour_save",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

and you can reload your tokenizer from that file with the
`Tokenizer.from_file`
classmethod:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 12}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_reload_tokenizer",
"end-before": "END quicktour_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
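
Putting these steps together in Python might look like the following sketch, assuming the `tokenizer` and `trainer` from the snippets above (file names are assumptions):

```python
from tokenizers import Tokenizer

# Train on the unzipped wikitext-103 files, then persist everything to a single JSON file.
files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
tokenizer.save("tokenizer-wiki.json")

# Later, reload the exact same tokenizer (model, normalizer, pre-tokenizer, ...) in one call.
tokenizer = Tokenizer.from_file("tokenizer-wiki.json")
```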

### Using the tokenizer

Now that we have trained a tokenizer, we can use it on any text we want
with the `Tokenizer.encode` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode",
"end-before": "END quicktour_encode",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

This applies the full pipeline of the tokenizer to the text, returning
an `Encoding` object. To learn more
about this pipeline, and how to apply (or customize) parts of it, check out [this page](pipeline.html).

This `Encoding` object then has all the
attributes you need for your deep learning model (or other). The
`tokens` attribute contains the
segmentation of your text into tokens:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_tokens",
"end-before": "END quicktour_print_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Similarly, the `ids` attribute will
contain the index of each of those tokens in the tokenizer's
vocabulary:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_ids",
"end-before": "END quicktour_print_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

An important feature of the 🤗 Tokenizers library is that it comes with
full alignment tracking, meaning you can always get the part of your
original sentence that corresponds to a given token. Those are stored in
the `offsets` attribute of our
`Encoding` object. For instance, let's
assume we want to find what caused the
`"[UNK]"` token to appear, which is the
token at index 9 in the list; we can just ask for the offset at that
index:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_offsets",
"end-before": "END quicktour_print_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

and those are the indices that correspond to the emoji in the original
sentence:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_use_offsets",
"end-before": "END quicktour_use_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
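
Here is a compact Python sketch of this whole subsection; the sample sentence and the printed tokens are illustrative assumptions rather than outputs taken from this excerpt:

```python
sentence = "Hello, y'all! How are you 😁 ?"
output = tokenizer.encode(sentence)

print(output.tokens)  # e.g. ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
print(output.ids)     # the matching indices in the tokenizer's vocabulary

# Offsets map each token back to a (start, end) span in the original string.
start, end = output.offsets[9]
print(sentence[start:end])  # the emoji that became "[UNK]"
```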

### Post-processing

We might want our tokenizer to automatically add special tokens, like
`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
`TemplateProcessing` is the most
commonly used; you just have to specify a template for the processing of
single sentences and pairs of sentences, along with the special tokens
and their IDs.

When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1
and 2 of our list of special tokens, so this should be their IDs. To
double-check, we can use the `Tokenizer.token_to_id` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_check_sep",
"end-before": "END quicktour_check_sep",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Here is how we can set the post-processing to give us the traditional
BERT inputs:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_template_processing",
"end-before": "END quicktour_init_template_processing",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Let's go over this snippet of code in more detail. First we specify
the template for single sentences: those should have the form
`"[CLS] $A [SEP]"` where
`$A` represents our sentence.

Then, we specify the template for sentence pairs, which should have the
form `"[CLS] $A [SEP] $B [SEP]"` where
`$A` represents the first sentence and
`$B` the second one. The
`:1` added in the template represents the `type IDs` we want for each part of our input: it defaults
to 0 for everything (which is why we don't have
`$A:0`) and here we set it to 1 for the
tokens of the second sentence and the last `"[SEP]"` token.

Lastly, we specify the special tokens we used and their IDs in our
tokenizer's vocabulary.
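
As a Python sketch of the snippet being described (the special-token IDs are looked up with `token_to_id` rather than hard-coded):

```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```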

To check that this worked properly, let's try to encode the same
sentence as before:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens",
"end-before": "END quicktour_print_special_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

To check the results on a pair of sentences, we just pass the two
sentences to `Tokenizer.encode`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens_pair",
"end-before": "END quicktour_print_special_tokens_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

You can then check that the type IDs attributed to each token are correct with:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_type_ids",
"end-before": "END quicktour_print_type_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along with it.

### Encoding multiple sentences in a batch

To get the full speed of the 🤗 Tokenizers library, it's best to
process your texts in batches, using the
`Tokenizer.encode_batch` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch",
"end-before": "END quicktour_encode_batch",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

The output is then a list of `Encoding`
objects like the ones we saw before. You can process together as many
texts as you like, as long as they fit in memory.

To process a batch of sentence pairs, pass two lists to the
`Tokenizer.encode_batch` method: the
list of sentences A and the list of sentences B:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch_pair",
"end-before": "END quicktour_encode_batch_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
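
A sketch of both calls in Python; the sentences are made up, and grouping each A sentence with its B sentence as a pair is one way the batched input can be laid out:

```python
# A plain batch of single sentences...
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])

# ...and a batch of sentence pairs, each element grouping a sentence A with its sentence B.
output = tokenizer.encode_batch(
    [("Hello, y'all!", "How are you 😁 ?"), ("Hello to you too!", "I'm fine, thank you!")]
)
```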

When encoding multiple sentences, you can automatically pad the outputs
to the longest sentence present by using
`Tokenizer.enable_padding`, with the
`pad_token` and its ID (which we can
double-check with
`Tokenizer.token_to_id` like before):

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_enable_padding",
"end-before": "END quicktour_enable_padding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can set the `direction` of the padding
(defaults to the right) or a given `length` if we want to pad every sample to that specific number (here
we leave it unset to pad to the size of the longest text).

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_batch_tokens",
"end-before": "END quicktour_print_batch_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

In this case, the `attention mask` generated by the
tokenizer takes the padding into account:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_attention_mask",
"end-before": "END quicktour_print_attention_mask",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
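
A hedged sketch of the padding workflow in Python; `"[PAD]"` and its ID are assumptions that should be double-checked with `token_to_id` as suggested above, and the printed values are illustrative:

```python
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# e.g. ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]', '[PAD]']
print(output[1].attention_mask)
# e.g. [1, 1, 1, 1, 1, 1, 1, 0] -- the 0 marks the padding token
```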

## Pretrained

<tokenizerslangcontent>
<python>
### Using a pretrained tokenizer

You can load any tokenizer from the Hugging Face Hub as long as a
`tokenizer.json` file is available in the repository.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```

### Importing a pretrained tokenizer from legacy vocabulary files

You can also import a pretrained tokenizer directly, as long as you
have its vocabulary file. For instance, here is how to import the
classic pretrained BERT tokenizer:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
```

as long as you have downloaded the file `bert-base-uncased-vocab.txt` with

```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
```
</python>
</tokenizerslangcontent>
116
docs/source-doc-builder/training_from_memory.mdx
Normal file
@ -0,0 +1,116 @@
# Training from memory

In the [Quicktour](quicktour.html), we saw how to build and train a
tokenizer using text files, but we can actually use any Python Iterator.
In this section we'll see a few different ways of training our
tokenizer.

For all the examples listed below, we'll use the same [`~tokenizers.Tokenizer`] and
[`~tokenizers.trainers.Trainer`], built as
follows:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START init_tokenizer_trainer",
"end-before": "END init_tokenizer_trainer",
"dedent": 8}
</literalinclude>

This tokenizer is based on the [`~tokenizers.models.Unigram`] model. It
takes care of normalizing the input using the NFKC Unicode normalization
method, and uses a [`~tokenizers.pre_tokenizers.ByteLevel`] pre-tokenizer with the corresponding decoder.
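
Since the included snippet isn't visible in this excerpt, here is a rough Python sketch of such a setup; the trainer arguments and special tokens are assumptions:

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.UnigramTrainer(
    vocab_size=20000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<PAD>", "<BOS>", "<EOS>"],
)
```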

For more information on the components used here, you can check
[here](components.html).

## The most basic way

As you probably guessed already, the easiest way to train our tokenizer
is by using a `List`:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_basic",
"end-before": "END train_basic",
"dedent": 8}
</literalinclude>

Easy, right? You can use anything working as an iterator here, be it a
`List`, a `Tuple`, or a `np.Array`. Anything
works as long as it provides strings.
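
For instance, a sketch of training straight from an in-memory list of strings (the sentences are just placeholders):

```python
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
]
tokenizer.train_from_iterator(data, trainer=trainer)
```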

## Using the 🤗 Datasets library

An awesome way to access one of the many datasets that exist out there
is by using the 🤗 Datasets library. For more information about it, you
should check [the official documentation
here](https://huggingface.co/docs/datasets/).

Let's start by loading our dataset:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START load_dataset",
"end-before": "END load_dataset",
"dedent": 8}
</literalinclude>

The next step is to build an iterator over this dataset. The easiest way
to do this is probably by using a generator:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START def_batch_iterator",
"end-before": "END def_batch_iterator",
"dedent": 8}
</literalinclude>

As you can see here, for improved efficiency we can actually provide a
batch of examples to train on, instead of iterating over them one by
one. By doing so, we can expect performance very similar to what we
got while training directly from files.

With our iterator ready, we just need to launch the training. In order
to improve the look of our progress bars, we can specify the total
length of the dataset:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_datasets",
"end-before": "END train_datasets",
"dedent": 8}
</literalinclude>
And that's it!
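
Put together, the workflow sketched in this subsection might look like the following (the dataset name, text column, and batch size are assumptions):

```python
import datasets

dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train+test+validation")

def batch_iterator(batch_size=1000):
    # Yield lists of raw texts instead of single examples for better throughput.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
```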

## Using gzip files

Since gzip files in Python can be used as iterators, it is extremely
simple to train on such files:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START single_gzip",
"end-before": "END single_gzip",
"dedent": 8}
</literalinclude>

Now if we wanted to train from multiple gzip files, it wouldn't be much
harder:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START multi_gzip",
"end-before": "END multi_gzip",
"dedent": 8}
</literalinclude>
And voilà!
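
For reference, a sketch of what those two snippets can look like (the file names are made up):

```python
import gzip

# Single gzip file: the open file object is itself an iterator over lines.
with gzip.open("data/my-file.0.gz", "rt") as f:
    tokenizer.train_from_iterator(f, trainer=trainer)

# Several gzip files: chain them into one generator of lines.
def gzip_lines(paths):
    for path in paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                yield line

files = ["data/my-file.0.gz", "data/my-file.1.gz", "data/my-file.2.gz"]
tokenizer.train_from_iterator(gzip_lines(files), trainer=trainer)
```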