Doc - Replace some entities in the quicktour

Anthony MOI
2020-10-16 16:01:38 -04:00
committed by Anthony MOI
parent f0b6a2127c
commit 41bf688a49
2 changed files with 74 additions and 10 deletions

View File

@@ -37,3 +37,34 @@ Main features:
    :caption: API Reference

    api/reference

.. entities:: python
    :global:

    class
        class
    classmethod
        class method
    Tokenizer
        :class:`~tokenizers.Tokenizer`
    Tokenizer.train
        :meth:`~tokenizers.Tokenizer.train`
    Tokenizer.save
        :meth:`~tokenizers.Tokenizer.save`
    Tokenizer.from_file
        :meth:`~tokenizers.Tokenizer.from_file`

.. entities:: rust
    :global:

    class
        struct
    classmethod
        static method
    Tokenizer
        `Tokenizer <https://docs.rs/tokenizers/latest/tokenizers/tokenizer/struct.Tokenizer.html>`__
    Tokenizer.train
        `train <https://docs.rs/tokenizers/0.10.1/tokenizers/tokenizer/struct.Tokenizer.html#method.train>`__

View File

@@ -24,6 +24,39 @@ with:

Training the tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. entities:: python

    BpeTrainer
        :class:`~tokenizers.trainers.BpeTrainer`
    vocab_size
        :obj:`vocab_size`
    min_frequency
        :obj:`min_frequency`
    special_tokens
        :obj:`special_tokens`

.. entities:: rust

    BpeTrainer
        `BpeTrainer <https://docs.rs/tokenizers/latest/tokenizers/models/bpe/struct.BpeTrainer.html>`__
    vocab_size
        :obj:`vocab_size`
    min_frequency
        :obj:`min_frequency`
    special_tokens
        :obj:`special_tokens`

.. entities:: node

    BpeTrainer
        BpeTrainer
    vocab_size
        :obj:`vocabSize`
    min_frequency
        :obj:`minFrequency`
    special_tokens
        :obj:`specialTokens`

In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information
about the different types of tokenizers, check out this `guide
<https://huggingface.co/transformers/tokenizer_summary.html>`__ in the 🤗 Transformers
@@ -33,7 +66,7 @@ documentation. Here, training the tokenizer means it will learn merge rules by:

- Identify the most common pair of tokens and merge it into one token.
- Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.
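
As a rough illustration of that merge loop, here is a tiny pure-Python sketch. The corpus and
target vocabulary size are made up for the example; the library's actual implementation is in Rust
and considerably more involved::

    from collections import Counter

    # Toy corpus, already split into characters: training starts from all
    # characters present in the corpus as the initial tokens.
    words = [list("hug"), list("pug"), list("hugs"), list("pugs")]
    target_vocab_size = 10

    vocab = {ch for word in words for ch in word}
    while len(vocab) < target_vocab_size:
        # Count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most common pair into a single new token.
        a, b = pairs.most_common(1)[0][0]
        vocab.add(a + b)
        for word in words:
            i = 0
            while i < len(word) - 1:
                if word[i] == a and word[i + 1] == b:
                    word[i:i + 2] = [a + b]
                else:
                    i += 1

    print(sorted(vocab))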
-The main API of the library is the class :class:`~tokenizers.Tokenizer`, here is how we instantiate
+The main API of the library is the :entity:`class` :entity:`Tokenizer`, here is how we instantiate
one with a BPE model:
.. only:: python
@@ -45,7 +78,7 @@ one with a BPE model:
:dedent: 8
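
The snippet pulled in by that ``literalinclude`` is not reproduced in the diff; for the Python
binding it amounts to roughly the following (the ``unk_token`` value is only an example)::

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # A Tokenizer wraps a model; here we give it an (untrained) BPE model.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))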
To train our tokenizer on the wikitext files, we will need to instantiate a `trainer`, in this case
-a :class:`~tokenizers.BpeTrainer`:
+a :entity:`BpeTrainer`:
.. only:: python
@@ -55,10 +88,10 @@ a :class:`~tokenizers.BpeTrainer`:
:end-before: END init_trainer
:dedent: 8
-We can set the training arguments like :obj:`vocab_size` or :obj:`min_frequency` (here left at their
-default values of 30,000 and 0) but the most important part is to give the :obj:`special_tokens` we
-plan to use later on (they are not used at all during training) so that they get inserted in the
-vocabulary.
+We can set the training arguments like :entity:`vocab_size` or :entity:`min_frequency` (here left at
+their default values of 30,000 and 0) but the most important part is to give the
+:entity:`special_tokens` we plan to use later on (they are not used at all during training) so that
+they get inserted in the vocabulary.
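
In the Python binding this corresponds roughly to the following (the particular special-token list
is an example choice, not something required by the library)::

    from tokenizers.trainers import BpeTrainer

    # vocab_size and min_frequency are left at their defaults here; the
    # special tokens are listed so they get inserted in the vocabulary.
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])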
.. note::
@@ -80,7 +113,7 @@ on whitespace.
:end-before: END init_pretok
:dedent: 8
-Now, we can just call the :meth:`~tokenizers.Tokenizer.train` method with any list of files we want
+Now, we can just call the :entity:`Tokenizer.train` method with any list of files we want
to use:
.. only:: python
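
A rough Python equivalent of the elided snippet, assuming a recent version of the library (where
the file list is the first argument of :entity:`Tokenizer.train`), a whitespace pre-tokenizer as
set up above, and placeholder paths for the wikitext files::

    from tokenizers.pre_tokenizers import Whitespace

    # Pre-tokenize on whitespace, then train on the wikitext files.
    tokenizer.pre_tokenizer = Whitespace()
    files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ("train", "test", "valid")]
    tokenizer.train(files, trainer)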
@@ -105,7 +138,7 @@ first instantiating the model.
:dedent: 8
To save the tokenizer in one file that contains all its configuration and vocabulary, just use the
-:meth:`~tokenizers.Tokenizer.save` method:
+:entity:`Tokenizer.save` method:
.. only:: python
@@ -115,8 +148,8 @@ To save the tokenizer in one file that contains all its configuration and vocabu
:end-before: END save
:dedent: 8
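
In Python this is a one-liner (the file name is arbitrary)::

    # Serializes the full tokenizer (model, vocabulary, pre-tokenizer, ...) to one JSON file.
    tokenizer.save("tokenizer-wiki.json")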
-and you can reload your tokenizer from that file with the :meth:`~tokenizers.Tokenizer.from_file`
-class method:
+and you can reload your tokenizer from that file with the :entity:`Tokenizer.from_file`
+:entity:`classmethod`:
.. only:: python
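
The included snippet is again elided here; in Python the reload looks roughly like this (the path
must match whatever was passed to ``save``)::

    from tokenizers import Tokenizer

    # Rebuild the exact same tokenizer from the JSON file written above.
    tokenizer = Tokenizer.from_file("tokenizer-wiki.json")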