mirror of https://github.com/mii443/tokenizers.git (synced 2025-08-22 16:25:30 +00:00)
Doc - Update Model part of the Pipeline page
@@ -34,6 +34,14 @@
         :class:`~tokenizers.pre_tokenizers.Whitespace`
     PreTokenizer
         :class:`~tokenizers.pre_tokenizers.PreTokenizer`
+    models.BPE
+        :class:`~tokenizers.models.BPE`
+    models.Unigram
+        :class:`~tokenizers.models.Unigram`
+    models.WordLevel
+        :class:`~tokenizers.models.WordLevel`
+    models.WordPiece
+        :class:`~tokenizers.models.WordPiece`

 .. entities:: rust

@@ -71,6 +79,14 @@
         :rust:struct:`~tokenizers::normalizers::whitespace::Whitespace`
     PreTokenizer
         :rust:trait:`~tokenizers::tokenizer::PreTokenizer`
+    models.BPE
+        :rust:struct:`~tokenizers::models::bpe::BPE`
+    models.Unigram
+        :rust:struct:`~tokenizers::models::unigram::Unigram`
+    models.WordLevel
+        :rust:struct:`~tokenizers::models::wordlevel::WordLevel`
+    models.WordPiece
+        :rust:struct:`~tokenizers::models::wordpiece::WordPiece`

 .. entities:: node

@@ -108,3 +124,11 @@
         :obj:`Whitespace`
     PreTokenizer
         :obj:`PreTokenizer`
+    models.BPE
+        :obj:`BPE`
+    models.Unigram
+        :obj:`Unigram`
+    models.WordLevel
+        :obj:`WordLevel`
+    models.WordPiece
+        :obj:`WordPiece`
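The three hunks above register the same four model entities for each language binding (python, rust, node), so the pipeline page can refer to them through a single :entity: role. As a rough illustration of what these entities resolve to in the Python bindings, here is a minimal sketch that instantiates each supported model; the unk_token values are illustrative choices, not requirements:

    from tokenizers.models import BPE, Unigram, WordLevel, WordPiece

    # Each model can be built empty and trained later, or loaded from
    # an existing vocabulary.
    bpe = BPE(unk_token="[UNK]")                # byte-pair encoding merges
    unigram = Unigram()                         # probabilistic subword model
    word_level = WordLevel(unk_token="[UNK]")   # plain word-to-id lookup
    word_piece = WordPiece(unk_token="[UNK]")   # WordPiece, as used by BERT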
@@ -246,20 +246,20 @@ scratch afterward.

 The Model
 ----------------------------------------------------------------------------------------------------

-Once the input texts are normalized and pre-tokenized, we can apply the model on the pre-tokens.
-This is the part of the pipeline that needs training on your corpus (or that has been trained if you
-are using a pretrained tokenizer).
+Once the input texts are normalized and pre-tokenized, the :entity:`Tokenizer` applies the model on
+the pre-tokens. This is the part of the pipeline that needs training on your corpus (or that has
+been trained if you are using a pretrained tokenizer).

 The role of the model is to split your "words" into tokens, using the rules it has learned. It's
 also responsible for mapping those tokens to their corresponding IDs in the vocabulary of the model.

-This model is passed along when initializing the :class:`~tokenizers.Tokenizer` so you already know
+This model is passed along when initializing the :entity:`Tokenizer` so you already know
 how to customize this part. Currently, the 🤗 Tokenizers library supports:

-- :class:`~tokenizers.models.BPE`
-- :class:`~tokenizers.models.Unigram`
-- :class:`~tokenizers.models.WordLevel`
-- :class:`~tokenizers.models.WordPiece`
+- :entity:`models.BPE`
+- :entity:`models.Unigram`
+- :entity:`models.WordLevel`
+- :entity:`models.WordPiece`

 For more details about each model and its behavior, you can check `here <components.html#models>`__
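Since the rewritten paragraph stresses that the model is passed along when initializing the :entity:`Tokenizer`, a short end-to-end sketch of that flow in the Python bindings may help. This uses the public tokenizers API; "corpus.txt" is a placeholder for your own training files:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # The model is handed to the Tokenizer at initialization.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Train the model part of the pipeline on your corpus.
    trainer = BpeTrainer(special_tokens=["[UNK]"])
    tokenizer.train(["corpus.txt"], trainer)

    # The trained model splits pre-tokens and maps them to vocabulary ids.
    print(tokenizer.encode("Hello world").tokens)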