Doc - Update Decoder part of the Pipeline page
@@ -18,6 +18,10 @@
         :meth:`~tokenizers.Tokenizer.encode`
     Tokenizer.encode_batch
         :meth:`~tokenizers.Tokenizer.encode_batch`
+    Tokenizer.decode
+        :meth:`~tokenizers.Tokenizer.decode`
+    Tokenizer.decode_batch
+        :meth:`~tokenizers.Tokenizer.decode_batch`
     Tokenizer.token_to_id
         :meth:`~tokenizers.Tokenizer.token_to_id`
     Tokenizer.enable_padding
@@ -42,6 +46,8 @@
         :class:`~tokenizers.models.WordLevel`
     models.WordPiece
         :class:`~tokenizers.models.WordPiece`
+    Decoder
+        :class:`~tokenizers.decoders.Decoder`
 
 .. entities:: rust
 
@@ -63,6 +69,10 @@
         :rust:meth:`~tokenizers::tokenizer::Tokenizer::encode`
     Tokenizer.encode_batch
         :rust:meth:`~tokenizers::tokenizer::Tokenizer::encode_batch`
+    Tokenizer.decode
+        :rust:meth:`~tokenizers::tokenizer::Tokenizer::decode`
+    Tokenizer.decode_batch
+        :rust:meth:`~tokenizers::tokenizer::Tokenizer::decode_batch`
     Tokenizer.token_to_id
         :rust:meth:`~tokenizers::tokenizer::Tokenizer::token_to_id`
     Tokenizer.enable_padding
@@ -87,6 +97,8 @@
         :rust:struct:`~tokenizers::models::wordlevel::WordLevel`
     models.WordPiece
         :rust:struct:`~tokenizers::models::wordpiece::WordPiece`
+    Decoder
+        :rust:trait:`~tokenizers::tokenizer::Decoder`
 
 .. entities:: node
 
@@ -108,6 +120,10 @@
         :obj:`Tokenizer.encode()`
     Tokenizer.encode_batch
         :obj:`Tokenizer.encodeBatch()`
+    Tokenizer.decode
+        :obj:`Tokenizer.decode()`
+    Tokenizer.decode_batch
+        :obj:`Tokenizer.decodeBatch()`
     Tokenizer.token_to_id
         :obj:`Tokenizer.tokenToId()`
     Tokenizer.enable_padding
@@ -132,3 +148,5 @@
         :obj:`WordLevel`
     models.WordPiece
         :obj:`WordPiece`
+    Decoder
+        :obj:`Decoder`
@@ -447,40 +447,104 @@ We can use this tokenizer and train it on wikitext like in the :doc:`quicktour`
 
 Decoding
 ----------------------------------------------------------------------------------------------------
 
-On top of encoding the input texts, a :class:`~tokenizers.Tokenizer` also has an API for decoding,
+.. entities:: python
+
+    bert_tokenizer
+        :obj:`bert_tokenizer`
+
+.. entities:: rust
+
+    bert_tokenizer
+        :obj:`bert_tokenizer`
+
+.. entities:: node
+
+    bert_tokenizer
+        :obj:`bertTokenizer`
+
+
+On top of encoding the input texts, a :entity:`Tokenizer` also has an API for decoding,
 that is converting IDs generated by your model back to a text. This is done by the methods
-:meth:`~tokenizers.Tokenizer.decode` (for one predicted text) and
-:meth:`~tokenizers.Tokenizer.decode_batch` (for a batch of predictions).
+:entity:`Tokenizer.decode` (for one predicted text) and :entity:`Tokenizer.decode_batch` (for a
+batch of predictions).
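
As an aside (not part of this commit), :obj:`decode_batch` mirrors :obj:`encode_batch` and takes a
list of ID sequences. A minimal sketch, assuming :obj:`tokenizer` is any trained
:class:`~tokenizers.Tokenizer` (the file name is illustrative):

.. code-block:: python

    from tokenizers import Tokenizer

    # Assumption: a previously trained and saved tokenizer; the path is illustrative.
    tokenizer = Tokenizer.from_file("tokenizer.json")

    # decode turns a single sequence of IDs back into a string.
    output = tokenizer.encode("Hello, y'all!")
    print(tokenizer.decode(output.ids))

    # decode_batch does the same for several sequences at once.
    outputs = tokenizer.encode_batch(["Hello, y'all!", "How are you?"])
    print(tokenizer.decode_batch([o.ids for o in outputs]))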
 
 The `decoder` will first convert the IDs back to tokens (using the tokenizer's vocabulary) and
 remove all special tokens, then join those tokens with spaces:
 
-.. code-block:: python
+.. only:: python
 
-    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
-    print(output.ids)
-    # [27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35]
+    .. literalinclude:: ../../bindings/python/tests/documentation/test_pipeline.py
+        :language: python
+        :start-after: START test_decoding
+        :end-before: END test_decoding
+        :dedent: 8
 
-    tokenizer.decode([27194, 16, 93, 11, 5068, 5, 7928, 5083, 6190, 0, 35])
-    # "Hello , y ' all ! How are you ?"
+.. only:: rust
+
+    .. literalinclude:: ../../tokenizers/tests/documentation.rs
+        :language: rust
+        :start-after: START pipeline_test_decoding
+        :end-before: END pipeline_test_decoding
+        :dedent: 4
+
+.. only:: node
+
+    .. literalinclude:: ../../bindings/node/examples/documentation/pipeline.test.ts
+        :language: javascript
+        :start-after: START test_decoding
+        :end-before: END test_decoding
+        :dedent: 8
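
The special-token removal described above is controlled by the :obj:`skip_special_tokens` argument
of :obj:`decode`; a minimal sketch (an aside, not part of this commit), assuming the same trained
:obj:`tokenizer` as above:

.. code-block:: python

    # skip_special_tokens defaults to True and strips tokens like [CLS] or [SEP];
    # pass False to keep them in the decoded string.
    ids = tokenizer.encode("Hello, y'all!").ids
    print(tokenizer.decode(ids))                             # special tokens removed
    print(tokenizer.decode(ids, skip_special_tokens=False))  # special tokens kept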
 
 If you used a model that added special characters to represent subtokens of a given "word" (like
 the :obj:`"##"` in WordPiece) you will need to customize the `decoder` to treat them properly. If we
-take our previous :obj:`bert_tokenizer` for instance the default decoding will give:
+take our previous :entity:`bert_tokenizer` for instance the default decoding will give:
 
-.. code-block:: python
+.. only:: python
 
-    output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
-    print(output.tokens)
-    # ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]
+    .. literalinclude:: ../../bindings/python/tests/documentation/test_pipeline.py
+        :language: python
+        :start-after: START bert_test_decoding
+        :end-before: END bert_test_decoding
+        :dedent: 8
 
-    bert_tokenizer.decode(output.ids)
-    # "welcome to the tok ##eni ##zer ##s library ."
+.. only:: rust
+
+    .. literalinclude:: ../../tokenizers/tests/documentation.rs
+        :language: rust
+        :start-after: START bert_test_decoding
+        :end-before: END bert_test_decoding
+        :dedent: 4
+
+.. only:: node
+
+    .. literalinclude:: ../../bindings/node/examples/documentation/pipeline.test.ts
+        :language: javascript
+        :start-after: START bert_test_decoding
+        :end-before: END bert_test_decoding
+        :dedent: 8
 
 But by changing it to a proper decoder, we get:
 
-.. code-block:: python
+.. only:: python
 
-    bert_tokenizer.decoder = tokenizers.decoders.WordPiece()
-    bert_tokenizer.decode(output.ids)
-    # "welcome to the tokenizers library."
+    .. literalinclude:: ../../bindings/python/tests/documentation/test_pipeline.py
+        :language: python
+        :start-after: START bert_proper_decoding
+        :end-before: END bert_proper_decoding
+        :dedent: 8
+
+.. only:: rust
+
+    .. literalinclude:: ../../tokenizers/tests/documentation.rs
+        :language: rust
+        :start-after: START bert_proper_decoding
+        :end-before: END bert_proper_decoding
+        :dedent: 4
+
+.. only:: node
+
+    .. literalinclude:: ../../bindings/node/examples/documentation/pipeline.test.ts
+        :language: javascript
+        :start-after: START bert_proper_decoding
+        :end-before: END bert_proper_decoding
+        :dedent: 8
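
If a model marks subtokens with something other than :obj:`"##"`, the :obj:`WordPiece` decoder
takes a :obj:`prefix` argument (and a :obj:`cleanup` flag for spacing artifacts). A minimal sketch
(an aside, not part of this commit), reusing the :obj:`bert_tokenizer` and :obj:`output` from the
example above:

.. code-block:: python

    from tokenizers import decoders

    # prefix is the continuation marker stripped when merging subtokens;
    # cleanup removes artifacts such as spaces before punctuation.
    bert_tokenizer.decoder = decoders.WordPiece(prefix="##", cleanup=True)
    print(bert_tokenizer.decode(output.ids))
    # "welcome to the tokenizers library."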