mirror of
https://github.com/mii443/tokenizers.git
synced 2025-08-22 16:25:30 +00:00
Update pipeline.mdx
Fix conversion errors
@@ -520,7 +520,7 @@ On top of encoding the input texts, a `Tokenizer` also has an API for decoding,
 generated by your model back to a text. This is done by the methods
 `Tokenizer.decode` (for one predicted text) and `Tokenizer.decode_batch` (for a batch of predictions).
 
-The [decoder]{.title-ref} will first convert the IDs back to tokens
+The `decoder` will first convert the IDs back to tokens
 (using the tokenizer's vocabulary) and remove all special tokens, then
 join those tokens with spaces:
 
@@ -556,7 +556,7 @@ join those tokens with spaces:
 
 If you used a model that added special characters to represent subtokens
 of a given "word" (like the `"##"` in
-WordPiece) you will need to customize the [decoder]{.title-ref} to treat
+WordPiece) you will need to customize the `decoder` to treat
 them properly. If we take our previous `bert_tokenizer` for instance the
 default decoding will give:
 
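The change above only fixes markup, but the passage it touches describes real behavior: the default decoder joins tokens back together with spaces, while WordPiece subtokens carry a `"##"` prefix that must be merged onto the previous token instead. A minimal plain-Python sketch of that difference (illustrative only — this is not the `tokenizers` library API, and the token list is made up):

```python
def default_decode(tokens):
    """Join tokens with spaces, as the default decoding does."""
    return " ".join(tokens)


def wordpiece_decode(tokens, prefix="##"):
    """Merge '##'-prefixed subtokens back onto the previous token."""
    out = ""
    for token in tokens:
        if token.startswith(prefix):
            out += token[len(prefix):]  # continuation: glue onto previous token
        elif out:
            out += " " + token          # new word: separate with a space
        else:
            out = token                 # first token
    return out


tokens = ["Hello", ",", "y", "##'", "##re", "welcome"]
print(default_decode(tokens))    # Hello , y ##' ##re welcome
print(wordpiece_decode(tokens))  # Hello , y're welcome
```

This is why a plain space-join gives unreadable output for a WordPiece model, and the decoder has to be made prefix-aware.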