From 00132ba8366d1b0f3a3ca6dcf9b3c42b1638b9fd Mon Sep 17 00:00:00 2001
From: Mishig Davaadorj
Date: Mon, 25 Apr 2022 21:03:31 +0200
Subject: [PATCH] Update pipeline.mdx

Fix conversion errors

---
 docs/source-doc-builder/pipeline.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source-doc-builder/pipeline.mdx b/docs/source-doc-builder/pipeline.mdx
index 9a9e7079..d40029a0 100644
--- a/docs/source-doc-builder/pipeline.mdx
+++ b/docs/source-doc-builder/pipeline.mdx
@@ -520,7 +520,7 @@ On top of encoding the input texts, a `Tokenizer` also has an API for
 decoding, that is converting IDs generated by your model back to a text.
 This is done by the methods `Tokenizer.decode` (for one predicted text)
 and `Tokenizer.decode_batch` (for a batch of predictions).
-The [decoder]{.title-ref} will first convert the IDs back to tokens
+The `decoder` will first convert the IDs back to tokens
 (using the tokenizer's vocabulary) and remove all special tokens, then
 join those tokens with spaces:
 
@@ -556,7 +556,7 @@
 
 If you used a model that added special characters to represent
 subtokens of a given "word" (like the `"##"` in
-WordPiece) you will need to customize the [decoder]{.title-ref} to treat
+WordPiece) you will need to customize the `decoder` to treat
 them properly. If we take our previous `bert_tokenizer` for instance the
 default decoding will give:
 
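
For reference, a minimal sketch of the decoding behavior the patched passage describes, using the Python `tokenizers` API; the checkpoint name `bert-base-uncased` and the sample sentence are illustrative only, and the example assumes network access to fetch the tokenizer:

```python
from tokenizers import Tokenizer, decoders

# A WordPiece tokenizer, loaded from the Hub (illustrative checkpoint name).
bert_tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

output = bert_tokenizer.encode("Welcome to the tokenizers library.")

# Without a decoder configured for WordPiece, the "##" subword markers
# survive and the tokens are merely joined with spaces.
print(bert_tokenizer.decode(output.ids))
# e.g. "welcome to the token ##izer ##s library ."

# Attaching a WordPiece decoder merges "##" continuations back into
# whole words during decoding.
bert_tokenizer.decoder = decoders.WordPiece()
print(bert_tokenizer.decode(output.ids))
# e.g. "welcome to the tokenizers library."
```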