Fix indexing bug in add_tokens()

When adding tokens to the Tokenizer, the id of each new token is currently calculated as `self.model.get_vocab_size() - 1 + self.added_tokens.len()`, which looks like a bug. For example, suppose I have a vocabulary of 1000 tokens with ids from 0 to 999 and I want to add a single new token, so it should get id 1000. What I actually get, however, is 1000 - 1 + 0 = 999, which collides with the last existing id. It seems the `-1` is not needed here.
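
As a sanity check, here is a minimal standalone sketch of the arithmetic. The variable names and the vocab size of 1000 are illustrative only, not the actual Tokenizer fields:

```rust
// Illustrative off-by-one check (hypothetical values, not the real Tokenizer state):
// with 1000 existing tokens (ids 0..=999) and no previously added tokens,
// the next free id should be 1000.
fn main() {
    let vocab_size: usize = 1000;    // assumed size of the existing vocabulary
    let added_tokens_len: usize = 0; // no extra tokens added yet

    let buggy_id = (vocab_size - 1 + added_tokens_len) as u32; // 999, clashes with an existing id
    let fixed_id = (vocab_size + added_tokens_len) as u32;     // 1000, the first unused id

    assert_eq!(buggy_id, 999);
    assert_eq!(fixed_id, 1000);
    println!("buggy: {}, fixed: {}", buggy_id, fixed_id);
}
```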
This commit is contained in:
Denis Zolotukhin
2020-01-21 12:58:22 +03:00
committed by GitHub
parent da7e629e4a
commit 048ab46089

@@ -587,7 +587,7 @@ impl Tokenizer {
                 continue;
             }
-            let new_id = (self.model.get_vocab_size() - 1 + self.added_tokens.len()) as u32;
+            let new_id = (self.model.get_vocab_size() + self.added_tokens.len()) as u32;
             let id = self
                 .added_tokens
                 .entry(token.clone())