Fix indexing bug in add_tokens()
When adding tokens to the Tokenizer, the id of each new token is currently calculated as `self.model.get_vocab_size() - 1 + self.added_tokens.len()`, which looks like an off-by-one bug. For example, suppose I have a vocabulary of 1000 tokens with ids from 0 to 999. If I now add a single new token, it should get id 1000. What I get instead is 1000 - 1 + 0 = 999, which collides with an existing id. The `-1` is not needed here.
@@ -587,7 +587,7 @@ impl Tokenizer {
             continue;
         }

-        let new_id = (self.model.get_vocab_size() - 1 + self.added_tokens.len()) as u32;
+        let new_id = (self.model.get_vocab_size() + self.added_tokens.len()) as u32;
         let id = self
             .added_tokens
             .entry(token.clone())
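To make the off-by-one concrete, here is a minimal, self-contained Rust sketch. The `ToyTokenizer` struct, its fields, and `add_token` are hypothetical stand-ins for the real `Tokenizer`, `Model::get_vocab_size()`, and the `added_tokens` map; this is not the actual tokenizers API, it only demonstrates the id arithmetic before and after the fix.

// Minimal sketch (hypothetical types, not the real tokenizers API):
// with a base vocabulary of size 1000 (ids 0..=999), the first added
// token should receive id 1000.

use std::collections::HashMap;

struct ToyTokenizer {
    vocab_size: usize,                  // stands in for self.model.get_vocab_size()
    added_tokens: HashMap<String, u32>, // stands in for self.added_tokens
}

impl ToyTokenizer {
    fn add_token(&mut self, token: &str) -> u32 {
        // Old formula: vocab_size - 1 + added_tokens.len() -> 999 for the
        // first added token, colliding with an existing id.
        // Fixed formula: vocab_size + added_tokens.len()   -> 1000.
        let new_id = (self.vocab_size + self.added_tokens.len()) as u32;
        *self
            .added_tokens
            .entry(token.to_string())
            .or_insert(new_id)
    }
}

fn main() {
    let mut tok = ToyTokenizer {
        vocab_size: 1000,
        added_tokens: HashMap::new(),
    };
    assert_eq!(tok.add_token("<new>"), 1000);     // old formula would have returned 999
    assert_eq!(tok.add_token("<another>"), 1001); // subsequent tokens keep counting up
    println!("ok");
}

With the old formula the first added token would shadow an existing vocabulary entry; with the fix each added token gets the next unused id.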