Fix a bug when adding special tokens

If we add special tokens that are already part of the model's vocabulary, they aren't added to the tokenizer, which then builds an empty regex. This completely breaks tokenization.
Anthony MOI
2019-12-26 14:32:50 -05:00
parent d93d4fc3cd
commit d1e59e09bf

@@ -577,7 +577,12 @@ impl Tokenizer {
             })
             .collect::<Vec<_>>();
-        self.split_re = Some(regex::Regex::new(&format!(r"({})", added_tokens.join("|"))).unwrap());
+        if added_tokens.is_empty() {
+            self.split_re = None;
+        } else {
+            self.split_re =
+                Some(regex::Regex::new(&format!(r"({})", added_tokens.join("|"))).unwrap());
+        }
         // Return the number of added tokens
         tokens.len() - ignored
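
For reference, a minimal illustrative sketch of why the old code path was harmful (not part of the commit; it assumes split_re is later used to split incoming text on the added tokens): when no tokens are actually added, added_tokens.join("|") is empty, so the compiled pattern is `()`, which matches the empty string at every position.

use regex::Regex;

fn main() {
    // Hypothetical reproduction: no added tokens means the joined pattern is empty.
    let added_tokens: Vec<String> = vec![];
    let pattern = format!(r"({})", added_tokens.join("|"));
    assert_eq!(pattern, "()");

    // `()` compiles fine but matches the empty string at every position...
    let split_re = Regex::new(&pattern).unwrap();

    // ...so splitting on it shatters the text instead of leaving it whole.
    let pieces: Vec<&str> = split_re.split("Hello world").collect();
    println!("{:?}", pieces);
    // Prints per-character fragments (exact boundaries depend on the regex
    // crate version) rather than ["Hello world"].
}

Setting split_re to None in the empty case, as the diff above does, avoids ever compiling such a pattern.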