Rust - Fix offsets when there are added tokens

Anthony MOI
2020-03-19 12:53:03 -04:00
parent 2aeae555e2
commit d953d58cee
6 changed files with 87 additions and 11 deletions


@@ -18,6 +18,7 @@ special tokens. This is activated by default. ([#193](https://github.com/hugging
- Fix some issues with the offsets being wrong with the `ByteLevel` BPE ([#193](https://github.com/huggingface/tokenizers/pull/193)):
  - when `add_prefix_space=True`
  - when a Unicode character gets split up into multiple byte-level characters ([#156](https://github.com/huggingface/tokenizers/issues/156))
- Fix a bug where offsets were wrong when there were added tokens in the sequence being encoded.
## How to migrate:
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
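
To make the offsets entry above concrete, here is a minimal sketch of checking offsets around an added token. It assumes a recent version of the `tokenizers` Rust crate (whose API differs from the one at the time of this commit); the `tokenizer.json` path and the `[NEW]` token are placeholders, not part of this change.

```rust
use tokenizers::{AddedToken, Tokenizer};

fn main() -> tokenizers::Result<()> {
    // Load a serialized tokenizer (hypothetical path).
    let mut tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Register an added token; the bug fixed here could shift the offsets
    // reported for tokens surrounding such an added token.
    tokenizer.add_tokens(&[AddedToken::from("[NEW]", false)]);

    let input = "hello [NEW] world";
    let encoding = tokenizer.encode(input, false)?;

    // Each offset should map back into the original input string.
    for (token, &(start, end)) in encoding.get_tokens().iter().zip(encoding.get_offsets()) {
        println!("{:>8} -> {:?}", token, &input[start..end]);
    }
    Ok(())
}
```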