Update CHANGELOGs

2025-12-07 05:08:24 +00:00 · 2020-03-05 17:32:43 -05:00
parent d778ed5e0a
commit 86d2e90ad2
2 changed files with 5 additions and 0 deletions
--- a/bindings/python/CHANGELOG.md
+++ b/bindings/python/CHANGELOG.md
@@ -4,11 +4,14 @@
 - Keep only one progress bar while reading files during training. This is better for use-cases with
 a high number of files as it avoids having too many progress bars on screen.
 - `add_prefix_space` option of the `ByteLevel` `PreTokenizer` has been moved to a `Normalizer`
+- Added the `ByteLevel` `PostProcessor` to take care of fixing the offsets when a Unicode character
+gets split up into multiple byte-level characters.

 ## How to migrate:
 - Use the `ByteLevel` `Normalizer` with `add_prefix_space=True` in addition to the `PreTokenizer`.
 The `PreTokenizer` does not handle this option anymore. This fixes some issues with the offsets
 being wrong if this option was on.
+- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers.

 # v0.6.0

--- a/tokenizers/CHANGELOG.md
+++ b/tokenizers/CHANGELOG.md
@@ -6,6 +6,8 @@ a high number of files as it avoids having too many progress bars on screen.
 - Improve BPE and WordPiece builders.
 - `ByteLevel` is also a `Normalizer` and handles the `add_prefix_space` option at this level now.
 This fixes some issues with the offsets being wrong if this option was on.
+- `ByteLevel` is also a `PostProcessor` now and handles fixing the offsets when a Unicode
+character gets split up into multiple byte-level characters.

 ## How to migrate:
 - Use the `ByteLevel` as a `Normalizer` if `add_prefix_space` is required.
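
The offset problem that both changelog entries mention can be illustrated with plain Python, independent of the library. This is only a sketch of the idea, not the `ByteLevel` `PostProcessor` implementation: the helper `byte_to_char_offsets` and the token boundaries in `byte_spans` are hypothetical, and spans are assumed to be non-empty `(start, end)` pairs over the UTF-8 bytes of the input.

```python
def byte_to_char_offsets(text, byte_spans):
    """Map non-empty (start, end) spans over UTF-8 bytes to spans over characters.

    Hypothetical helper for illustration; not part of the tokenizers API.
    """
    # Build a lookup from each byte position to the index of the character
    # that produced it. A multi-byte character fills several slots.
    byte_to_char = []
    for char_index, ch in enumerate(text):
        byte_to_char.extend([char_index] * len(ch.encode("utf-8")))

    char_spans = []
    for start, end in byte_spans:
        char_start = byte_to_char[start]
        # `end` is exclusive, so look up the last byte covered and add one.
        char_end = byte_to_char[end - 1] + 1
        char_spans.append((char_start, char_end))
    return char_spans


text = "héllo"
# "é" takes two bytes in UTF-8, so the text is 6 bytes long for 5 characters.
# These made-up token boundaries cut "é" in half.
byte_spans = [(0, 2), (2, 6)]
char_spans = byte_to_char_offsets(text, byte_spans)
print(char_spans)
```

Both resulting spans include the character at index 1, because each token holds one of the two bytes of "é". Without a correction step of this kind, offsets counted in byte-level units would drift away from positions in the original string after the first multi-byte character.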