diff --git a/bindings/python/CHANGELOG.md b/bindings/python/CHANGELOG.md
index 9dbe5d9c..246a317f 100644
--- a/bindings/python/CHANGELOG.md
+++ b/bindings/python/CHANGELOG.md
@@ -4,11 +4,14 @@
 - Keep only one progress bar while reading files during training. This is better for use-cases with
 a high number of files as it avoids having too many progress bar on screen.
 - `add_prefix_space` option of the `ByteLevel` `PreTokenizer` has been moved to a `Normalizer`
+- Added the `ByteLevel` `PostProcessor` to take care of fixing the offsets when a unicode character
+gets split up as multiple byte-level characters.
 
 ## How to migrate:
 - Use the `ByteLevel` `Normalizer` with `add_prefix_space=True` in addition to the `PreTokenizer`.
 The `PreTokenizer` does not handle this option anymore. This fixes some issues with the offsets
 being wrong if this option was on.
+- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers.
 
 # v0.6.0
 
diff --git a/tokenizers/CHANGELOG.md b/tokenizers/CHANGELOG.md
index dc2fcb52..469ccaad 100644
--- a/tokenizers/CHANGELOG.md
+++ b/tokenizers/CHANGELOG.md
@@ -6,6 +6,8 @@
 a high number of files as it avoids having too many progress bar on screen.
 - Improve BPE and WordPiece builders.
 - `ByteLevel` is also a `Normalizer` and handles the `add_prefix_space` option at this level now. This fixes some issues with the offsets being wrong if this option was on.
+- `ByteLevel` is also a `PostProcessor` now and handles fixing the offsets when a unicode
+character get split up in a byte-level character.
 
 ## How to migrate:
 - Use the `ByteLevel` as a `Normalizer` if `add_prefix_space` is required.
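
Note on the offset problem these changelog entries describe: a minimal, standard-library-only sketch of why offsets drift when one unicode character becomes several byte-level characters, and one way to map them back. This is not the library's implementation. The function name `byte_to_char_offsets` and the span representation are hypothetical, and the sketch assumes non-empty, half-open `(start, end)` spans over the UTF-8 bytes of the text.

```python
def byte_to_char_offsets(text, byte_spans):
    """Map half-open (start, end) spans over UTF-8 bytes to spans over characters.

    Hypothetical helper for illustration only; assumes every span is non-empty.
    """
    # byte_to_char[i] is the index of the character that byte i belongs to.
    byte_to_char = []
    for char_index, char in enumerate(text):
        byte_to_char.extend([char_index] * len(char.encode("utf-8")))

    char_spans = []
    for start, end in byte_spans:
        first_char = byte_to_char[start]
        last_char = byte_to_char[end - 1]
        # A span covering only part of a multi-byte character is widened
        # to cover that whole character.
        char_spans.append((first_char, last_char + 1))
    return char_spans


text = "né"  # "é" is two bytes in UTF-8, so the text is 2 characters but 3 bytes
byte_spans = [(i, i + 1) for i in range(len(text.encode("utf-8")))]
char_spans = byte_to_char_offsets(text, byte_spans)
print(char_spans)
```

Both byte-level pieces of "é" end up pointing at the same single character of the original text, which is the kind of correction the `ByteLevel` `PostProcessor` entry refers to.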