diff --git a/bindings/python/CHANGELOG.md b/bindings/python/CHANGELOG.md
index 4010fd9f..9dbe5d9c 100644
--- a/bindings/python/CHANGELOG.md
+++ b/bindings/python/CHANGELOG.md
@@ -3,6 +3,12 @@
 ## Changes:
 - Keep only one progress bar while reading files during training. This is better for use-cases with
 a high number of files as it avoids having too many progress bar on screen.
+- `add_prefix_space` option of the `ByteLevel` `PreTokenizer` has been moved to a `Normalizer`
+
+## How to migrate:
+- Use the `ByteLevel` `Normalizer` with `add_prefix_space=True` in addition to the `PreTokenizer`.
+The `PreTokenizer` does not handle this option anymore. This fixes some issues with the offsets
+being wrong if this option was on.
 
 # v0.6.0
 
diff --git a/tokenizers/CHANGELOG.md b/tokenizers/CHANGELOG.md
index 8a76b7f6..dc2fcb52 100644
--- a/tokenizers/CHANGELOG.md
+++ b/tokenizers/CHANGELOG.md
@@ -4,6 +4,11 @@
 - Keep only one progress bar while reading files during training. This is better for use-cases with
 a high number of files as it avoids having too many progress bar on screen.
 - Improve BPE and WordPiece builders.
+- `ByteLevel` is also a `Normalizer` and handles the `add_prefix_space` option at this level now.
+This fixes some issues with the offsets being wrong if this option was on.
+
+## How to migrate:
+- Use the `ByteLevel` as a `Normalizer` if `add_prefix_space` is required.
 
 # v0.8.0
 
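
Note on the offsets issue mentioned in both changelog entries: the diff does not say how the offsets went wrong, so the following is only a minimal, standard-library sketch of one plausible mechanism, not the library's implementation. It assumes that a prefix space added during pre-tokenization shifts every offset by one relative to the original input, while a normalization step that records an alignment can map offsets back. The names `add_prefix_space`, `split_keep_space`, and `to_original` are hypothetical helpers written for this illustration and are not part of the `tokenizers` API.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def add_prefix_space(text: str) -> Tuple[str, List[int]]:
    """Hypothetical normalizer step: prepend a space and record, for each
    normalized character, the index of the original character it came from
    (-1 marks a character that was inserted)."""
    if text.startswith(" "):
        return text, list(range(len(text)))
    return " " + text, [-1] + list(range(len(text)))


def split_keep_space(text: str) -> List[Tuple[str, Span]]:
    """Toy pre-tokenizer: split before each space, so every token keeps its
    leading space. Offsets are relative to the string it receives."""
    tokens = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or text[i] == " ":
            if i > start:
                tokens.append((text[start:i], (start, i)))
            start = i
    return tokens


def to_original(span: Span, alignment: List[int]) -> Span:
    """Map a span over the normalized text back onto the original text,
    skipping characters that were inserted by normalization."""
    kept = [alignment[i] for i in range(*span) if alignment[i] >= 0]
    if not kept:
        return (0, 0)
    return (kept[0], kept[-1] + 1)


original = "Hello world"
normalized, alignment = add_prefix_space(original)
tokens = split_keep_space(normalized)

# Offsets taken directly from the prefixed string are shifted by one.
naive = [original[start:end] for _, (start, end) in tokens]
# Offsets mapped through the alignment point at the right characters.
mapped = [original[slice(*to_original(span, alignment))] for _, span in tokens]

print(naive)   # ['Hello ', 'world']
print(mapped)  # ['Hello', ' world']
```

In this sketch the pre-tokenizer never needs to know about the prefix space: it only sees the normalized string, and the alignment produced during normalization is what keeps the reported offsets tied to the original input.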