mirror of https://github.com/mii443/tokenizers.git
synced 2025-12-07 05:08:24 +00:00
Update CHANGELOGs
@@ -6,18 +6,28 @@ a high number of files as it avoids having too many progress bars on screen.
- `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated. This
  avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
  whitespaces are part of the actual token.
  It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
  ([#188](https://github.com/huggingface/tokenizers/pull/188))
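To make the offset-trimming behavior concrete, here is a minimal self-contained sketch of what trimming does to offsets that cover a token's leading whitespace. This is an illustration of the effect of `trim_offsets=True`, not the library's actual implementation:

```python
# Illustrative sketch (NOT the library's code): shrink each (start, end)
# offset so it no longer covers leading/trailing whitespace in the text.
def trim_offsets(text, offsets):
    trimmed = []
    for start, end in offsets:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        trimmed.append((start, end))
    return trimmed

text = "hello world"
# Byte-level tokens roughly split as "hello" and " world";
# the second token's raw offsets include the leading space.
offsets = [(0, 5), (5, 11)]
print(trim_offsets(text, offsets))  # [(0, 5), (6, 11)]
```

With trimming off (the default here), the raw `(5, 11)` span would still include the space that is part of the byte-level token.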
- `encode` and `encode_batch` now take a new optional argument, specifying whether we should add the
  special tokens. This is activated by default. ([#193](https://github.com/huggingface/tokenizers/pull/193))
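A toy sketch of the flag's behavior, using a hypothetical stand-in function and placeholder `<s>`/`</s>` special tokens (not the library's API or its actual special tokens):

```python
# Toy stand-in (NOT the real tokenizers API): shows that special tokens
# are added by default and can be switched off per call.
def encode(tokens, add_special_tokens=True):
    if add_special_tokens:
        return ["<s>"] + tokens + ["</s>"]
    return list(tokens)

print(encode(["hello", "world"]))                            # ['<s>', 'hello', 'world', '</s>']
print(encode(["hello", "world"], add_special_tokens=False))  # ['hello', 'world']
```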
- `original_str` and `normalized_str` have been removed from the `Encoding` returned by `encode` and
  `encode_batch`. This brings a reduction of 70% in the memory footprint.
  ([#197](https://github.com/huggingface/tokenizers/pull/197))

## Fixes:
- Fix some issues with the offsets being wrong with the `ByteLevel` BPE ([#193](https://github.com/huggingface/tokenizers/pull/193)):
  - when `add_prefix_space=True`
  - when a Unicode character gets split up into multiple byte-level characters ([#156](https://github.com/huggingface/tokenizers/issues/156))

## How to migrate:
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
  using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
  of `encode`, so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
  are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling
  `normalize(sequence)` on the `Tokenizer`.

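Taken together, the migration amounts to: stop reading strings off the `Encoding` and treat its offsets as indices into the input you passed to `encode`. A self-contained sketch of the new access pattern, using a simplified stand-in for the real `Encoding` class:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Encoding:
    # Simplified stand-in for the real Encoding: it no longer stores
    # original_str or normalized_str, only tokens and offsets that
    # index into the original input string.
    tokens: List[str]
    offsets: List[Tuple[int, int]]

original = "Hello world"
enc = Encoding(tokens=["Hello", "world"], offsets=[(0, 5), (6, 11)])

# Offsets are now relative to the original string by default, so slicing
# the input you already have recovers each token's surface text:
spans = [original[s:e] for s, e in enc.offsets]
print(spans)  # ['Hello', 'world']
```

Dropping the two stored strings from every `Encoding` is what yields the 70% memory reduction mentioned above; the caller already holds the original string, and the normalized one can be recomputed on demand.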
# v0.6.0