Update CHANGELOGs
@@ -29,6 +29,7 @@ constructors.
This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
whitespaces are part of the actual token.
It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- More alignment mappings on the `Encoding`.
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
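
As a quick illustration of the `trim_offsets` option on `ByteLevelBPETokenizer` mentioned in the hunk above, here is a minimal sketch using the Python bindings. The tiny corpus, file name, and training settings are made up for this example and are not part of the changelog.

```python
# Minimal sketch of the `trim_offsets` option described above (off by default).
# The corpus file and training parameters below are illustrative only.
from tokenizers import ByteLevelBPETokenizer

with open("tiny_corpus.txt", "w", encoding="utf-8") as f:
    f.write("hello world\nhello there world\n")

def build(trim: bool) -> ByteLevelBPETokenizer:
    tok = ByteLevelBPETokenizer(trim_offsets=trim)
    tok.train(["tiny_corpus.txt"], vocab_size=300, min_frequency=1)
    return tok

text = "hello world"
# With trim_offsets=True the offsets reported for " world" should exclude the
# leading whitespace; with the default (False) the space stays in the span.
print(build(True).encode(text).offsets)
print(build(False).encode(text).offsets)
```
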
@@ -157,7 +158,9 @@ delimiter (Works like `.split(delimiter)`)
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5

[#236]: https://github.com/huggingface/tokenizers/pull/236
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205
[#197]: https://github.com/huggingface/tokenizers/pull/197
[#193]: https://github.com/huggingface/tokenizers/pull/193
[#190]: https://github.com/huggingface/tokenizers/pull/190
@@ -4,6 +4,31 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Fixed
- [#236]: Fix a bug with offsets being shifted when there are sub-sequences (usually with
  special tokens and/or added tokens in the sequence).

### Changed
- [#236]: `AddedToken` with special options like `rstrip` will keep the matched whitespaces
  in the textual representation of the token, exposed in `tokens` on the `Encoding`. The ID stays
  the same as usual. This fixes the offsets for these tokens.
- [#236]: Offsets are now converted back to the original string (rather than the normalized one)
  before we merge the sub-sequences together and do the post-processing. This also fixes some
  offset bugs.
- [#236]: The ByteLevel PostProcessor now uses the `add_prefix_space` attribute to determine how
  to trim offsets.

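To make the `AddedToken` entry above concrete, here is a small sketch. The `WordLevel` model, its toy vocabulary, and the `<mask>` token string are invented for illustration; only the `rstrip` behaviour comes from the changelog entry.

```python
# Sketch of the `AddedToken` behaviour described above: with rstrip=True the
# matched trailing whitespace shows up in `tokens`, while the token keeps its
# usual ID. The toy vocabulary below is illustrative only.
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.add_special_tokens([AddedToken("<mask>", rstrip=True)])

encoding = tokenizer.encode("hello <mask> world")
print(encoding.tokens)   # textual form of <mask> may include the matched space
print(encoding.ids)      # the ID of <mask> is unchanged
print(encoding.offsets)  # character offsets into the original string
```
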
### Added
- [#236]: `RobertaProcessing` now also takes care of trimming offsets and works just like
  ByteLevel on this front.

### How to migrate
- Specify the `add_prefix_space` and `trim_offsets` options on `RobertaProcessing` if you don't
  want the offsets trimmed out.
- Any custom `PostProcessor` now handles offsets relative to the original string (as opposed to
  the normalized one).

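A minimal sketch of the migration note above, using `tokenizers.processors`; the separator/class token strings and IDs are placeholders, not values taken from the changelog.

```python
# Sketch of the migration note above. RobertaProcessing now trims offsets by
# default, like ByteLevel; pass trim_offsets=False to keep the old behaviour.
# The sep/cls strings and IDs below are placeholders for illustration.
from tokenizers.processors import RobertaProcessing

post_processor = RobertaProcessing(
    sep=("</s>", 2),
    cls=("<s>", 0),
    trim_offsets=False,     # keep the whitespace inside the reported offsets
    add_prefix_space=True,  # should mirror the pre-tokenizer's add_prefix_space
)

# Attach it to an existing Tokenizer instance:
# tokenizer.post_processor = post_processor
```
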
## [0.10.1]

### Fixed
@@ -71,6 +96,8 @@ split up in multiple bytes
- [#174]: The `LongestFirst` truncation strategy had a bug

[b770f36]: https://github.com/huggingface/tokenizers/commit/b770f364280af33efeffea8f0003102cda8cf1b7
[#236]: https://github.com/huggingface/tokenizers/pull/236
[#226]: https://github.com/huggingface/tokenizers/pull/226
[#222]: https://github.com/huggingface/tokenizers/pull/222
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205