Update CHANGELOGs

Anthony MOI
2020-04-16 10:29:36 -04:00
parent 95d4ee18f7
commit c96c4d95bd
2 changed files with 30 additions and 0 deletions


@@ -29,6 +29,7 @@ constructors.
This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
whitespaces are part of the actual token.
It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets (see the sketch after this list).
- More alignment mappings on the `Encoding`.
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
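A minimal sketch of how these options fit together in the Python bindings (the vocab/merges paths and the sep/cls token IDs are placeholders, and constructor signatures can vary slightly between versions):

```python
# Sketch only: file paths and token IDs are placeholders.
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import RobertaProcessing

# Opt in to offset trimming so surrounding whitespace is excluded from the
# produced offsets (it is off by default: trim_offsets=False).
tokenizer = ByteLevelBPETokenizer(
    "vocab.json",
    "merges.txt",
    add_prefix_space=True,
    trim_offsets=True,
)

# As of [#236], RobertaProcessing can trim the offsets as well.
roberta_processor = RobertaProcessing(
    ("</s>", 2),   # sep token and its id
    ("<s>", 0),    # cls token and its id
    trim_offsets=True,
)
```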
@@ -157,7 +158,9 @@ delimiter (Works like `.split(delimiter)`)
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5
[#236]: https://github.com/huggingface/tokenizers/pull/236
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205
[#197]: https://github.com/huggingface/tokenizers/pull/197
[#193]: https://github.com/huggingface/tokenizers/pull/193
[#190]: https://github.com/huggingface/tokenizers/pull/190


@@ -4,6 +4,31 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Fixed
- [#236]: Fix a bug with offsets being shifted when there are sub-sequences (usually with
special tokens and/or added tokens in the sequence).
### Changed
- [#236]: `AddedToken` with special options like `rstrip` will keep the matched whitespaces
in the textual representation of the token, exposed in `tokens` on the `Encoding`. The ID stays
the same as usual. This fixes the offsets for said tokens.
- [#236]: Offsets are now converted back to the referential of the original string before we merge
the sub-sequences together and do the post-processing. This also fixes some offset bugs.
- [#236]: The `ByteLevel` post-processor now uses the `add_prefix_space` attribute to determine how to
trim offsets (see the sketch after this list).
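A small sketch of the `AddedToken` behavior described in the list above, assuming the Python bindings' `Tokenizer`/`BPE`/`AddedToken` API (the expected output in the comments follows this entry's description rather than a verified run):

```python
# Sketch only: uses an empty BPE model so the added token is the whole input.
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# rstrip=True lets the added token swallow the whitespace that follows it.
# Per this change, that whitespace stays in the token's textual
# representation (Encoding.tokens); the ID and the offsets stay consistent.
tokenizer.add_tokens([AddedToken("<mask>", rstrip=True)])

encoding = tokenizer.encode("<mask> ")
print(encoding.tokens)   # expected per this entry: ['<mask> ']
print(encoding.offsets)  # offsets now line up with the token text
```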
### Added
- [#236]: `RobertaProcessing` now also takes care of trimming offsets, and works just like `ByteLevel`
on this front.
### How to migrate
- Specify the `add_prefix_space` and `trim_offsets` options on `RobertaProcessing` if you don't
want the offsets trimmed (see the sketch below).
- Any custom `PostProcessor` now handles offsets relative to the original string (as opposed to the
normalized one).
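A rough migration sketch (Python bindings; the sep/cls token-to-ID pairs are placeholders for your actual vocabulary):

```python
# Sketch only: keep the previous, untrimmed offsets by disabling trim_offsets.
from tokenizers.processors import RobertaProcessing

post_processor = RobertaProcessing(
    ("</s>", 2),             # sep token and its id
    ("<s>", 0),              # cls token and its id
    trim_offsets=False,      # keep the whitespace inside the produced offsets
    add_prefix_space=True,   # should mirror the ByteLevel pre-tokenizer setting
)
# tokenizer.post_processor = post_processor
```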
## [0.10.1]
### Fixed
@@ -71,6 +96,8 @@ split up in multiple bytes
- [#174]: The `LongestFirst` truncation strategy had a bug
[b770f36]: https://github.com/huggingface/tokenizers/commit/b770f364280af33efeffea8f0003102cda8cf1b7
[#236]: https://github.com/huggingface/tokenizers/pull/236
[#226]: https://github.com/huggingface/tokenizers/pull/226
[#222]: https://github.com/huggingface/tokenizers/pull/222
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205