mirror of https://github.com/mii443/tokenizers.git
synced 2025-12-09 22:28:29 +00:00

Update CHANGELOGs before releases
@@ -4,22 +4,7 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unrelease]
-
-### [Changed]
-- [#136] Updated Pyo3 version
-
-### [Added]
-- [#136] Models can now be instantiated through object constructors.
-
-### [Removed]
-- [#136] Static methods `Model.from_files` and `Model.empty` are removed in favor of using
-constructors.
-
-### [How to migrate]
-- Change `Model.from_files` and `Model.empty` to use constructor. The model constructor should take
-the same arguments as the old methods.
-
-## [0.7.0-rc3]
+## [0.7.0-rc4]
 
 ### Changed
 - Only one progress bar while reading files during training. This is better for use-cases with
@@ -35,6 +20,9 @@ normalized one anymore.
 - The added token given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
 `train(special_tokens=...)` can now be instances of `AddedToken` to provide more control over these
 tokens.
+- [#136] Updated Pyo3 version
+- [#136] Static methods `Model.from_files` and `Model.empty` are removed in favor of using
+constructors.
 
 ### Added
 - [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
@@ -45,6 +33,7 @@ It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
 - `post_process` can be called on the `Tokenizer`
 - [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
 `get_vocab(with_added_tokens: bool)`
+- [#136] Models can now be instantiated through object constructors.
 
 ### Fixed
 - [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
@@ -66,6 +55,8 @@ of `encode` so it didn't make sense to keep it here.
 are now relative to the original string by default.
 - Access to the `normalized_str` on the `Encoding` has been removed. Can be retrieved by calling
 `normalize(sequence)` on the `Tokenizer`
+- Change `Model.from_files` and `Model.empty` to use constructor. The model constructor should take
+the same arguments as the old methods. (ie `BPE(vocab, merges)` or `BPE()`)
 
 ## [0.6.0]

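The migration note above ("the model constructor should take the same arguments as the old methods") can be sketched as follows. This is a minimal illustrative stand-in, not the actual `tokenizers` implementation; the real `BPE` model lives in the library itself and does more than store its arguments.

```python
# Illustrative stand-in for the constructor-based API described in the
# migration note above; NOT the actual `tokenizers` implementation.
class BPE:
    def __init__(self, vocab=None, merges=None):
        # With no arguments this is an empty model, replacing the old
        # `Model.empty()`; with files it replaces `Model.from_files(...)`.
        self.vocab = vocab
        self.merges = merges

# Old (removed):  model = BPE.from_files("vocab.json", "merges.txt")
#                 empty = BPE.empty()
# New: call the constructor with the same arguments
model = BPE("vocab.json", "merges.txt")
empty = BPE()
```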
@@ -4,6 +4,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.10.0]
+
+### Changed
+- [#222]: All Tokenizer's subparts must now be `Send + Sync`
+
+### Added
+- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` & `Model`
+
+### Fixed
+- [#205]: Trim the decoded string in `BPEDecoder`
+- [b770f36]: Fix a bug with added tokens generated IDs
+
 ## [0.9.0]
 
 ### Changed
@@ -30,7 +42,6 @@ the unintuitive inclusion of the whitespaces in the produced offsets, even if they are not
 part of the actual token
 - More alignment mappings on the `Encoding`.
 - `post_process` can be called on the `Tokenizer`
-- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` & `Model`
 
 ### Fixed
 - [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
@@ -39,7 +50,6 @@ part of the actual token
 - Fix a bug where offsets were wrong when there was any added tokens in the sequence being encoded.
 - [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not
 advised, but that's not the question)
-- [#205]: Trim the decoded string in `BPEDecoder`
 
 ### How to migrate
 - Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant.
@@ -55,6 +65,8 @@ advised, but that's not the question)
 split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug
 
+[b770f36]: https://github.com/huggingface/tokenizers/commit/b770f364280af33efeffea8f0003102cda8cf1b7
+[#222]: https://github.com/huggingface/tokenizers/pull/222
 [#208]: https://github.com/huggingface/tokenizers/pull/208
 [#205]: https://github.com/huggingface/tokenizers/issues/205
 [#197]: https://github.com/huggingface/tokenizers/pull/197
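The `get_vocab(with_added_tokens)` entry added above can be modeled with plain dictionaries. The helper `get_vocab_sketch` below is hypothetical, written only to illustrate the described behavior (base model vocabulary, optionally merged with the added tokens); it is not the library's implementation.

```python
# Hypothetical sketch of what `get_vocab(with_added_tokens: bool)` is
# described as returning in the changelog entries above; NOT the actual
# `tokenizers` implementation.
def get_vocab_sketch(model_vocab, added_tokens, with_added_tokens=True):
    vocab = dict(model_vocab)
    if with_added_tokens:
        # Added tokens receive IDs on top of the base vocabulary
        vocab.update(added_tokens)
    return vocab

base = {"hello": 0, "world": 1}
added = {"<pad>": 2, "<cls>": 3}
full = get_vocab_sketch(base, added, with_added_tokens=True)
```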