Python - Update CHANGELOG
@@ -1,3 +1,7 @@
+# v0.6.0 (not published yet)
+
+Fixes:
+- Some default tokens were missing from `BertWordPieceTokenizer` (cf [#160](https://github.com/huggingface/tokenizers/issues/160))
 # v0.5.2
 
 ## Fixes:
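To make the fix above concrete, here is a minimal sketch of the restored defaults in use; the vocabulary path and the exact set of default tokens are assumptions for illustration, not taken from this commit.

```python
from tokenizers import BertWordPieceTokenizer

# "bert-vocab.txt" is a placeholder path. With the fix referenced in #160,
# the usual BERT special tokens ([UNK], [SEP], [CLS], [PAD], [MASK]) should
# all be registered on the tokenizer by default.
tokenizer = BertWordPieceTokenizer("bert-vocab.txt")

# [PAD] and [MASK] now resolve to ids instead of being missing.
print(tokenizer.token_to_id("[PAD]"), tokenizer.token_to_id("[MASK]"))
```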
@@ -14,16 +18,16 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 # v0.5.0
 
 ## Changes:
-- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
-- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
+- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf [#145](https://github.com/huggingface/tokenizers/issues/145))
+- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with [#149](https://github.com/huggingface/tokenizers/issues/149))
 - Added a new `Strip` normalizer
 - `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers. (Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
-- Expose `__len__` on `Encoding` (cf #139)
+- Expose `__len__` on `Encoding` (cf [#139](https://github.com/huggingface/tokenizers/issues/139))
 - Improved padding performances.
 
 ## Fixes:
-- #145: Decoding was buggy on `BertWordPieceTokenizer`.
-- #152: Some documentation and examples were still using the old `BPETokenizer`
+- [#145](https://github.com/huggingface/tokenizers/issues/145): Decoding was buggy on `BertWordPieceTokenizer`.
+- [#152](https://github.com/huggingface/tokenizers/issues/152): Some documentation and examples were still using the old `BPETokenizer`
 
 ## How to migrate:
 - Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of `do_lowercase`.
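The v0.5.0 migration above amounts to a keyword rename; here is a minimal sketch assuming the `ByteLevelBPETokenizer` constructor of that release, with placeholder vocabulary and merges files. The `dropout` keyword and `len()` support come from the bullets above.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder vocab/merges files; `lowercase` replaces the old
# `do_lowercase` argument, and `dropout` is the new optional BPE dropout.
tokenizer = ByteLevelBPETokenizer(
    "vocab.json",
    "merges.txt",
    lowercase=True,   # was do_lowercase=True before v0.5.0
    dropout=0.1,      # optional, added in #149
)

encoding = tokenizer.encode("Hello, world!")
print(len(encoding))  # Encoding now supports __len__ (#139)
```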
@@ -31,17 +35,17 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 # v0.4.2
 
 ## Fixes:
-- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf #137)
+- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf [#137](https://github.com/huggingface/tokenizers/issues/137))
 
 # v0.4.1
 
 ## Fixes:
-- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
+- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with [#134](https://github.com/huggingface/tokenizers/issues/134))
 
 # v0.4.0
 
 ## Changes:
-- Replaced all .new() class methods by a proper __new__ implementation. (Huge thanks to @ljos with #131)
+- Replaced all .new() class methods by a proper __new__ implementation. (Huge thanks to @ljos with [#131](https://github.com/huggingface/tokenizers/issues/131))
 - Improved typings
 
 ## How to migrate:
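The full migration text for the `__new__` change above is elided from this hunk; below is a minimal before/after sketch, assuming the era's `models.BPE.from_files` helper and placeholder file paths. The elided "How to migrate" section in the changelog remains the authoritative version.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Before v0.4.0 (assumed form): tokenizer = Tokenizer.new(...)
# From v0.4.0 on, classes are instantiated directly thanks to __new__:
tokenizer = Tokenizer(BPE.from_files("vocab.json", "merges.txt"))
```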
@@ -59,7 +63,7 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 output = tokenizer.encode(...)
 print(output.original_str.offsets(output.offsets[3]))
 ```
-- Exposed the vocabulary size on all tokenizers: https://github.com/huggingface/tokenizers/pull/99 by @kdexd
+- Exposed the vocabulary size on all tokenizers: [#99](https://github.com/huggingface/tokenizers/pull/99) by @kdexd
 
 ## Fixes:
 - Fix a bug with IndexableString
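As a usage note for the vocabulary-size bullet above, a minimal sketch; the `get_vocab_size()` accessor name and the placeholder files are assumptions, since the diff only links the pull request.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder files; get_vocab_size() is assumed to be the accessor
# exposed by the pull request linked above.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
print(tokenizer.get_vocab_size())
```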