Python - Update CHANGELOG

Anthony MOI
2020-02-26 09:31:17 -05:00
parent 61b4c9c30a
commit 2425fe877d


@@ -1,3 +1,7 @@
+# v0.6.0 (not published yet)
+## Fixes:
+- Some default tokens were missing from `BertWordPieceTokenizer` (cf [#160](https://github.com/huggingface/tokenizers/issues/160))
 # v0.5.2
 ## Fixes:
@@ -14,16 +18,16 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 # v0.5.0
 ## Changes:
-- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
-- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
+- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf [#145](https://github.com/huggingface/tokenizers/issues/145))
+- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with [#149](https://github.com/huggingface/tokenizers/issues/149))
 - Added a new `Strip` normalizer
 - `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers. (Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
-- Expose `__len__` on `Encoding` (cf #139)
+- Expose `__len__` on `Encoding` (cf [#139](https://github.com/huggingface/tokenizers/issues/139))
 - Improved padding performances.
 ## Fixes:
-- #145: Decoding was buggy on `BertWordPieceTokenizer`.
-- #152: Some documentation and examples were still using the old `BPETokenizer`
+- [#145](https://github.com/huggingface/tokenizers/issues/145): Decoding was buggy on `BertWordPieceTokenizer`.
+- [#152](https://github.com/huggingface/tokenizers/issues/152): Some documentation and examples were still using the old `BPETokenizer`
 ## How to migrate:
 - Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of `do_lowercase`.
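The v0.5.0 entries above cover the `do_lowercase` → `lowercase` rename and the new `dropout` argument. A minimal sketch of what initialization looks like after that change, assuming local `vocab.json`/`merges.txt` files and an arbitrary dropout value (neither appears in the changelog itself):

```python
from tokenizers import ByteLevelBPETokenizer, CharBPETokenizer

# The keyword argument is now `lowercase` (formerly `do_lowercase`).
bpe = ByteLevelBPETokenizer(
    "vocab.json", "merges.txt",  # placeholder vocabulary files
    lowercase=True,              # was `do_lowercase=True` before v0.5.0
    dropout=0.1,                 # BPE dropout, added in v0.5.0 (#149)
)

char_bpe = CharBPETokenizer("vocab.json", "merges.txt", lowercase=True)
```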
@@ -31,17 +35,17 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 # v0.4.2
 ## Fixes:
-- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf #137)
+- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf [#137](https://github.com/huggingface/tokenizers/issues/137))
 # v0.4.1
 ## Fixes:
-- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
+- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with [#134](https://github.com/huggingface/tokenizers/issues/134))
 # v0.4.0
 ## Changes:
-- Replaced all .new() class methods by a proper __new__ implementation. (Huge thanks to @ljos with #131)
+- Replaced all .new() class methods by a proper __new__ implementation. (Huge thanks to @ljos with [#131](https://github.com/huggingface/tokenizers/issues/131))
 - Improved typings
 ## How to migrate:
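For the v0.4.0 removal of the `.new()` class methods listed above, a hedged sketch of the migration: calls of the form `X.new(...)` become plain constructor calls `X(...)`. The specific classes and argument values below are illustrative, not quoted from the changelog:

```python
from tokenizers import pre_tokenizers, decoders, trainers

# Before v0.4.0 (no longer works):
#   pre_tok = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
#   trainer = trainers.BpeTrainer.new(vocab_size=30000)

# From v0.4.0 on, the classes are constructed directly:
pre_tok = pre_tokenizers.ByteLevel(add_prefix_space=True)
decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=30000)
```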
@@ -59,7 +63,7 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 output = tokenizer.encode(...)
 print(output.original_str.offsets(output.offsets[3]))
 ```
-- Exposed the vocabulary size on all tokenizers: https://github.com/huggingface/tokenizers/pull/99 by @kdexd
+- Exposed the vocabulary size on all tokenizers: [#99](https://github.com/huggingface/tokenizers/pull/99) by @kdexd
 ## Fixes:
 - Fix a bug with IndexableString
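The hunk above also links the change that exposed the vocabulary size on all tokenizers (#99). A small usage sketch, assuming the `get_vocab_size` accessor name used by current versions of the library and placeholder vocabulary files:

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder files; any trained tokenizer works the same way.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
print(tokenizer.get_vocab_size())  # number of tokens in the learned vocabulary
```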