Python - Update CHANGELOG
@@ -1,3 +1,7 @@
+# v0.6.0 (not published yet)
+
+Fixes:
+- Some default tokens were missing from `BertWordPieceTokenizer` (cf [#160](https://github.com/huggingface/tokenizers/issues/160))
 # v0.5.2
 
 ## Fixes:
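To make the fix above concrete, here is a minimal sketch of the restored defaults in use; the vocabulary path and the exact set of default tokens are assumptions for illustration, not taken from this commit.

```python
from tokenizers import BertWordPieceTokenizer

# "bert-vocab.txt" is a placeholder path. With the fix referenced in #160,
# the usual BERT special tokens ([UNK], [SEP], [CLS], [PAD], [MASK]) should
# all be registered on the tokenizer by default.
tokenizer = BertWordPieceTokenizer("bert-vocab.txt")

# [PAD] and [MASK] now resolve to ids instead of being missing.
print(tokenizer.token_to_id("[PAD]"), tokenizer.token_to_id("[MASK]"))
```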
@@ -14,16 +18,16 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 # v0.5.0
 
 ## Changes:
-- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
-- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
+- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf [#145](https://github.com/huggingface/tokenizers/issues/145))
+- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with [#149](https://github.com/huggingface/tokenizers/issues/149))
 - Added a new `Strip` normalizer
 - `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers. (Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
-- Expose `__len__` on `Encoding` (cf #139)
+- Expose `__len__` on `Encoding` (cf [#139](https://github.com/huggingface/tokenizers/issues/139))
 - Improved padding performances.
 
 ## Fixes:
-- #145: Decoding was buggy on `BertWordPieceTokenizer`.
-- #152: Some documentation and examples were still using the old `BPETokenizer`
+- [#145](https://github.com/huggingface/tokenizers/issues/145): Decoding was buggy on `BertWordPieceTokenizer`.
+- [#152](https://github.com/huggingface/tokenizers/issues/152): Some documentation and examples were still using the old `BPETokenizer`
 
 ## How to migrate:
 - Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of `do_lowercase`.
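The v0.5.0 migration above amounts to a keyword rename; here is a minimal sketch assuming the `ByteLevelBPETokenizer` constructor of that release, with placeholder vocabulary and merges files. The `dropout` keyword and `len()` support come from the bullets above.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder vocab/merges files; `lowercase` replaces the old
# `do_lowercase` argument, and `dropout` is the new optional BPE dropout.
tokenizer = ByteLevelBPETokenizer(
    "vocab.json",
    "merges.txt",
    lowercase=True,   # was do_lowercase=True before v0.5.0
    dropout=0.1,      # optional, added in #149
)

encoding = tokenizer.encode("Hello, world!")
print(len(encoding))  # Encoding now supports __len__ (#139)
```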
@@ -31,17 +35,17 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 # v0.4.2
 
 ## Fixes:
-- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf #137)
+- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf [#137](https://github.com/huggingface/tokenizers/issues/137))
 
 # v0.4.1
 
 ## Fixes:
-- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
+- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with [#134](https://github.com/huggingface/tokenizers/issues/134))
 
 # v0.4.0
 
 ## Changes:
-- Replaced all .new() class methods by a proper __new__ implementation. (Huge thanks to @ljos with #131)
+- Replaced all .new() class methods by a proper __new__ implementation. (Huge thanks to @ljos with [#131](https://github.com/huggingface/tokenizers/issues/131))
 - Improved typings
 
 ## How to migrate:
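The full migration text for the `__new__` change above is elided from this hunk; below is a minimal before/after sketch, assuming the era's `models.BPE.from_files` helper and placeholder file paths. The elided "How to migrate" section in the changelog remains the authoritative version.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Before v0.4.0 (assumed form): tokenizer = Tokenizer.new(...)
# From v0.4.0 on, classes are instantiated directly thanks to __new__:
tokenizer = Tokenizer(BPE.from_files("vocab.json", "merges.txt"))
```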
@@ -59,7 +63,7 @@ the files get a more generic naming, like `vocab.json` or `merges.txt`.
 output = tokenizer.encode(...)
 print(output.original_str.offsets(output.offsets[3]))
 ```
-- Exposed the vocabulary size on all tokenizers: https://github.com/huggingface/tokenizers/pull/99 by @kdexd
+- Exposed the vocabulary size on all tokenizers: [#99](https://github.com/huggingface/tokenizers/pull/99) by @kdexd
 
 ## Fixes:
 - Fix a bug with IndexableString
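As a usage note for the vocabulary-size bullet above, a minimal sketch; the `get_vocab_size()` accessor name and the placeholder files are assumptions, since the diff only links the pull request.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder files; get_vocab_size() is assumed to be the accessor
# exposed by the pull request linked above.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
print(tokenizer.get_vocab_size())
```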