Update CHANGELOGs
@@ -28,6 +28,8 @@ whitespaces are part of the actual token.
   It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
 - More alignment mappings on the `Encoding`.
 - `post_process` can be called on the `Tokenizer`
+- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
+  `get_vocab(with_added_tokens: bool)`
 
 ### Fixed
 - [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
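The hunk above (apparently from the Python bindings changelog) adds `get_vocab(with_added_tokens: bool)` next to the existing `trim_offsets` flag on `ByteLevelBPETokenizer`. A minimal sketch of how the two might be used together; the untrained tokenizer and the `<custom>` token are assumptions for illustration, not from the commit:

```python
from tokenizers import ByteLevelBPETokenizer

# trim_offsets=False (the default) keeps the leading whitespace that is
# part of a byte-level token inside the reported offsets.
tokenizer = ByteLevelBPETokenizer(trim_offsets=False)

# "<custom>" is a hypothetical added token, used only to show the flag.
tokenizer.add_tokens(["<custom>"])

with_added = tokenizer.get_vocab(with_added_tokens=True)
without_added = tokenizer.get_vocab(with_added_tokens=False)
print("<custom>" in with_added)      # True
print("<custom>" in without_added)   # False
```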
@@ -148,6 +150,7 @@ delimiter (Works like `.split(delimiter)`)
 - Fix a bug with the IDs associated with added tokens.
 - Fix a bug that was causing crashes in Python 3.5
 
+[#208]: https://github.com/huggingface/tokenizers/pull/208
 [#197]: https://github.com/huggingface/tokenizers/pull/197
 [#193]: https://github.com/huggingface/tokenizers/pull/193
 [#190]: https://github.com/huggingface/tokenizers/pull/190
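The `Fixed` entries in this hunk are one-liners, but the added-token ID fix is easy to sanity-check. A small sketch (the token names and the untrained tokenizer are assumptions) verifying that added-token IDs round-trip:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.add_tokens(["<special-1>", "<special-2>"])

# Each added token should resolve to a stable ID, and that ID should
# map back to the same token.
for token in ["<special-1>", "<special-2>"]:
    token_id = tokenizer.token_to_id(token)
    assert tokenizer.id_to_token(token_id) == token
```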
@@ -30,6 +30,7 @@ the unintuitive inclusion of the whitespaces in the produced offsets, even if th
 part of the actual token
 - More alignment mappings on the `Encoding`.
 - `post_process` can be called on the `Tokenizer`
+- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` & `Model`
 
 ### Fixed
 - [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
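This hunk appears to target the Rust library's changelog, but the same `post_process` entry point is exposed through the Python bindings. A minimal sketch of calling it directly on the `Tokenizer`; the hand-built toy vocabulary and its IDs are assumptions:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import BertProcessing

# Toy vocabulary; a real setup would load a trained model instead.
vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2, "world": 3, "[UNK]": 4}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.post_processor = BertProcessing(("[SEP]", 1), ("[CLS]", 0))

# Encode without special tokens, then apply post-processing explicitly.
encoding = tokenizer.encode("hello world", add_special_tokens=False)
processed = tokenizer.post_process(encoding)
print(processed.tokens)  # ['[CLS]', 'hello', 'world', '[SEP]']
```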
@@ -53,6 +54,7 @@ advised, but that's not the question)
 split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug
 
+[#208]: https://github.com/huggingface/tokenizers/pull/208
 [#197]: https://github.com/huggingface/tokenizers/pull/197
 [#193]: https://github.com/huggingface/tokenizers/pull/193
 [#190]: https://github.com/huggingface/tokenizers/pull/190
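The `LongestFirst` strategy named in this hunk trims whichever of the two sequences is currently longer, one token at a time, so neither side gets emptied out. A sketch of that behavior through the Python bindings, using a toy word-level vocab (an assumption, not from the changelog):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"a": 0, "b": 1, "[UNK]": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# longest_first repeatedly drops a token from the longer sequence until
# the pair fits within max_length.
tokenizer.enable_truncation(max_length=4, strategy="longest_first")

encoding = tokenizer.encode("a a a a a", "b b")
print(encoding.tokens)  # ['a', 'a', 'b', 'b']
```

Here the five-token first sequence is trimmed down to two tokens before the two-token pair is touched, which is exactly the balancing the strategy is meant to provide.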