Update CHANGELOGs
@@ -6,18 +6,28 @@ a high number of files as it avoids having too many progress bars on screen.
 - `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated. This
   avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
   whitespaces are part of the actual token.
-  It has been added to `ByteLevelBPETokenizer` and but it is off by default (`trim_offsets=False`).
-- `encode` and `encode_batch` no take a new optional argument, specifying whether we should add the
-  special tokens. This stays activated by default.
+  It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
+  ([#188](https://github.com/huggingface/tokenizers/pull/188))
+- `encode` and `encode_batch` now take a new optional argument, specifying whether we should add the
+  special tokens. This is activated by default. ([#193](https://github.com/huggingface/tokenizers/pull/193))
+- `original_str` and `normalized_str` have been removed from the `Encoding` returned by `encode` and
+  `encode_batch`. This brings a reduction of 70% of the memory footprint.
+  ([#197](https://github.com/huggingface/tokenizers/pull/197))
 
 ## Fixes:
-- Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
+- Fix some issues with the offsets being wrong with the `ByteLevel` BPE ([#193](https://github.com/huggingface/tokenizers/pull/193)):
   - when `add_prefix_space=True`
   - when a Unicode character gets split up into multiple byte-level characters ([#156](https://github.com/huggingface/tokenizers/issues/156))
 
 ## How to migrate:
 - Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
   using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
+- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
+  of `encode`, so it didn't make sense to keep it here.
+- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string: they
+  are now relative to the original string by default.
+- Access to the `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling
+  `normalize(sequence)` on the `Tokenizer`.
 
 # v0.6.0
 
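To make the Python-binding changes in the hunk above concrete, here is a minimal sketch of v0.7-era usage. It assumes the new `encode`/`encode_batch` flag is named `add_special_tokens` (the entry does not name it), that `trim_offsets` is passed to the `ByteLevelBPETokenizer` constructor, and that `vocab.json`/`merges.txt` are placeholder paths for an already-trained model; exact signatures may differ between versions.

```python
# Hedged sketch of the Python-binding changes; file paths and the
# `add_special_tokens` argument name are assumptions, not taken from the entry.
from tokenizers import ByteLevelBPETokenizer

# `trim_offsets` is the new ByteLevel post-processing option; it defaults to False.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt", trim_offsets=True)

# encode/encode_batch now expose a flag controlling special-token insertion (on by default).
encoding = tokenizer.encode("Hello world", add_special_tokens=True)
batch = tokenizer.encode_batch(["Hello world", "How are you?"], add_special_tokens=False)

# `original_str`/`normalized_str` no longer live on the Encoding; the offsets are
# already relative to the original input, so it can be sliced directly.
text = "Hello world"
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, repr(text[start:end]))

# For a hand-built Tokenizer, the migration note above means attaching the ByteLevel
# post-processor yourself (tokenizers.processors.ByteLevel(trim_offsets=True) in the
# current bindings; the exact v0.7-era path is an assumption).
# The normalized form, if needed, comes from the tokenizer itself
# (`normalize(sequence)`, per the migration notes) rather than from the Encoding.
```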
@@ -6,9 +6,16 @@ a high number of files as it avoids having too many progress bars on screen.
 - Improve BPE and WordPiece builders.
 - `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated. This
   avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
-  whitespaces are part of the actual token.
+  whitespaces are part of the actual token. ([#188](https://github.com/huggingface/tokenizers/pull/188))
 - `encode` and `encode_batch` now take a new argument, specifying whether we should add the
-  special tokens.
+  special tokens. ([#193](https://github.com/huggingface/tokenizers/pull/193))
+- The `NormalizedString` has been removed from the `Encoding`. It is now possible to retrieve it
+  by calling `normalized` on the `Tokenizer`. This brings a reduction of 70% of the memory footprint.
+  ([#197](https://github.com/huggingface/tokenizers/pull/197))
+- The `NormalizedString` API has been improved. It is now possible to retrieve parts of both strings
+  using either "normalized" or "original" offsets. ([#197](https://github.com/huggingface/tokenizers/pull/197))
+- The offsets provided on `Encoding` are now relative to the original string, and not the normalized
+  one anymore. ([#197](https://github.com/huggingface/tokenizers/pull/197))
 
 ## Fixes:
 - Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
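The core-library entries in the hunk above describe the same offset change from the other side: the `NormalizedString` no longer travels with the `Encoding`, and offsets point into the original string rather than the normalized one. Below is a hedged illustration written against the current Python bindings; the tiny word-level vocabulary, class names, and constructor arguments are assumptions for demonstration and may not match the v0.7-era API exactly.

```python
# Hedged sketch: offsets on an Encoding index the original string, not the
# normalized one, even when normalization rewrites the text.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents
from tokenizers.pre_tokenizers import Whitespace

# Tiny vocabulary purely for demonstration (an assumption, not from the changelog).
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), StripAccents(), Lowercase()])
tokenizer.pre_tokenizer = Whitespace()

text = "Héllo World"
enc = tokenizer.encode(text)

# The model matched the normalized "hello world", but each (start, end) pair
# refers to the original `text`, so slicing recovers "Héllo" and "World"
# with the current bindings.
for token, (start, end) in zip(enc.tokens, enc.offsets):
    print(token, text[start:end])
```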