Mirror of https://github.com/mii443/tokenizers.git (synced 2025-12-09 06:08:22 +00:00)
Fix stripping strings containing Unicode characters (#707)
* Strip seems to have been broken for a while on Unicode strings. Includes a failing test plus the fix. This function could maybe be optimized: we're scanning the string three times now, and once fully for chars.

* Update CHANGELOG.md

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
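The underlying pitfall is that Rust string indices are byte offsets, so measuring how much was stripped with byte lengths over-reports the width whenever the text contains multi-byte Unicode characters. A minimal sketch of a char-aware strip (this is an illustrative standalone function, not the crate's actual `Strip` normalizer code; the name `strip_counts` is hypothetical):

```rust
// Hypothetical sketch: report how many *chars* (not bytes) are stripped
// from each end, so downstream offsets stay valid for multi-byte text.
fn strip_counts(s: &str) -> (usize, usize, &str) {
    // Count leading whitespace in chars.
    let leading = s.chars().take_while(|c| c.is_whitespace()).count();
    // Count trailing whitespace in chars (note: double-counts if the
    // string is all whitespace; a real implementation would guard that).
    let trailing = s.chars().rev().take_while(|c| c.is_whitespace()).count();
    (leading, trailing, s.trim())
}

fn main() {
    // "é" is 2 bytes but 1 char; byte-based counting would get this wrong.
    let (lead, trail, stripped) = strip_counts("  é test  ");
    assert_eq!(lead, 2);
    assert_eq!(trail, 2);
    assert_eq!(stripped, "é test");
    println!("ok");
}
```

As the commit message notes, this shape scans the string multiple times; a single forward pass tracking first/last non-whitespace char positions would avoid that.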
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- [#686]: Fix SPM conversion process for whitespace deduplication
- [#707]: Fix stripping strings containing Unicode characters

### Added

- [#693]: Add a CTC Decoder for Wave2Vec models
@@ -317,6 +318,7 @@ delimiter (Works like `.split(delimiter)`)

- Fix a bug that was causing crashes in Python 3.5

[#707]: https://github.com/huggingface/tokenizers/pull/707
[#693]: https://github.com/huggingface/tokenizers/pull/693
[#686]: https://github.com/huggingface/tokenizers/pull/686
[#674]: https://github.com/huggingface/tokenizers/pull/674