From f8b1630aa6ea85973f47e2a5d76db3397b0c8c87 Mon Sep 17 00:00:00 2001
From: Anthony MOI
Date: Tue, 23 Jun 2020 13:32:21 -0400
Subject: [PATCH] Update CHANGELOGs

---
 bindings/python/CHANGELOG.md | 6 ++++++
 tokenizers/CHANGELOG.md      | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/bindings/python/CHANGELOG.md b/bindings/python/CHANGELOG.md
index 7cfa5f3f..78ff53cb 100644
--- a/bindings/python/CHANGELOG.md
+++ b/bindings/python/CHANGELOG.md
@@ -18,6 +18,10 @@ This adds some methods to easily save/load an entire tokenizer (`from_str`, `fro
 activation of the Tensor Cores, while ensuring padding to a multiple of 8. Use with
 `enable_padding(pad_to_multiple_of=8)` for example.
 - [#298]: Ability to get the currently set truncation/padding params
+- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
+variable. This is especially useful when using `multiprocessing` capabilities, with the `fork`
+start method, which happens to be the default on Linux systems. Without disabling the parallelism,
+the process deadlocks while encoding. (Cf. [#187] for more information)
 
 ### Changed
 - Improved errors generated during truncation: When the provided max length is too low are
@@ -190,6 +194,7 @@ delimiter (Works like `.split(delimiter)`)
 - Fix a bug with the IDs associated with added tokens.
 - Fix a bug that was causing crashes in Python 3.5
 
+[#311]: https://github.com/huggingface/tokenizers/pull/311
 [#309]: https://github.com/huggingface/tokenizers/pull/309
 [#289]: https://github.com/huggingface/tokenizers/pull/289
 [#286]: https://github.com/huggingface/tokenizers/pull/286
@@ -207,6 +212,7 @@ delimiter (Works like `.split(delimiter)`)
 [#193]: https://github.com/huggingface/tokenizers/pull/193
 [#190]: https://github.com/huggingface/tokenizers/pull/190
 [#188]: https://github.com/huggingface/tokenizers/pull/188
+[#187]: https://github.com/huggingface/tokenizers/issues/187
 [#175]: https://github.com/huggingface/tokenizers/issues/175
 [#174]: https://github.com/huggingface/tokenizers/issues/174
 [#165]: https://github.com/huggingface/tokenizers/pull/165
diff --git a/tokenizers/CHANGELOG.md b/tokenizers/CHANGELOG.md
index 834b0806..5d8e2de9 100644
--- a/tokenizers/CHANGELOG.md
+++ b/tokenizers/CHANGELOG.md
@@ -43,6 +43,8 @@ using serde. It is now easy to save/load an entire tokenizer.
 - [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
 activation of the Tensor Cores, while ensuring padding to a multiple of 8.
 - [#298]: Ability to get the currently set truncation/padding params
+- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
+variable.
 
 ### How to migrate
 - Replace any `XXX_to_YYY_offsets()` method call by any of the new ones.
@@ -117,6 +119,7 @@ advised, but that's not the question)
 split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug
 
+[#311]: https://github.com/huggingface/tokenizers/pull/311
 [#309]: https://github.com/huggingface/tokenizers/pull/309
 [#298]: https://github.com/huggingface/tokenizers/pull/298
 [#289]: https://github.com/huggingface/tokenizers/pull/289
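The usage pattern behind the `TOKENIZERS_PARALLELISM` entry above can be sketched as follows. This is a minimal illustration, not part of the patch: the `encode` function here is a hypothetical stand-in for a real `tokenizer.encode` call, and the only claim taken from the changelog is that the environment variable should be set (to `"false"`) before forking worker processes to avoid the deadlock described in issue #187.

```python
import multiprocessing as mp
import os

# Disable the library's internal parallelism BEFORE any encoding happens,
# so that fork-based worker processes do not deadlock (cf. issue #187).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def encode(text):
    # Hypothetical stand-in for `tokenizer.encode(text)`.
    return text.split()

def parallel_encode(texts, workers=2):
    # `fork` is the default start method on Linux, which is where the
    # deadlock occurs if the internal parallelism stays enabled.
    if "fork" not in mp.get_all_start_methods():
        # Fall back to serial encoding where `fork` is unavailable.
        return [encode(t) for t in texts]
    ctx = mp.get_context("fork")
    with ctx.Pool(workers) as pool:
        return pool.map(encode, texts)

if __name__ == "__main__":
    print(parallel_encode(["hello world", "foo bar"]))
```

Note that the environment variable must be exported before the first encode call in the parent process; setting it after the tokenizer has already spawned its thread pool has no effect.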