diff --git a/bindings/python/CHANGELOG.md b/bindings/python/CHANGELOG.md
index e6061e1b..f8184439 100644
--- a/bindings/python/CHANGELOG.md
+++ b/bindings/python/CHANGELOG.md
@@ -28,6 +28,8 @@ whitespaces are part of the actual token. It has been added to `ByteLevelBPETokenizer` but it is
 off by default (`trim_offsets=False`).
 - More alignment mappings on the `Encoding`.
 - `post_process` can be called on the `Tokenizer`
+- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
+`get_vocab(with_added_tokens: bool)`

 ### Fixed
 - [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
@@ -148,6 +150,7 @@ delimiter (Works like `.split(delimiter)`)
 - Fix a bug with the IDs associated with added tokens.
 - Fix a bug that was causing crashes in Python 3.5

+[#208]: https://github.com/huggingface/tokenizers/pull/208
 [#197]: https://github.com/huggingface/tokenizers/pull/197
 [#193]: https://github.com/huggingface/tokenizers/pull/193
 [#190]: https://github.com/huggingface/tokenizers/pull/190
diff --git a/tokenizers/CHANGELOG.md b/tokenizers/CHANGELOG.md
index 77a5dd8c..965458f1 100644
--- a/tokenizers/CHANGELOG.md
+++ b/tokenizers/CHANGELOG.md
@@ -30,6 +30,7 @@ the unintuitive inclusion of the whitespaces in the produced offsets, even if they are
 part of the actual token
 - More alignment mappings on the `Encoding`.
 - `post_process` can be called on the `Tokenizer`
+- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` & `Model`

 ### Fixed
 - [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
@@ -53,6 +54,7 @@ advised, but that's not the question)
 split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug

+[#208]: https://github.com/huggingface/tokenizers/pull/208
 [#197]: https://github.com/huggingface/tokenizers/pull/197
 [#193]: https://github.com/huggingface/tokenizers/pull/193
 [#190]: https://github.com/huggingface/tokenizers/pull/190
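
The `get_vocab(with_added_tokens: bool)` entry above ([#208]) exposes a token-to-id mapping that optionally merges in tokens added on top of the base model vocabulary. A minimal pure-Python sketch of that flag's semantics, using plain dicts as stand-ins (the `base_vocab`/`added_tokens` inputs and the standalone function are illustrative, not the library's internals):

```python
def get_vocab(base_vocab, added_tokens, with_added_tokens=True):
    """Return a token -> id mapping, optionally including added tokens.

    Illustrative stand-in for `Tokenizer.get_vocab(with_added_tokens)`:
    when the flag is False, only the base model vocabulary is returned;
    when True, added tokens are merged in on top of it.
    """
    vocab = dict(base_vocab)  # copy so callers can't mutate internal state
    if with_added_tokens:
        vocab.update(added_tokens)  # added tokens use ids past the base vocab
    return vocab


# Hypothetical vocabularies for demonstration only
base = {"[UNK]": 0, "hello": 1, "world": 2}
added = {"[CUSTOM]": 3}

print(get_vocab(base, added, with_added_tokens=False))  # base vocab only
print(get_vocab(base, added, with_added_tokens=True))   # includes "[CUSTOM]"
```

Returning a copy rather than the internal mapping keeps the tokenizer's state safe from accidental mutation by callers.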