Commit Graph

36 Commits

Author SHA1 Message Date
2dc48e56ac Python - Update pyo3 version
* Use __new__ instead of static method as model constructors
2020-04-06 21:20:16 +02:00
477037fd6b Python - Improve AddedToken repr 2020-04-01 17:25:55 -04:00
b055b77b54 Python - Add first tests: Tokenizer 2020-04-01 17:25:55 -04:00
a2a6d80017 Python - Expose get_vocab on Tokenizer 2020-03-27 11:53:18 -04:00
9bd9e0b3c1 Expose post_process on the Tokenizer 2020-03-26 15:42:45 -04:00
f8d54edcdd Python - Fix cases where str expected instead of AddedToken 2020-03-25 19:22:53 -04:00
c65d53892d Python - Add bindings for new AddedToken options 2020-03-24 20:58:45 -04:00
60a4fb35f4 Python - Update bindings 2020-03-16 10:36:42 -04:00
257360acec Python - encode & encode batch with add_special_tokens 2020-03-10 16:21:10 -04:00
f263d7651f Python - RustFmt 2020-02-18 15:07:34 -05:00
c4bac6aeeb Expose num_added_tokens on Python side (#146)
* Expose num_added_tokens on the Python side without the need to pass an Encoding to added_tokens.

This makes it possible to compute the max sentence length for single/pair inputs without needing an Encoding structure.
As the number of added tokens is fixed and static at compile time, this allows more flexible usage of the method.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed num_added_tokens to num_special_tokens_to_add.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-14 10:55:20 +00:00
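
As a rough illustration of the use case described in the commit above, the sketch below computes how many content tokens fit in a fixed-length input once special tokens are accounted for. StubTokenizer is a hypothetical stand-in, not the real binding; its counts (2 for a single sequence, 3 for a pair) assume a BERT-style post-processor, and the is_pair parameter is likewise an assumption about the method's shape.

```python
class StubTokenizer:
    """Hypothetical stand-in for a tokenizer binding (not the real class)."""

    def num_special_tokens_to_add(self, is_pair):
        # Assumed BERT-style layout: [CLS] A [SEP] -> 2, [CLS] A [SEP] B [SEP] -> 3.
        return 3 if is_pair else 2


def max_content_tokens(tokenizer, max_length, is_pair=False):
    """Return how many non-special tokens fit within max_length."""
    return max_length - tokenizer.num_special_tokens_to_add(is_pair)


tokenizer = StubTokenizer()
single_budget = max_content_tokens(tokenizer, 512)
pair_budget = max_content_tokens(tokenizer, 512, is_pair=True)
```

No Encoding object is built anywhere in this sketch; the budget is derived from the count alone.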
4839154145 Remove kwargs mapping on Tokenizer decode/decode_batch as there is only one possible arg.
This was suggested in the issue https://github.com/huggingface/tokenizers/issues/54#issuecomment-574104841.

kwargs cannot be passed as positional arguments; they have to be named. Replacing kwargs with the actual skip_special_tokens
argument allows both syntaxes (named and positional).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-15 11:16:01 +01:00
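
The difference described in the commit above can be sketched in plain Python. Both functions below are hypothetical illustrations, not the actual binding code, and the default value of skip_special_tokens is a placeholder assumption.

```python
def decode_with_kwargs(ids, **kwargs):
    # Hypothetical: the option is only reachable by name through kwargs.
    skip_special_tokens = kwargs.get("skip_special_tokens", True)
    return ids, skip_special_tokens


def decode_explicit(ids, skip_special_tokens=True):
    # Hypothetical: an explicit parameter accepts named and positional use.
    return ids, skip_special_tokens


# Both call styles work with the explicit parameter.
by_position = decode_explicit([1, 2], False)
by_name = decode_explicit([1, 2], skip_special_tokens=False)

# The kwargs version rejects a second positional argument.
try:
    decode_with_kwargs([1, 2], False)
    positional_error = None
except TypeError as exc:
    positional_error = exc
```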
fc56f8d186 Python - Update some naming 2020-01-08 09:54:03 -05:00
8bbf832842 Python - Use Getter/Setter to get/modify Tokenizer's parts 2020-01-07 15:17:23 -05:00
b7d0acc562 Python - Improve decode/decode_batch API 2020-01-06 16:39:36 -05:00
90dfdc715d Expose Tokenizer parts 2019-12-31 22:57:47 -05:00
3f79d9d5e0 Python - Add normalizers bindings & BertNormalizer 2019-12-29 00:36:09 -05:00
74cc6f6bde Python - Simplify padding interface 2019-12-26 14:34:13 -05:00
d93d4fc3cd Python - Simplify truncation interface 2019-12-26 10:35:20 -05:00
1879cb0bcb Python - change with_added_tokens as kwarg 2019-12-25 22:22:35 -05:00
f2b9c30ad9 Handle vocab size with added tokens 2019-12-19 20:19:56 -05:00
b7040e0412 Option to skip special tokens while decoding 2019-12-19 20:03:02 -05:00
a8d68d516d Handle special tokens 2019-12-19 19:48:16 -05:00
3f95248d6d Python - Truncation & padding bindings 2019-12-17 17:24:53 -05:00
93a74aa53a Python - Expose PostProcessors 2019-12-16 18:46:14 -05:00
1a90cc96e5 Python - Can add tokens 2019-12-16 18:45:26 -05:00
ed7e3999d2 Python - Fix some clippy warnings 2019-12-13 18:17:51 -05:00
2a0ad97809 Python - Update API to allow failure 2019-12-13 12:20:05 -05:00
b4b31d73cd Expose vocabulary size 2019-12-10 16:20:31 -05:00
6c294c60b0 Python - Add Encoding repr + improve example 2019-12-10 15:18:07 -05:00
8cedc5f1f6 Update Python bindings for Encoding 2019-12-10 12:38:36 -05:00
849272d44f Python - add missing modules exports 2019-12-09 12:50:53 -05:00
eaafb22511 Add bindings for Trainer in Python 2019-12-03 15:54:15 -05:00
8fbe3c2662 Python - Add decoders 2019-11-22 21:08:57 -05:00
e44f52024c Python - Set a PreTokenizer in a model 2019-11-22 21:01:52 -05:00
39a6d04c53 Improve Python bindings
This is an attempt at actually exposing the same structure that we use in the Rust lib. This will allow Python to instantiate Model/PreTokenizer/... with their own arguments, combining everything without relying on parsed kwargs.
2019-11-22 17:57:36 -05:00