164 Commits

Author SHA1 Message Date
MOI Anthony
457e6c9932 Merge pull request #71 from huggingface/python_example_fix
Use the same vocabs in python's example.py
2020-01-15 10:07:34 -05:00
Morgan Funtowicz
374f944e32 Use the same vocabs/merges for Python and Rust comparison.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-15 11:57:34 +01:00
Morgan Funtowicz
4839154145 Remove kwargs mapping on Tokenizer decode/decode_batch as there is only one possible arg.
This was suggested in the current issue https://github.com/huggingface/tokenizers/issues/54#issuecomment-574104841.

kwargs cannot be passed as positional arguments; they have to be named. Replacing kwargs with the actual skip_special_tokens
parameter allows both syntaxes (named and positional).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-15 11:16:01 +01:00
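The difference described in commit 4839154145 can be illustrated with plain Python functions. This is only a sketch: `decode_with_kwargs`, `decode_explicit`, `_join`, `VOCAB` and `SPECIAL_IDS` are hypothetical stand-ins invented for illustration, not the actual tokenizers bindings. Only the parameter name `skip_special_tokens` comes from the commit message.

```python
# Hypothetical toy vocabulary; id 0 is treated as a special token.
VOCAB = {0: "[PAD]", 1: "hello", 2: "world"}
SPECIAL_IDS = {0}


def _join(ids, skip_special_tokens):
    """Map ids to tokens, optionally dropping special ones."""
    tokens = [
        VOCAB[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)


def decode_with_kwargs(ids, **kwargs):
    """Before: the option is only reachable by keyword."""
    return _join(ids, kwargs.get("skip_special_tokens", False))


def decode_explicit(ids, skip_special_tokens=False):
    """After: an explicit parameter accepts both call styles."""
    return _join(ids, skip_special_tokens)


ids = [1, 2, 0]

# Named syntax works with both signatures.
named_old = decode_with_kwargs(ids, skip_special_tokens=True)
named_new = decode_explicit(ids, skip_special_tokens=True)

# Positional syntax only works with the explicit parameter.
positional_new = decode_explicit(ids, True)

try:
    decode_with_kwargs(ids, True)
    positional_old_failed = False
except TypeError:
    # **kwargs never captures positional arguments.
    positional_old_failed = True
```

With the `**kwargs` signature, a second positional argument has nowhere to go, so Python raises `TypeError`. Naming the parameter explicitly removes that restriction without changing the keyword form.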
Morgan Funtowicz
894f887444 Updated train_bert_wordpiece.py as well.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-14 13:32:02 +01:00
Morgan Funtowicz
7caf9fd823 Updated train_bytelevel_bpe.py to use the high level Python API.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-14 12:00:50 +01:00
Anthony MOI
fc9e81d4ab Fix split on special tokens & bump version 2020-01-12 02:35:45 -05:00
Anthony MOI
dd569020c1 Bump python version for release 2020-01-10 13:49:26 -05:00
Anthony MOI
89e0d90c8a Python - Final fix of the typings 2020-01-10 13:30:35 -05:00
Pierric Cistac
56878a8e43 fix : 2020-01-10 13:30:35 -05:00
Pierric Cistac
958883af74 fix imports in root __init__.pyi 2020-01-10 13:30:35 -05:00
MOI Anthony
b491c0b8c4 Update Python Readme 2020-01-10 12:18:16 -05:00
Anthony MOI
b27737d97c Python - Typings update 2020-01-10 10:06:24 -05:00
thomwolf
d8f3fba245 fix training and wordpiece 2020-01-10 10:47:50 +01:00
thomwolf
1a802cb484 fix typos 2020-01-10 10:47:36 +01:00
Anthony MOI
d46ea842c2 Python - IndexableString accepts tuples directly 2020-01-10 00:32:30 -05:00
Morgan Funtowicz
be10f542ce Added SentencePiece and YouTokenToMe model extractors.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-08 22:55:00 +01:00
Anthony MOI
3af2a43cae Hotfix Python bindings 2020-01-08 16:20:05 -05:00
Anthony MOI
ef21c9a7b0 Hotfix for new Builder
cc @epwalsh
2020-01-08 16:19:51 -05:00
Anthony MOI
c7d2800131 Python - Add model saving to base tokenizer 2020-01-08 14:44:17 -05:00
Anthony MOI
bbe31f9237 Quick README update 2020-01-08 14:07:48 -05:00
Anthony MOI
988159a998 Hotfix Python bindings for 32-bit systems 2020-01-08 13:42:35 -05:00
Anthony MOI
383123e21f Bump version 2020-01-08 11:02:40 -05:00
Anthony MOI
bc48a89770 Python - Handle training on custom classes 2020-01-08 10:33:59 -05:00
Anthony MOI
fc56f8d186 Python - Update some naming 2020-01-08 09:54:03 -05:00
thomwolf
882df9b8e2 better repr for tokenizers 2020-01-08 12:06:46 +01:00
thomwolf
111c2d152c add option to remove special tokens 2020-01-08 11:48:47 +01:00
thomwolf
af6a685664 fix add_special_tokens 2020-01-08 11:48:37 +01:00
Anthony MOI
b16ee75b97 Add BertWordPieceTokenizer 2020-01-08 00:32:13 -05:00
Anthony MOI
88711d5717 Python - IndexableString in Encoding 2020-01-08 00:06:57 -05:00
Anthony MOI
dc76e11768 Python - Provide __repr__ for Encoding 2020-01-07 21:33:45 -05:00
Anthony MOI
05f683ce23 Add SentencePieceBPETokenizer 2020-01-07 20:30:15 -05:00
Anthony MOI
ee115df65e Add the original BPETokenizer 2020-01-07 19:58:48 -05:00
Anthony MOI
243a45af40 Add BPEDecoder 2020-01-07 19:56:49 -05:00
Anthony MOI
5bc1e2ee05 Add Lowercase Normalizer 2020-01-07 19:40:19 -05:00
Anthony MOI
099bb8e596 Python - Dropout and unk_token optional 2020-01-07 19:34:36 -05:00
Anthony MOI
03c431c60e Modify BPE with unk_token being a String 2020-01-07 19:22:29 -05:00
Anthony MOI
b17f9d8872 Rename ByteLevelBPE
Rename ByteLevelBPETokenizer
2020-01-07 18:54:21 -05:00
thomwolf
6d0e3ba8f1 fix imports 2020-01-07 18:54:21 -05:00
Anthony MOI
63063118df Python - Adding tokenizers classes - WIP 2020-01-07 18:54:21 -05:00
Anthony MOI
6294d342d5 Hotfix metaspace decoder 2020-01-07 18:53:07 -05:00
Anthony MOI
cbdd2cf423 Python - add Metaspace decoder 2020-01-07 18:40:18 -05:00
Anthony MOI
4e026b57a8 Python - quick fix stub file 2020-01-07 16:18:28 -05:00
Anthony MOI
3f806a2b5f Python - Also update README 2020-01-07 15:24:39 -05:00
Anthony MOI
cc33418044 Python - Update examples with getter/setter 2020-01-07 15:23:11 -05:00
Anthony MOI
8bbf832842 Python - Use Getter/Setter to get/modify Tokenizer's parts 2020-01-07 15:17:23 -05:00
Anthony MOI
eaa23ac8e6 Add the Metaspace PreTokenizer 2020-01-07 12:59:59 -05:00
Anthony MOI
b06681cb1e Bump version for release 2020-01-06 21:05:01 -05:00
Anthony MOI
185b6f0b8b Add Sequence Normalizer 2020-01-06 21:03:05 -05:00
Anthony MOI
5c02bbbc4c Add basic unicode normalizers 2020-01-06 20:38:42 -05:00
Anthony MOI
4b9ae66419 WordPiece decoder with customizable prefix 2020-01-06 20:20:42 -05:00