Commit Graph

151 Commits

SHA1 Message Date
1a802cb484 fix typos 2020-01-10 10:47:36 +01:00
d46ea842c2 Python - IndexableString accepts tuples directly 2020-01-10 00:32:30 -05:00
be10f542ce Added SentencePiece and YouTokenToMe model extractors. (Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>) 2020-01-08 22:55:00 +01:00
3af2a43cae Hotfix Python bindings 2020-01-08 16:20:05 -05:00
ef21c9a7b0 Hotfix for new Builder (cc @epwalsh) 2020-01-08 16:19:51 -05:00
c7d2800131 Python - Add model saving to base tokenizer 2020-01-08 14:44:17 -05:00
bbe31f9237 Quick README update 2020-01-08 14:07:48 -05:00
988159a998 Hotfix Python bindings for 32-bit systems 2020-01-08 13:42:35 -05:00
383123e21f Bump version 2020-01-08 11:02:40 -05:00
bc48a89770 Python - Handle training on custom classes 2020-01-08 10:33:59 -05:00
fc56f8d186 Python - Update some naming 2020-01-08 09:54:03 -05:00
882df9b8e2 better repr for tokenizers 2020-01-08 12:06:46 +01:00
111c2d152c add option to remove special tokens 2020-01-08 11:48:47 +01:00
af6a685664 fix add_special_tokens 2020-01-08 11:48:37 +01:00
b16ee75b97 Add BertWordPieceTokenizer 2020-01-08 00:32:13 -05:00
88711d5717 Python - IndexableString in Encoding 2020-01-08 00:06:57 -05:00
dc76e11768 Python - Provide __repr__ for Encoding 2020-01-07 21:33:45 -05:00
05f683ce23 Add SentencePieceBPETokenizer 2020-01-07 20:30:15 -05:00
ee115df65e Add the original BPETokenizer 2020-01-07 19:58:48 -05:00
243a45af40 Add BPEDecoder 2020-01-07 19:56:49 -05:00
5bc1e2ee05 Add Lowercase Normalizer 2020-01-07 19:40:19 -05:00
099bb8e596 Python - Dropout and unk_token optional 2020-01-07 19:34:36 -05:00
03c431c60e Modify BPE with unk_token being a String 2020-01-07 19:22:29 -05:00
b17f9d8872 Rename ByteLevelBPE (Rename ByteLevelBPETokenizer) 2020-01-07 18:54:21 -05:00
6d0e3ba8f1 fix imports 2020-01-07 18:54:21 -05:00
63063118df Python - Adding tokenizers classes - WIP 2020-01-07 18:54:21 -05:00
6294d342d5 Hotfix metaspace decoder 2020-01-07 18:53:07 -05:00
cbdd2cf423 Python - add Metaspace decoder 2020-01-07 18:40:18 -05:00
4e026b57a8 Python - quick fix stub file 2020-01-07 16:18:28 -05:00
3f806a2b5f Python - Also update README 2020-01-07 15:24:39 -05:00
cc33418044 Python - Update examples with getter/setter 2020-01-07 15:23:11 -05:00
8bbf832842 Python - Use Getter/Setter to get/modify Tokenizer's parts 2020-01-07 15:17:23 -05:00
eaa23ac8e6 Add the Metaspace PreTokenizer 2020-01-07 12:59:59 -05:00
b06681cb1e Bump version for release 2020-01-06 21:05:01 -05:00
185b6f0b8b Add Sequence Normalizer 2020-01-06 21:03:05 -05:00
5c02bbbc4c Add basic unicode normalizers 2020-01-06 20:38:42 -05:00
4b9ae66419 WordPiece decoder with customizable prefix 2020-01-06 20:20:42 -05:00
772d0680b6 Python - Update all typings 2020-01-06 20:03:00 -05:00
0079a7a6b7 Python - Add NormalizedString + doc/typings 2020-01-06 17:55:22 -05:00
6de04bbaea Python - Add typings/doc for Encoding 2020-01-06 17:23:04 -05:00
7e9e0aa81c Python - Add Tokenizer doc with stub file 2020-01-06 16:40:27 -05:00
9a99e2bcb1 Python - Add missing Bpe constructor kwargs 2020-01-06 16:39:59 -05:00
b7d0acc562 Python - Improve decode/decode_batch API 2020-01-06 16:39:36 -05:00
1a083a6e6f Python - Improved stub file for models 2020-01-06 15:55:00 -05:00
0e41e0b327 Python - Include correct packages and stubs 2020-01-06 15:24:17 -05:00
8723f78e6f Python - build-sdist.sh +x mode 2020-01-06 14:24:08 -05:00
d7b6385566 Python - Adding some stub files 2020-01-06 13:04:30 -05:00
7eebd06409 Python - Improve imports 2020-01-06 12:03:01 -05:00
e1caacfce0 Rename package for crates.io 2020-01-04 23:42:32 -05:00
fab4e96b51 Python - Add bert wordpiece training example 2020-01-03 19:37:29 -05:00