1870 Commits

Author SHA1 Message Date
f4df7f5e2a Update Tokenizer with NormalizedString & Encoding 2019-12-28 15:28:44 -05:00
4afcb1ef96 PreTokenizers handle offsets 2019-12-28 15:28:21 -05:00
8c40c89836 Encoding uses NormalizedString 2019-12-28 15:25:50 -05:00
162829b7a9 Introduce NormalizedString 2019-12-28 15:24:09 -05:00
96ef467bbf Use forked unicode-normalization 2019-12-28 15:22:52 -05:00
a4beecf944 WordPiece handles offsets 2019-12-28 15:22:03 -05:00
5682627223 BPE handles offsets 2019-12-28 15:21:50 -05:00
5d9848ad6c Models handles offsets 2019-12-28 15:21:29 -05:00
839239d3b4 Bump version 2019-12-27 10:43:34 -05:00
bddf7ba737 Python - Fix building from wheels 2019-12-27 10:39:19 -05:00
ffd28ba558 Bump for release 2019-12-26 14:56:13 -05:00
74cc6f6bde Python - Simplify padding interface 2019-12-26 14:34:13 -05:00
d1e59e09bf Fix a bug when adding special tokens
If we add special tokens that are already part of the model's vocabulary, they aren't added to the tokenizer, which then builds an empty regex. This completely breaks the tokenization.
2019-12-26 14:32:50 -05:00
d93d4fc3cd Python - Simplify truncation interface 2019-12-26 10:35:20 -05:00
a7734ffc9f Python - Update doc and readme for add_prefix_space 2019-12-26 10:34:53 -05:00
1879cb0bcb Python - change with_added_tokens as kwarg 2019-12-25 22:22:35 -05:00
905c1eb77e Python - update some packages 2019-12-25 22:16:43 -05:00
597031b973 Python - remove unused variable 2019-12-25 22:16:11 -05:00
9d289d357d Python - change add_prefix_space as kwarg 2019-12-25 22:10:17 -05:00
4bc5a7bbe7 Python - fix example 2019-12-24 11:20:40 -05:00
cf0e8917cd Fix whitespace handling in ByteLevel 2019-12-24 11:20:26 -05:00
9f1421a04b remove Cargo.lock (#7) 2019-12-23 21:22:42 -08:00
c0ed873c4d simplify initialization of BpeTrainer 2019-12-23 20:13:48 -05:00
fab1d4cabc Bump version for release 2019-12-23 17:28:38 -05:00
e01d4f2052 Python - Remove misleading __repr__ 2019-12-23 17:27:59 -05:00
2159123d7c Fix truncate 2019-12-23 17:27:43 -05:00
8fb94be3d0 Merge pull request #6 from huggingface/BPE-tests
Add BPE tests and documentation
2019-12-20 15:34:38 -05:00
9a91016877 Merge branch 'master' into BPE-tests 2019-12-20 08:55:41 -08:00
2266960ef7 Bump version and update Readme 2019-12-20 10:26:40 -05:00
f2b9c30ad9 Handle vocab size with added tokens 2019-12-19 20:19:56 -05:00
b7040e0412 Option to skip special tokens while decoding 2019-12-19 20:03:02 -05:00
a8d68d516d Handle special tokens 2019-12-19 19:48:16 -05:00
7f032b62df Include the added tokens while converting tokens and ids 2019-12-19 18:32:37 -05:00
076ba297fb Cannot add new tokens that already exist in the vocab 2019-12-19 18:32:03 -05:00
6d51e7a393 add example / doc test for BPE trainer 2019-12-19 15:28:58 -08:00
69212e17e9 formatting 2019-12-19 15:07:27 -08:00
a16daa78f1 add test for word merge 2019-12-19 14:45:38 -08:00
184b09e3ac add more tests 2019-12-18 17:40:13 -08:00
1dc0debe36 add initial test 2019-12-18 16:45:11 -08:00
9763282d59 Bump version for release 2019-12-17 18:42:34 -05:00
4d14b08afe ByteLevel handles prefix spaces 2019-12-17 18:35:40 -05:00
6766585965 Python - Do not expose non working features of Encoding 2019-12-17 17:43:42 -05:00
0a3d4a86a9 Python - Update bindings for BertPreTokenizer 2019-12-17 17:40:56 -05:00
e54eee7657 BasicPreTokenizer => BertPreTokenizer 2019-12-17 17:37:13 -05:00
1b66d87fd3 BasicPreTokenizer handles do_basic_tokenize for Bert 2019-12-17 17:35:26 -05:00
3f95248d6d Python - Truncation & padding bindings 2019-12-17 17:24:53 -05:00
5729d3656a Tokenizer handles Truncation and Padding 2019-12-17 15:15:58 -05:00
4c51399b00 An Encoding can be padded 2019-12-17 14:23:37 -05:00
08eb163415 Bump version for release 2019-12-16 19:38:33 -05:00
d80f752ec9 Python - Add some missing Encoding bindings 2019-12-16 19:38:18 -05:00