201 Commits

SHA1 Message Date
f3aef0e4e6 Fix BPE saving (u32 => String) 2019-12-31 23:15:10 -05:00
90dfdc715d Expose Tokenizer parts 2019-12-31 22:57:47 -05:00
90df088054 Fix ByteLevel PreTokenizer
I broke it with my last changes. We cannot slice a string at arbitrary byte indices, since they may not fall on character boundaries.
2019-12-31 15:09:51 -05:00
f28ca58fd9 [Fix #17] BPE & WordPiece models saving 2019-12-31 13:56:28 -05:00
2125e4d422 Merge pull request #21 from huggingface/dropout
Implement dropout for BPE
2019-12-30 19:39:29 -05:00
b21a5496a7 no cache when dropout 2019-12-30 15:58:16 -08:00
a642807fde fix clippy warnings 2019-12-30 14:23:32 -08:00
fdb8ffca27 fix comment 2019-12-30 14:18:08 -08:00
b28c3fd04c add doc 2019-12-30 14:15:26 -08:00
0be9e5a7f0 implement dropout for BPE 2019-12-30 14:14:26 -08:00
5194daa0ce Merge pull request #20 from huggingface/docs
Clean up Rust docs
2019-12-30 14:17:14 -05:00
d163bbadae remove redundant headers, other small cleanups 2019-12-30 10:46:56 -08:00
225a886382 Python - Expose Whitespace PreTokenizer 2019-12-30 13:10:33 -05:00
4677a09626 Python - Expose pad and truncate on Encoding 2019-12-30 12:56:07 -05:00
8ddb2de64e Update unicode-normalization to published crate 2019-12-30 12:18:00 -05:00
f5327f977e Merge pull request #19 from huggingface/handle-offsets
Handle offsets
2019-12-30 10:46:30 -05:00
06d515d41b Python - Add ability to retrieve a range of string 2019-12-29 01:37:03 -05:00
049029dc42 Python - Restore methods on Encoding 2019-12-29 01:26:42 -05:00
708a63514a Add ability to retrieve ranges of NormalizedString 2019-12-29 01:22:16 -05:00
9c574ad1b7 Python - Fix some import warnings 2019-12-29 00:43:32 -05:00
3779bf3e19 Python - Update example 2019-12-29 00:38:37 -05:00
3dcf9f763c Python - Update pre tokenizers with offsets 2019-12-29 00:37:58 -05:00
3f79d9d5e0 Python - Add normalizers bindings & BertNormalizer 2019-12-29 00:36:09 -05:00
81be029881 Fix - Handle errors during normalization 2019-12-29 00:24:01 -05:00
79b96dccd0 Fix lowercase/uppercase normalization
Since each character being lowercased or uppercased can actually
generate one or more characters, we need to keep track of the offsets
being updated in the process.
2019-12-29 00:19:49 -05:00
22ffa716a1 BertPreTokenizer pre tokenize only (with offsets) 2019-12-29 00:12:24 -05:00
cda9fae992 Add BertNormalizer with offsets tracking 2019-12-29 00:10:45 -05:00
ad9cc52d83 ByteLevel PreTokenizer handles offsets 2019-12-29 00:08:42 -05:00
35a8dfdd55 Whitespace PreTokenizer handles offsets 2019-12-28 15:50:42 -05:00
be00a1e45e Improve clarity for BertProcessing 2019-12-28 15:45:51 -05:00
d7af007539 BertProcessor handles NormalizedString merging 2019-12-28 15:30:57 -05:00
f4df7f5e2a Update Tokenizer with NormalizedString & Encoding 2019-12-28 15:28:44 -05:00
4afcb1ef96 PreTokenizers handle offsets 2019-12-28 15:28:21 -05:00
8c40c89836 Encoding uses NormalizedString 2019-12-28 15:25:50 -05:00
162829b7a9 Introduce NormalizedString 2019-12-28 15:24:09 -05:00
96ef467bbf Use forked unicode-normalization 2019-12-28 15:22:52 -05:00
a4beecf944 WordPiece handles offsets 2019-12-28 15:22:03 -05:00
5682627223 BPE handles offsets 2019-12-28 15:21:50 -05:00
5d9848ad6c Models handle offsets 2019-12-28 15:21:29 -05:00
839239d3b4 Bump version 2019-12-27 10:43:34 -05:00
bddf7ba737 Python - Fix building from wheels 2019-12-27 10:39:19 -05:00
ffd28ba558 Bump for release 2019-12-26 14:56:13 -05:00
74cc6f6bde Python - Simplify padding interface 2019-12-26 14:34:13 -05:00
d1e59e09bf Fix a bug when adding special tokens
If we add special tokens that are already part of the model's vocabulary, they aren't added to the tokenizer, which then builds an empty regex. This completely breaks the tokenization.
2019-12-26 14:32:50 -05:00
d93d4fc3cd Python - Simplify truncation interface 2019-12-26 10:35:20 -05:00
a7734ffc9f Python - Update doc and readme for add_prefix_space 2019-12-26 10:34:53 -05:00
1879cb0bcb Python - change with_added_tokens as kwarg 2019-12-25 22:22:35 -05:00
905c1eb77e Python - update some packages 2019-12-25 22:16:43 -05:00
597031b973 Python - remove unused variable 2019-12-25 22:16:11 -05:00
9d289d357d Python - change add_prefix_space as kwarg 2019-12-25 22:10:17 -05:00
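
Commit 79b96dccd0 above notes that lowercasing or uppercasing a single character can produce more than one character, so offsets have to be tracked while the string changes. The following is a minimal Rust sketch of that idea, not the repository's NormalizedString implementation. The function name `lowercase_with_alignment` and the alignment vector are hypothetical, chosen only for illustration; the sketch relies solely on the standard library's `char::to_lowercase`.

```rust
/// Lowercase `input` and record, for every character of the output,
/// the index of the input character it was produced from.
///
/// Hypothetical helper for illustration only; it is not part of the
/// tokenizers crate.
fn lowercase_with_alignment(input: &str) -> (String, Vec<usize>) {
    let mut output = String::with_capacity(input.len());
    let mut alignment = Vec::new();

    // Iterate over characters (not bytes) so multi-byte characters stay intact.
    for (index, original) in input.chars().enumerate() {
        // `char::to_lowercase` returns an iterator because one character
        // may lowercase to several characters.
        for lowered in original.to_lowercase() {
            output.push(lowered);
            // Every produced character points back to its source character.
            alignment.push(index);
        }
    }

    (output, alignment)
}
```

For example, 'İ' (U+0130) lowercases to 'i' followed by the combining mark U+0307, so calling `lowercase_with_alignment("İa")` yields a three-character string whose alignment is `[0, 0, 1]`: the first two output characters both map back to input character 0. Keeping an alignment like this is one way to recover original offsets after normalization; the actual code may track them differently.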