f3aef0e4e6
Fix BPE saving (u32 => String)
2019-12-31 23:15:10 -05:00
90dfdc715d
Expose Tokenizer parts
2019-12-31 22:57:47 -05:00
90df088054
Fix ByteLevel PreTokenizer
...
I broke it with my last changes. We cannot slice a string at arbitrary byte indices, because an index may fall in the middle of a multi-byte character.
2019-12-31 15:09:51 -05:00
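The byte-indexing problem described in the commit above can be illustrated with a short sketch. It is not the code from the commit; `char_slice` is a hypothetical helper name, and it assumes the caller wants to address the string by character positions rather than byte positions.

```rust
/// Return the substring covering chars [start, end) of `s`,
/// or None if the range is out of bounds. (Hypothetical helper.)
fn char_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every char, plus the total length as an end sentinel.
    let mut byte_offsets: Vec<usize> = s.char_indices().map(|(i, _)| i).collect();
    byte_offsets.push(s.len());
    let b_start = *byte_offsets.get(start)?;
    let b_end = *byte_offsets.get(end)?;
    Some(&s[b_start..b_end])
}

fn main() {
    let s = "héllo";
    // "é" occupies two bytes, so byte index 2 is not a char boundary.
    assert!(!s.is_char_boundary(2));
    // Checked slicing returns None here; `&s[0..2]` would panic instead.
    assert_eq!(s.get(0..2), None);
    // Slicing by char positions works as expected.
    assert_eq!(char_slice(s, 0, 2), Some("hé"));
}
```

Collecting the offsets costs a pass over the string; code that slices repeatedly would likely keep the offsets around instead of recomputing them.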
f28ca58fd9
[ Fix #17 ] BPE & WordPiece models saving
2019-12-31 13:56:28 -05:00
2125e4d422
Merge pull request #21 from huggingface/dropout
...
Implement dropout for BPE
2019-12-30 19:39:29 -05:00
b21a5496a7
no cache when dropout
2019-12-30 15:58:16 -08:00
a642807fde
fix clippy warnings
2019-12-30 14:23:32 -08:00
fdb8ffca27
fix comment
2019-12-30 14:18:08 -08:00
b28c3fd04c
add doc
2019-12-30 14:15:26 -08:00
0be9e5a7f0
implement dropout for BPE
2019-12-30 14:14:26 -08:00
5194daa0ce
Merge pull request #20 from huggingface/docs
...
Clean up Rust docs
2019-12-30 14:17:14 -05:00
d163bbadae
remove redundant headers, other small cleanups
2019-12-30 10:46:56 -08:00
225a886382
Python - Expose Whitespace PreTokenizer
2019-12-30 13:10:33 -05:00
4677a09626
Python - Expose pad and truncate on Encoding
2019-12-30 12:56:07 -05:00
8ddb2de64e
Update unicode-normalization to published crate
2019-12-30 12:18:00 -05:00
f5327f977e
Merge pull request #19 from huggingface/handle-offsets
...
Handle offsets
2019-12-30 10:46:30 -05:00
06d515d41b
Python - Add ability to retrieve a range of string
2019-12-29 01:37:03 -05:00
049029dc42
Python - Restore methods on Encoding
2019-12-29 01:26:42 -05:00
708a63514a
Add ability to retrieve ranges of NormalizedString
2019-12-29 01:22:16 -05:00
9c574ad1b7
Python - Fix some import warnings
2019-12-29 00:43:32 -05:00
3779bf3e19
Python - Update example
2019-12-29 00:38:37 -05:00
3dcf9f763c
Python - Update pre tokenizers with offsets
2019-12-29 00:37:58 -05:00
3f79d9d5e0
Python - Add normalizers bindings & BertNormalizer
2019-12-29 00:36:09 -05:00
81be029881
Fix - Handle errors during normalization
2019-12-29 00:24:01 -05:00
79b96dccd0
Fix lowercase/uppercase normalization
...
Since each character being lowercased or uppercased can actually
generate one or more characters, we need to keep track of the offsets
being updated in the process.
2019-12-29 00:19:49 -05:00
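The reason offsets need tracking during case changes can be shown with a minimal sketch. This is an illustration under assumptions, not the NormalizedString implementation: `lowercase_with_alignment` is a hypothetical name, and the alignment is stored as one original char index per output char.

```rust
/// Lowercase `s`, returning the new string and, for each output char,
/// the index of the original char that produced it. (Hypothetical helper.)
fn lowercase_with_alignment(s: &str) -> (String, Vec<usize>) {
    let mut out = String::with_capacity(s.len());
    let mut alignment = Vec::new();
    for (orig_idx, c) in s.chars().enumerate() {
        // `char::to_lowercase` returns an iterator because one char
        // may lowercase to several chars.
        for lower in c.to_lowercase() {
            out.push(lower);
            alignment.push(orig_idx);
        }
    }
    (out, alignment)
}

fn main() {
    // U+0130 (capital I with dot above) lowercases to 'i' followed by U+0307.
    let (lowered, alignment) = lowercase_with_alignment("\u{0130}A");
    assert_eq!(lowered, "i\u{0307}a");
    // Two output chars map back to original char 0, one to original char 1.
    assert_eq!(alignment, vec![0, 0, 1]);
}
```

With the alignment vector, a range in the lowercased text can be mapped back to a range in the original text even when the two have different lengths.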
22ffa716a1
BertPreTokenizer pre tokenize only (with offsets)
2019-12-29 00:12:24 -05:00
cda9fae992
Add BertNormalizer with offsets tracking
2019-12-29 00:10:45 -05:00
ad9cc52d83
ByteLevel PreTokenizer handles offsets
2019-12-29 00:08:42 -05:00
35a8dfdd55
Whitespace PreTokenizer handles offsets
2019-12-28 15:50:42 -05:00
be00a1e45e
Improve clarity for BertProcessing
2019-12-28 15:45:51 -05:00
d7af007539
BertProcessor handles NormalizedString merging
2019-12-28 15:30:57 -05:00
f4df7f5e2a
Update Tokenizer with NormalizedString & Encoding
2019-12-28 15:28:44 -05:00
4afcb1ef96
PreTokenizers handle offsets
2019-12-28 15:28:21 -05:00
8c40c89836
Encoding uses NormalizedString
2019-12-28 15:25:50 -05:00
162829b7a9
Introduce NormalizedString
2019-12-28 15:24:09 -05:00
96ef467bbf
Use forked unicode-normalization
2019-12-28 15:22:52 -05:00
a4beecf944
WordPiece handles offsets
2019-12-28 15:22:03 -05:00
5682627223
BPE handles offsets
2019-12-28 15:21:50 -05:00
5d9848ad6c
Models handle offsets
2019-12-28 15:21:29 -05:00
839239d3b4
Bump version
2019-12-27 10:43:34 -05:00
bddf7ba737
Python - Fix building from wheels
2019-12-27 10:39:19 -05:00
ffd28ba558
Bump for release
2019-12-26 14:56:13 -05:00
74cc6f6bde
Python - Simplify padding interface
2019-12-26 14:34:13 -05:00
d1e59e09bf
Fix a bug when adding special tokens
...
If we add special tokens that are already part of the model's vocabulary, the tokens aren't added to the tokenizer, which then builds an empty regex. This completely breaks the tokenization.
2019-12-26 14:32:50 -05:00
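The failure mode in the commit above comes from building a pattern out of an empty token list: an empty pattern matches the empty string at every position. The following sketch shows one way to guard against it. It is an assumption about the general shape of the fix, not the code from the commit; `escape` and `special_tokens_pattern` are hypothetical names, and only the standard library is used, so the pattern is returned as a string and never compiled.

```rust
/// Escape regex metacharacters so a token is matched literally.
/// (Hypothetical stand-in for a regex library's escape function.)
fn escape(token: &str) -> String {
    let mut out = String::new();
    for c in token.chars() {
        if r"\.+*?()|[]{}^$#&-~".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Build an alternation pattern for the added tokens, or None when
/// there is nothing to match, so the caller never compiles an empty pattern.
fn special_tokens_pattern(added: &[&str]) -> Option<String> {
    if added.is_empty() {
        return None;
    }
    let parts: Vec<String> = added.iter().map(|t| escape(t)).collect();
    Some(parts.join("|"))
}

fn main() {
    // No tokens were actually added: skip the regex entirely.
    assert_eq!(special_tokens_pattern(&[]), None);
    // Tokens present: each one is escaped and joined with '|'.
    assert_eq!(
        special_tokens_pattern(&["[CLS]", "[SEP]"]),
        Some(String::from(r"\[CLS\]|\[SEP\]"))
    );
}
```

Returning `Option` makes the "no added tokens" case explicit at the call site, where the tokenizer can fall through to normal tokenization.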
d93d4fc3cd
Python - Simplify truncation interface
2019-12-26 10:35:20 -05:00
a7734ffc9f
Python - Update doc and readme for add_prefix_space
2019-12-26 10:34:53 -05:00
1879cb0bcb
Python - change with_added_tokens as kwarg
2019-12-25 22:22:35 -05:00
905c1eb77e
Python - update some packages
2019-12-25 22:16:43 -05:00
597031b973
Python - remove unused variable
2019-12-25 22:16:11 -05:00
9d289d357d
Python - change add_prefix_space as kwarg
2019-12-25 22:10:17 -05:00