75713ce809
Merge pull request #23 from huggingface/cache
Avoid creating unnecessary vectors when accessing cache
2020-01-01 14:47:28 -05:00
65471b4f2c
Merge branch 'master' into cache
2020-01-01 14:10:20 -05:00
9a10acc981
don't create unnecessary vectors when accessing cache
2020-01-01 14:06:31 -05:00
a5c5e5840f
Oops - Fix trainer
2020-01-01 13:36:42 -05:00
a7a5f9a67f
BpeTrainer handles special tokens and limiting alphabet
2020-01-01 12:54:58 -05:00
ebf22198f3
Add benchmark framework and benches for BPE (GPT2) (#4)
* add benchmarks
* fix bench
* refactor BPE benchmarks
* fix
* remove un-needed gitignore
* update Cargo.lock
* fix
* small fix
* improve benchmarks
* move setup to Makefile
* benchmark BPE encode batch
* refactor batch benchmark
2020-01-01 07:35:57 -08:00
84c7a8623a
Remove all printed logs
2020-01-01 01:45:24 -05:00
47e4b00e05
BpeTrainer shows some progress
2020-01-01 01:28:17 -05:00
f3aef0e4e6
Fix BPE saving (u32 => String)
2019-12-31 23:15:10 -05:00
90dfdc715d
Expose Tokenizer parts
2019-12-31 22:57:47 -05:00
90df088054
Fix ByteLevel PreTokenizer
I broke it with my last changes: a string cannot be sliced at arbitrary byte indices, because a slice has to start and end on character boundaries.
2019-12-31 15:09:51 -05:00
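The ByteLevel fix above (90df088054) comes down to the fact that Rust string slices are indexed by byte, while UTF-8 characters can span several bytes. The following is a minimal sketch of char-based slicing, not the code from that commit; `slice_chars` is a hypothetical helper introduced only for illustration.

```rust
/// Return the substring covering chars `start..end`, where both bounds are
/// character indices rather than byte indices.
/// Hypothetical helper for illustration; not part of the tokenizers crate.
fn slice_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    // Byte offset at which each char starts, plus the end of the string,
    // so that `bounds[i]` is the byte position of char index `i`.
    let bounds: Vec<usize> = s
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len()))
        .collect();
    let from = *bounds.get(start)?;
    let to = *bounds.get(end)?;
    // `str::get` returns None instead of panicking on an invalid range.
    s.get(from..to)
}
```

With `"héllo"`, where `é` occupies two bytes, `slice_chars("héllo", 1, 3)` returns `Some("él")`, while the byte-based `"héllo".get(1..2)` returns `None` because byte 2 falls in the middle of `é`. Indexing with `&s[1..2]` would panic in the same situation.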
f28ca58fd9
[Fix #17] BPE & WordPiece models saving
2019-12-31 13:56:28 -05:00
2125e4d422
Merge pull request #21 from huggingface/dropout
Implement dropout for BPE
2019-12-30 19:39:29 -05:00
b21a5496a7
no cache when dropout
2019-12-30 15:58:16 -08:00
a642807fde
fix clippy warnings
2019-12-30 14:23:32 -08:00
fdb8ffca27
fix comment
2019-12-30 14:18:08 -08:00
b28c3fd04c
add doc
2019-12-30 14:15:26 -08:00
0be9e5a7f0
implement dropout for BPE
2019-12-30 14:14:26 -08:00
5194daa0ce
Merge pull request #20 from huggingface/docs
Clean up Rust docs
2019-12-30 14:17:14 -05:00
d163bbadae
remove redundant headers, other small cleanups
2019-12-30 10:46:56 -08:00
225a886382
Python - Expose Whitespace PreTokenizer
2019-12-30 13:10:33 -05:00
4677a09626
Python - Expose pad and truncate on Encoding
2019-12-30 12:56:07 -05:00
8ddb2de64e
Update unicode-normalization to published crate
2019-12-30 12:18:00 -05:00
f5327f977e
Merge pull request #19 from huggingface/handle-offsets
Handle offsets
2019-12-30 10:46:30 -05:00
06d515d41b
Python - Add ability to retrieve a range of a string
2019-12-29 01:37:03 -05:00
049029dc42
Python - Restore methods on Encoding
2019-12-29 01:26:42 -05:00
708a63514a
Add ability to retrieve ranges of NormalizedString
2019-12-29 01:22:16 -05:00
9c574ad1b7
Python - Fix some import warnings
2019-12-29 00:43:32 -05:00
3779bf3e19
Python - Update example
2019-12-29 00:38:37 -05:00
3dcf9f763c
Python - Update pre tokenizers with offsets
2019-12-29 00:37:58 -05:00
3f79d9d5e0
Python - Add normalizers bindings & BertNormalizer
2019-12-29 00:36:09 -05:00
81be029881
Fix - Handle errors during normalization
2019-12-29 00:24:01 -05:00
79b96dccd0
Fix lowercase/uppercase normalization
Since each character being lowercased or uppercased can actually
generate one or more characters, we need to keep track of the offsets
being updated in the process.
2019-12-29 00:19:49 -05:00
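The lowercase/uppercase fix above (79b96dccd0) depends on the fact that case conversion is not one-to-one: a single character can map to several. The following is a minimal sketch of tracking that mapping, assuming a simple per-output-char alignment vector; `lowercase_with_alignment` is a hypothetical name, and this is not how NormalizedString is actually implemented.

```rust
/// Lowercase `input` and record, for every char of the output, the index of
/// the input char it was produced from.
/// Hypothetical helper for illustration; not the NormalizedString API.
fn lowercase_with_alignment(input: &str) -> (String, Vec<usize>) {
    let mut lowered = String::with_capacity(input.len());
    let mut alignment = Vec::new();
    for (i, c) in input.chars().enumerate() {
        // `char::to_lowercase` yields an iterator because one char may
        // lowercase to more than one char.
        for lc in c.to_lowercase() {
            lowered.push(lc);
            alignment.push(i);
        }
    }
    (lowered, alignment)
}
```

For the input `"İx"`, the capital dotted I (U+0130) lowercases to `i` followed by the combining dot U+0307, so the output has three chars and the alignment is `[0, 0, 1]`: the first two output chars both point back to input char 0. Without the alignment, offsets computed on the lowercased text would be shifted by one relative to the original.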
22ffa716a1
BertPreTokenizer pre tokenize only (with offsets)
2019-12-29 00:12:24 -05:00
cda9fae992
Add BertNormalizer with offsets tracking
2019-12-29 00:10:45 -05:00
ad9cc52d83
ByteLevel PreTokenizer handles offsets
2019-12-29 00:08:42 -05:00
35a8dfdd55
Whitespace PreTokenizer handles offsets
2019-12-28 15:50:42 -05:00
be00a1e45e
Improve clarity for BertProcessing
2019-12-28 15:45:51 -05:00
d7af007539
BertProcessor handles NormalizedString merging
2019-12-28 15:30:57 -05:00
f4df7f5e2a
Update Tokenizer with NormalizedString & Encoding
2019-12-28 15:28:44 -05:00
4afcb1ef96
PreTokenizers handle offsets
2019-12-28 15:28:21 -05:00
8c40c89836
Encoding uses NormalizedString
2019-12-28 15:25:50 -05:00
162829b7a9
Introduce NormalizedString
2019-12-28 15:24:09 -05:00
96ef467bbf
Use forked unicode-normalization
2019-12-28 15:22:52 -05:00
a4beecf944
WordPiece handles offsets
2019-12-28 15:22:03 -05:00
5682627223
BPE handles offsets
2019-12-28 15:21:50 -05:00
5d9848ad6c
Models handle offsets
2019-12-28 15:21:29 -05:00
839239d3b4
Bump version
2019-12-27 10:43:34 -05:00
bddf7ba737
Python - Fix building from wheels
2019-12-27 10:39:19 -05:00
ffd28ba558
Bump for release
2019-12-26 14:56:13 -05:00