209 Commits

Author SHA1 Message Date
75713ce809 Merge pull request #23 from huggingface/cache
Avoid creating unnecessary vectors when accessing cache
2020-01-01 14:47:28 -05:00
65471b4f2c Merge branch 'master' into cache 2020-01-01 14:10:20 -05:00
9a10acc981 don't create unnecessary vectors when accessing cache 2020-01-01 14:06:31 -05:00
a5c5e5840f Oops - Fix trainer 2020-01-01 13:36:42 -05:00
a7a5f9a67f BpeTrainer handles special tokens and limiting alphabet 2020-01-01 12:54:58 -05:00
ebf22198f3 Add benchmark framework and benches for BPE (GPT2) (#4)
* add benchmarks

* fix bench

* refactor BPE benchmarks

* fix

* remove un-needed gitignore

* update Cargo.lock

* fix

* small fix

* improve benchmarks

* move setup to Makefile

* benchmark BPE encode batch

* refactor batch benchmark
2020-01-01 07:35:57 -08:00
84c7a8623a Remove all printed logs 2020-01-01 01:45:24 -05:00
47e4b00e05 BpeTrainer shows some progress 2020-01-01 01:28:17 -05:00
f3aef0e4e6 Fix BPE saving (u32 => String) 2019-12-31 23:15:10 -05:00
90dfdc715d Expose Tokenizer parts 2019-12-31 22:57:47 -05:00
90df088054 Fix ByteLevel PreTokenizer
I broke it with my last changes. We cannot take a slice of a string by indexing on bytes obviously.
2019-12-31 15:09:51 -05:00
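
Note on 90df088054: the message refers to the fact that a Rust `&str` is indexed by UTF-8 byte offsets, and slicing at an offset that falls inside a multi-byte character panics. The sketch below is only an illustration of that pitfall and one way around it; `slice_by_chars` is a hypothetical helper, not code from this repository or from the actual fix.

```rust
/// Hypothetical helper: return the substring covering chars [start, end),
/// where `start` and `end` are character positions, not byte positions.
fn slice_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    // Byte offset of every char, plus the end-of-string offset.
    let offsets: Vec<usize> = s
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len()))
        .collect();
    let from = *offsets.get(start)?;
    let to = *offsets.get(end)?;
    if from > to {
        return None;
    }
    // Both offsets come from `char_indices`, so they are valid char boundaries.
    Some(&s[from..to])
}

fn main() {
    let s = "héllo";
    // 'é' takes two bytes in UTF-8, so byte offset 2 is inside that character.
    assert!(!s.is_char_boundary(2));
    // Checked slicing returns None here; `&s[0..2]` would panic instead.
    assert_eq!(s.get(0..2), None);
    // Slicing by character positions works as expected.
    assert_eq!(slice_by_chars(s, 0, 2), Some("hé"));
    println!("{:?}", slice_by_chars(s, 1, 4));
}
```

Collecting the offsets into a `Vec` keeps the sketch short; code on a hot path would more likely walk `char_indices` once and stop early.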
f28ca58fd9 [Fix #17] BPE & WordPiece models saving 2019-12-31 13:56:28 -05:00
2125e4d422 Merge pull request #21 from huggingface/dropout
Implement dropout for BPE
2019-12-30 19:39:29 -05:00
b21a5496a7 no cache when dropout 2019-12-30 15:58:16 -08:00
a642807fde fix clippy warnings 2019-12-30 14:23:32 -08:00
fdb8ffca27 fix comment 2019-12-30 14:18:08 -08:00
b28c3fd04c add doc 2019-12-30 14:15:26 -08:00
0be9e5a7f0 implement dropout for BPE 2019-12-30 14:14:26 -08:00
5194daa0ce Merge pull request #20 from huggingface/docs
Clean up Rust docs
2019-12-30 14:17:14 -05:00
d163bbadae remove redundant headers, other small cleanups 2019-12-30 10:46:56 -08:00
225a886382 Python - Expose Whitespace PreTokenizer 2019-12-30 13:10:33 -05:00
4677a09626 Python - Expose pad and truncate on Encoding 2019-12-30 12:56:07 -05:00
8ddb2de64e Update unicode-normalization to published crate 2019-12-30 12:18:00 -05:00
f5327f977e Merge pull request #19 from huggingface/handle-offsets
Handle offsets
2019-12-30 10:46:30 -05:00
06d515d41b Python - Add ability to retrieve a range of string 2019-12-29 01:37:03 -05:00
049029dc42 Python - Restore methods on Encoding 2019-12-29 01:26:42 -05:00
708a63514a Add ability to retrieve ranges or NormalizedString 2019-12-29 01:22:16 -05:00
9c574ad1b7 Python - Fix some import warnings 2019-12-29 00:43:32 -05:00
3779bf3e19 Python - Update example 2019-12-29 00:38:37 -05:00
3dcf9f763c Python - Update pre tokenizers with offsets 2019-12-29 00:37:58 -05:00
3f79d9d5e0 Python - Add normalizers bindings & BertNormalizer 2019-12-29 00:36:09 -05:00
81be029881 Fix - Handle errors during normalization 2019-12-29 00:24:01 -05:00
79b96dccd0 Fix lowercase/uppercase normalization
Since each character being lowercased or uppercased can actually
generate one or more characters, we need to keep track of the offsets
being updated in the process.
2019-12-29 00:19:49 -05:00
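
Note on 79b96dccd0: the point about one character expanding into several can be seen with Rust's `char::to_lowercase`, which returns an iterator rather than a single `char`. The sketch below is a minimal illustration, assuming a simple "output char to original char index" alignment; `lowercase_with_alignment` is a hypothetical name and this is not the repository's `NormalizedString` implementation.

```rust
/// Hypothetical helper: lowercase `s` and return, for every char of the
/// output, the index of the original char that produced it.
fn lowercase_with_alignment(s: &str) -> (String, Vec<usize>) {
    let mut out = String::with_capacity(s.len());
    let mut alignment = Vec::new();
    for (orig_idx, c) in s.chars().enumerate() {
        // `to_lowercase` may yield more than one char for a single input char.
        for lowered in c.to_lowercase() {
            out.push(lowered);
            alignment.push(orig_idx);
        }
    }
    (out, alignment)
}

fn main() {
    // U+0130 (capital I with dot above) lowercases to 'i' followed by U+0307.
    let (lower, alignment) = lowercase_with_alignment("İx");
    assert_eq!(lower, "i\u{307}x");
    // Two output chars map back to original char 0, one to original char 1.
    assert_eq!(alignment, vec![0, 0, 1]);
    println!("{:?} {:?}", lower, alignment);
}
```

The same shape works for uppercasing with `char::to_uppercase`, where for example 'ß' expands to "SS".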
22ffa716a1 BertPreTokenizer pre tokenize only (with offsets) 2019-12-29 00:12:24 -05:00
cda9fae992 Add BertNormalizer with offsets tracking 2019-12-29 00:10:45 -05:00
ad9cc52d83 ByteLevel PreTokenizer handles offsets 2019-12-29 00:08:42 -05:00
35a8dfdd55 Whitespace PreTokenizer handles offsets 2019-12-28 15:50:42 -05:00
be00a1e45e Improve clarity for BertProcessing 2019-12-28 15:45:51 -05:00
d7af007539 BertProcessor handles NormalizedString merging 2019-12-28 15:30:57 -05:00
f4df7f5e2a Update Tokenizer with NormalizedString & Encoding 2019-12-28 15:28:44 -05:00
4afcb1ef96 PreTokenizers handle offsets 2019-12-28 15:28:21 -05:00
8c40c89836 Encoding uses NormalizedString 2019-12-28 15:25:50 -05:00
162829b7a9 Introduce NormalizedString 2019-12-28 15:24:09 -05:00
96ef467bbf Use forked unicode-normalization 2019-12-28 15:22:52 -05:00
a4beecf944 WordPiece handles offsets 2019-12-28 15:22:03 -05:00
5682627223 BPE handles offsets 2019-12-28 15:21:50 -05:00
5d9848ad6c Models handles offsets 2019-12-28 15:21:29 -05:00
839239d3b4 Bump version 2019-12-27 10:43:34 -05:00
bddf7ba737 Python - Fix building from wheels 2019-12-27 10:39:19 -05:00
ffd28ba558 Bump for release 2019-12-26 14:56:13 -05:00