1870 Commits

SHA1 Message Date
d3c3f5a700 Python - Expose ByteLevel alphabet 2020-01-02 18:06:06 -05:00
f0f9aefd07 ByteLevel exposes its alphabet 2020-01-02 17:51:26 -05:00
7b12b3cca5 BpeTrainer handles initial alphabet 2020-01-02 15:01:22 -05:00
c8a5d2e32a NormalizedString - Fix removal around edges 2020-01-02 14:17:14 -05:00
66b6211705 NormalizedString - Fix added chars at beginning 2020-01-02 14:17:14 -05:00
894ea1f8f0 utilize ::new() in ::default() 2020-01-02 10:56:41 -08:00
8ae0f2efdb set capacity on BPE cache, change Mutex to RwLock, create BpeBuilder (#24)
* set capacity on BPE cache, create BpeBuilder

* add doc comment

* switch from Mutex to RwLock

* vocab_and_merges
2020-01-02 09:26:50 -08:00
e3cf6a7b00 refactor benchmarks (#25)
* refactor benchmarks

* fix

* fix CI
2020-01-01 17:07:36 -08:00
138c48d92e add benchmark on many batches 2020-01-01 16:20:19 -08:00
b09511f5cf add better single threaded GPT2 benchmark 2020-01-01 15:48:53 -08:00
722b61230d BPE handles UNK token 2020-01-01 14:49:03 -05:00
75713ce809 Merge pull request #23 from huggingface/cache
Avoid creating unnecessary vectors when accessing cache
2020-01-01 14:47:28 -05:00
65471b4f2c Merge branch 'master' into cache 2020-01-01 14:10:20 -05:00
9a10acc981 don't create unnecessary vectors when accessing cache 2020-01-01 14:06:31 -05:00
a5c5e5840f Oops - Fix trainer 2020-01-01 13:36:42 -05:00
a7a5f9a67f BpeTrainer handles special tokens and limiting alphabet 2020-01-01 12:54:58 -05:00
ebf22198f3 Add benchmark framework and benches for BPE (GPT2) (#4)
* add benchmarks

* fix bench

* refactor BPE benchmarks

* fix

* remove un-needed gitignore

* update Cargo.lock

* fix

* small fix

* improve benchmarks

* move setup to Makefile

* benchmark BPE encode batch

* refactor batch benchmark
2020-01-01 07:35:57 -08:00
84c7a8623a Remove all printed logs 2020-01-01 01:45:24 -05:00
47e4b00e05 BpeTrainer shows some progress 2020-01-01 01:28:17 -05:00
f3aef0e4e6 Fix BPE saving (u32 => String) 2019-12-31 23:15:10 -05:00
90dfdc715d Expose Tokenizer parts 2019-12-31 22:57:47 -05:00
90df088054 Fix ByteLevel PreTokenizer
I broke it with my last changes. We cannot take a slice of a string by indexing on bytes obviously.
2019-12-31 15:09:51 -05:00
f28ca58fd9 [Fix #17] BPE & WordPiece models saving 2019-12-31 13:56:28 -05:00
2125e4d422 Merge pull request #21 from huggingface/dropout
Implement dropout for BPE
2019-12-30 19:39:29 -05:00
b21a5496a7 no cache when dropout 2019-12-30 15:58:16 -08:00
a642807fde fix clippy warnings 2019-12-30 14:23:32 -08:00
fdb8ffca27 fix comment 2019-12-30 14:18:08 -08:00
b28c3fd04c add doc 2019-12-30 14:15:26 -08:00
0be9e5a7f0 implement dropout for BPE 2019-12-30 14:14:26 -08:00
5194daa0ce Merge pull request #20 from huggingface/docs
Clean up Rust docs
2019-12-30 14:17:14 -05:00
d163bbadae remove redundant headers, other small cleanups 2019-12-30 10:46:56 -08:00
225a886382 Python - Expose Whitespace PreTokenizer 2019-12-30 13:10:33 -05:00
4677a09626 Python - Expose pad and truncate on Encoding 2019-12-30 12:56:07 -05:00
8ddb2de64e Update unicode-normalization to published crate 2019-12-30 12:18:00 -05:00
f5327f977e Merge pull request #19 from huggingface/handle-offsets
Handle offsets
2019-12-30 10:46:30 -05:00
06d515d41b Python - Add ability to retrieve a range of string 2019-12-29 01:37:03 -05:00
049029dc42 Python - Restore methods on Encoding 2019-12-29 01:26:42 -05:00
708a63514a Add ability to retrieve ranges or NormalizedString 2019-12-29 01:22:16 -05:00
9c574ad1b7 Python - Fix some import warnings 2019-12-29 00:43:32 -05:00
3779bf3e19 Python - Update example 2019-12-29 00:38:37 -05:00
3dcf9f763c Python - Update pre tokenizers with offsets 2019-12-29 00:37:58 -05:00
3f79d9d5e0 Python - Add normalizers bindings & BertNormalizer 2019-12-29 00:36:09 -05:00
81be029881 Fix - Handle errors during normalization 2019-12-29 00:24:01 -05:00
79b96dccd0 Fix lowercase/uppercase normalization
Since each character being lowercased or uppercased can actually
generate one or more characters, we need to keep track of the offsets
being updated in the process.
2019-12-29 00:19:49 -05:00
22ffa716a1 BertPreTokenizer pre tokenize only (with offsets) 2019-12-29 00:12:24 -05:00
cda9fae992 Add BertNormalizer with offsets tracking 2019-12-29 00:10:45 -05:00
ad9cc52d83 ByteLevel PreTokenizer handles offsets 2019-12-29 00:08:42 -05:00
35a8dfdd55 Whitespace PreTokenizer handles offsets 2019-12-28 15:50:42 -05:00
be00a1e45e Improve clarity for BertProcessing 2019-12-28 15:45:51 -05:00
d7af007539 BertProcessor handles NormalizedString merging 2019-12-28 15:30:57 -05:00