d3c3f5a700
Python - Expose ByteLevel alphabet
2020-01-02 18:06:06 -05:00
f0f9aefd07
ByteLevel exposes its alphabet
2020-01-02 17:51:26 -05:00
7b12b3cca5
BpeTrainer handles initial alphabet
2020-01-02 15:01:22 -05:00
c8a5d2e32a
NormalizedString - Fix removal around edges
2020-01-02 14:17:14 -05:00
66b6211705
NormalizedString - Fix added chars at beginning
2020-01-02 14:17:14 -05:00
894ea1f8f0
utilize ::new() in ::default()
2020-01-02 10:56:41 -08:00
8ae0f2efdb
set capacity on BPE cache, change Mutex to RwLock, create BpeBuilder (#24)
* set capacity on BPE cache, create BpeBuilder
* add doc comment
* switch from Mutex to RwLock
* vocab_and_merges
2020-01-02 09:26:50 -08:00
e3cf6a7b00
refactor benchmarks (#25)
* refactor benchmarks
* fix
* fix CI
2020-01-01 17:07:36 -08:00
138c48d92e
add benchmark on many batches
2020-01-01 16:20:19 -08:00
b09511f5cf
add better single threaded GPT2 benchmark
2020-01-01 15:48:53 -08:00
722b61230d
BPE handles UNK token
2020-01-01 14:49:03 -05:00
75713ce809
Merge pull request #23 from huggingface/cache
Avoid creating unnecessary vectors when accessing cache
2020-01-01 14:47:28 -05:00
65471b4f2c
Merge branch 'master' into cache
2020-01-01 14:10:20 -05:00
9a10acc981
don't create unnecessary vectors when accessing cache
2020-01-01 14:06:31 -05:00
a5c5e5840f
Oops - Fix trainer
2020-01-01 13:36:42 -05:00
a7a5f9a67f
BpeTrainer handles special tokens and limiting alphabet
2020-01-01 12:54:58 -05:00
ebf22198f3
Add benchmark framework and benches for BPE (GPT2) (#4)
* add benchmarks
* fix bench
* refactor BPE benchmarks
* fix
* remove un-needed gitignore
* update Cargo.lock
* fix
* small fix
* improve benchmarks
* move setup to Makefile
* benchmark BPE encode batch
* refactor batch benchmark
2020-01-01 07:35:57 -08:00
84c7a8623a
Remove all printed logs
2020-01-01 01:45:24 -05:00
47e4b00e05
BpeTrainer shows some progress
2020-01-01 01:28:17 -05:00
f3aef0e4e6
Fix BPE saving (u32 => String)
2019-12-31 23:15:10 -05:00
90dfdc715d
Expose Tokenizer parts
2019-12-31 22:57:47 -05:00
90df088054
Fix ByteLevel PreTokenizer
I broke it with my last changes. We cannot take a slice of a string by indexing on bytes obviously.
2019-12-31 15:09:51 -05:00
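The byte-indexing problem described in the commit above can be sketched as follows. This is a minimal illustration, not the code from that commit: the helper name `slice_by_chars` is hypothetical, and it assumes the caller holds character positions rather than byte positions.

```rust
/// Hypothetical helper: slice `s` using character positions instead of byte
/// positions. Returns None when the range is reversed or out of bounds.
fn slice_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every char boundary, plus the end of the string.
    let boundaries: Vec<usize> = s
        .char_indices()
        .map(|(byte_pos, _)| byte_pos)
        .chain(std::iter::once(s.len()))
        .collect();
    let from = *boundaries.get(start)?;
    let to = *boundaries.get(end)?;
    Some(&s[from..to])
}

fn main() {
    // "héllo": the 'é' takes two bytes in UTF-8, so byte 2 is mid-character.
    let text = "h\u{e9}llo";
    // Checked byte slicing refuses a range that ends inside a character.
    println!("{:?}", text.get(1..2));
    // Slicing by character positions goes through real char boundaries.
    println!("{:?}", slice_by_chars(text, 1, 3));
}
```

Indexing with `&text[1..2]` directly would panic at runtime, which is why the sketch goes through `char_indices` first.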
f28ca58fd9
[Fix #17] BPE & WordPiece models saving
2019-12-31 13:56:28 -05:00
2125e4d422
Merge pull request #21 from huggingface/dropout
Implement dropout for BPE
2019-12-30 19:39:29 -05:00
b21a5496a7
no cache when dropout
2019-12-30 15:58:16 -08:00
a642807fde
fix clippy warnings
2019-12-30 14:23:32 -08:00
fdb8ffca27
fix comment
2019-12-30 14:18:08 -08:00
b28c3fd04c
add doc
2019-12-30 14:15:26 -08:00
0be9e5a7f0
implement dropout for BPE
2019-12-30 14:14:26 -08:00
5194daa0ce
Merge pull request #20 from huggingface/docs
Clean up Rust docs
2019-12-30 14:17:14 -05:00
d163bbadae
remove redundant headers, other small cleanups
2019-12-30 10:46:56 -08:00
225a886382
Python - Expose Whitespace PreTokenizer
2019-12-30 13:10:33 -05:00
4677a09626
Python - Expose pad and truncate on Encoding
2019-12-30 12:56:07 -05:00
8ddb2de64e
Update unicode-normalization to published crate
2019-12-30 12:18:00 -05:00
f5327f977e
Merge pull request #19 from huggingface/handle-offsets
Handle offsets
2019-12-30 10:46:30 -05:00
06d515d41b
Python - Add ability to retrieve a range of string
2019-12-29 01:37:03 -05:00
049029dc42
Python - Restore methods on Encoding
2019-12-29 01:26:42 -05:00
708a63514a
Add ability to retrieve ranges of NormalizedString
2019-12-29 01:22:16 -05:00
9c574ad1b7
Python - Fix some import warnings
2019-12-29 00:43:32 -05:00
3779bf3e19
Python - Update example
2019-12-29 00:38:37 -05:00
3dcf9f763c
Python - Update pre tokenizers with offsets
2019-12-29 00:37:58 -05:00
3f79d9d5e0
Python - Add normalizers bindings & BertNormalizer
2019-12-29 00:36:09 -05:00
81be029881
Fix - Handle errors during normalization
2019-12-29 00:24:01 -05:00
79b96dccd0
Fix lowercase/uppercase normalization
Since each character being lowercased or uppercased can actually
generate one or more characters, we need to keep track of the offsets
being updated in the process.
2019-12-29 00:19:49 -05:00
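The offset issue described in the commit above can be illustrated with a small sketch. It is not the `NormalizedString` implementation: the function name `lowercase_with_alignment` and the choice to store one source index per output character are assumptions made for the example.

```rust
/// Hypothetical sketch: lowercase `input` and record, for every output
/// character, the index of the input character it was produced from.
fn lowercase_with_alignment(input: &str) -> (String, Vec<usize>) {
    let mut lowered = String::new();
    let mut alignment = Vec::new();
    for (index, original) in input.chars().enumerate() {
        // `char::to_lowercase` returns an iterator because one character
        // may lowercase to several characters.
        for produced in original.to_lowercase() {
            lowered.push(produced);
            alignment.push(index);
        }
    }
    (lowered, alignment)
}

fn main() {
    // U+0130 (capital I with dot above) lowercases to 'i' followed by
    // U+0307 (combining dot above), so the output grows by one character.
    let (lowered, alignment) = lowercase_with_alignment("a\u{130}b");
    println!("{:?} {:?}", lowered, alignment);
}
```

With the alignment vector, a position in the lowercased text can still be traced back to the character it came from in the original text, even when the two strings differ in length.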
22ffa716a1
BertPreTokenizer pre tokenize only (with offsets)
2019-12-29 00:12:24 -05:00
cda9fae992
Add BertNormalizer with offsets tracking
2019-12-29 00:10:45 -05:00
ad9cc52d83
ByteLevel PreTokenizer handles offsets
2019-12-29 00:08:42 -05:00
35a8dfdd55
Whitespace PreTokenizer handles offsets
2019-12-28 15:50:42 -05:00
be00a1e45e
Improve clarity for BertProcessing
2019-12-28 15:45:51 -05:00
d7af007539
BertProcessor handles NormalizedString merging
2019-12-28 15:30:57 -05:00