22ffa716a1
BertPreTokenizer pre tokenize only (with offsets)
2019-12-29 00:12:24 -05:00
cda9fae992
Add BertNormalizer with offsets tracking
2019-12-29 00:10:45 -05:00
ad9cc52d83
ByteLevel PreTokenizer handles offsets
2019-12-29 00:08:42 -05:00
35a8dfdd55
Whitespace PreTokenizer handles offsets
2019-12-28 15:50:42 -05:00
be00a1e45e
Improve clarity for BertProcessing
2019-12-28 15:45:51 -05:00
d7af007539
BertProcessor handles NormalizedString merging
2019-12-28 15:30:57 -05:00
f4df7f5e2a
Update Tokenizer with NormalizedString & Encoding
2019-12-28 15:28:44 -05:00
4afcb1ef96
PreTokenizers handle offsets
2019-12-28 15:28:21 -05:00
8c40c89836
Encoding uses NormalizedString
2019-12-28 15:25:50 -05:00
162829b7a9
Introduce NormalizedString
2019-12-28 15:24:09 -05:00
96ef467bbf
Use forked unicode-normalization
2019-12-28 15:22:52 -05:00
a4beecf944
WordPiece handles offsets
2019-12-28 15:22:03 -05:00
5682627223
BPE handles offsets
2019-12-28 15:21:50 -05:00
5d9848ad6c
Models handles offsets
2019-12-28 15:21:29 -05:00
839239d3b4
Bump version
2019-12-27 10:43:34 -05:00
bddf7ba737
Python - Fix building from wheels
2019-12-27 10:39:19 -05:00
ffd28ba558
Bump for release
2019-12-26 14:56:13 -05:00
74cc6f6bde
Python - Simplify padding interface
2019-12-26 14:34:13 -05:00
d1e59e09bf
Fix a bug when adding special tokens
If we add special tokens that are already part of the model's vocabulary, the tokens aren't added to the tokenizer, which then builds an empty regex. This completely breaks the tokenization.
2019-12-26 14:32:50 -05:00
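The failure mode described in d1e59e09bf can be sketched in Python. This is an illustration under assumptions, not the library's code: `build_added_tokens_pattern` and `split_on_added_tokens` are hypothetical helpers, and the guard shown is one possible fix, not necessarily the one made in the commit. It shows that an alternation built from an empty token list is an empty pattern, which matches at every position of the input.

```python
import re

def build_added_tokens_pattern(tokens):
    # Hypothetical helper: one alternation that matches any added token.
    return "|".join(re.escape(t) for t in tokens)

def split_on_added_tokens(text, tokens):
    # Split the text around added tokens, keeping the tokens themselves.
    pattern = build_added_tokens_pattern(tokens)
    if not pattern:
        # Guard: with no added tokens the pattern is empty and would match
        # everywhere, so return the text untouched instead.
        return [text]
    pieces = re.split(f"({pattern})", text)
    return [p for p in pieces if p]

# With no tokens, the pattern is the empty string...
empty_pattern = build_added_tokens_pattern([])
# ...and an empty pattern matches before every character and at the end.
empty_matches = re.findall(empty_pattern, "abc")
```

Without the guard, every position in the input counts as a match, which is consistent with the broken tokenization the commit message reports.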
d93d4fc3cd
Python - Simplify truncation interface
2019-12-26 10:35:20 -05:00
a7734ffc9f
Python - Update doc and readme for add_prefix_space
2019-12-26 10:34:53 -05:00
1879cb0bcb
Python - change with_added_tokens as kwarg
2019-12-25 22:22:35 -05:00
905c1eb77e
Python - update some packages
2019-12-25 22:16:43 -05:00
597031b973
Python - remove unused variable
2019-12-25 22:16:11 -05:00
9d289d357d
Python - change add_prefix_space as kwarg
2019-12-25 22:10:17 -05:00
4bc5a7bbe7
Python - fix example
2019-12-24 11:20:40 -05:00
cf0e8917cd
Fix whitespace handling in ByteLevel
2019-12-24 11:20:26 -05:00
9f1421a04b
remove Cargo.lock (#7)
2019-12-23 21:22:42 -08:00
c0ed873c4d
simplify initialization of BpeTrainer
2019-12-23 20:13:48 -05:00
fab1d4cabc
Bump version for release
2019-12-23 17:28:38 -05:00
e01d4f2052
Python - Remove misleading __repr__
2019-12-23 17:27:59 -05:00
2159123d7c
Fix truncate
2019-12-23 17:27:43 -05:00
8fb94be3d0
Merge pull request #6 from huggingface/BPE-tests
Add BPE tests and documentation
2019-12-20 15:34:38 -05:00
9a91016877
Merge branch 'master' into BPE-tests
2019-12-20 08:55:41 -08:00
2266960ef7
Bump version and update Readme
2019-12-20 10:26:40 -05:00
f2b9c30ad9
Handle vocab size with added tokens
2019-12-19 20:19:56 -05:00
b7040e0412
Option to skip special tokens while decoding
2019-12-19 20:03:02 -05:00
a8d68d516d
Handle special tokens
2019-12-19 19:48:16 -05:00
7f032b62df
Include the added tokens while converting tokens and ids
2019-12-19 18:32:37 -05:00
076ba297fb
Cannot add new tokens that already exist in the vocab
2019-12-19 18:32:03 -05:00
6d51e7a393
add example / doc test for BPE trainer
2019-12-19 15:28:58 -08:00
69212e17e9
formatting
2019-12-19 15:07:27 -08:00
a16daa78f1
add test for word merge
2019-12-19 14:45:38 -08:00
184b09e3ac
add more tests
2019-12-18 17:40:13 -08:00
1dc0debe36
add initial test
2019-12-18 16:45:11 -08:00
9763282d59
Bump version for release
2019-12-17 18:42:34 -05:00
4d14b08afe
ByteLevel handles prefix spaces
2019-12-17 18:35:40 -05:00
6766585965
Python - Do not expose non working features of Encoding
2019-12-17 17:43:42 -05:00
0a3d4a86a9
Python - Update bindings for BertPreTokenizer
2019-12-17 17:40:56 -05:00
e54eee7657
BasicPreTokenizer => BertPreTokenizer
2019-12-17 17:37:13 -05:00