Commit Graph

858 Commits

SHA1 Message Date
36e3c28a23 Ignore .vim folder 2020-05-01 17:11:54 -04:00
dbc8e68c68 Python - Update tests for new encode 2020-05-01 17:11:54 -04:00
2e105c4258 Python - Update typings for new encode 2020-05-01 17:11:54 -04:00
835f08ab02 Python - Update bindings for new encode 2020-05-01 17:11:54 -04:00
993c1c80a8 Rust - Add some tests 2020-05-01 17:11:54 -04:00
d7a5496606 Rust - encode_batch uses the new interface 2020-05-01 17:07:10 -04:00
52fda08f6e Rust - Update tests with new encode interface 2020-05-01 17:07:10 -04:00
6ed5ce22e0 Rust - Encode uses the new interface 2020-05-01 17:07:09 -04:00
15aae7bab2 Rust - Further improve encode interface 2020-04-24 22:44:12 -04:00
b5d47754ad Rust - New encode interface 2020-04-24 22:44:11 -04:00
02cc97756f Rust - Improve TruncationError 2020-04-24 12:13:17 -04:00
7d2b59b0aa Rust - Add len() and is_empty() on Encoding 2020-04-24 11:44:10 -04:00
9d75e38cc9 Merge pull request #241 from jaymody/master 2020-04-22 16:12:20 -04:00
    Python - Fix bug in bert wordpiece example script
a28fd29204 Python - Fix bug in bert wordpiece example script 2020-04-18 17:50:52 -04:00
670f619ab5 Python - bump to 0.7.0 for final release 2020-04-17 12:48:10 -04:00
5be775df0e Rust - ByteLevel can trim "real" whitespaces too 2020-04-17 10:47:39 -04:00
    This shouldn't be needed in most cases, but if the tokens include
    an AddedToken with a whitespace, it will handle this case too.
3312ad75d9 Python - Bump to 0.7.0rc6 for release 2020-04-16 19:39:04 -04:00
db25a29e96 Python - Update CharBPETokenizer to match GPT BPE (#239) 2020-04-16 19:36:41 -04:00
0756480b83 Fix offsets (#238) 2020-04-16 19:35:07 -04:00
ad0e488998 Python - Update changelog 2020-04-16 19:32:54 -04:00
249a282f1d Python - Fix style 2020-04-16 19:31:00 -04:00
77590b9291 style! 2020-04-17 01:29:52 +02:00
7216486686 Update CharLevelBPE 2020-04-17 01:15:02 +02:00
873ac2d9a8 Python - Add missing char_to_word 2020-04-16 18:20:30 -04:00
0524efa8a4 Rust - Fix trimming trailing offset 2020-04-16 16:49:10 -04:00
75e88464a7 Make bytelevel trim offsets test fail 😬 2020-04-16 16:18:10 -04:00
1865ec8d66 Node - Tweak robertaProcessing param types 2020-04-16 16:17:20 -04:00
bdfb02f473 Python - Bump to 0.7.0rc6 for release 2020-04-16 14:42:22 -04:00
5945d2892c Improve mappings (#234) 2020-04-16 14:36:53 -04:00
8834508547 Update CHANGELOGs 2020-04-16 14:25:19 -04:00
71b7830d1b Rust | Python | Node - Also add char_to_word 2020-04-16 14:23:37 -04:00
4aecd82d07 Node - Improve mappings on Encoding 2020-04-16 14:23:37 -04:00
c5e22c14cb Python - Improve mappings on Encoding 2020-04-16 14:23:37 -04:00
3fb347b453 Rust - Improve mappings on Encoding 2020-04-16 14:23:35 -04:00
0de276d2a9 Fix offsets (#236) 2020-04-16 14:22:50 -04:00
c96c4d95bd Update CHANGELOGs 2020-04-16 10:34:34 -04:00
95d4ee18f7 Node - Add offsets trimming to RobertaProcessing 2020-04-15 19:15:32 -04:00
81e2cc2fc4 Python - Add offsets trimming to RobertaProcessing 2020-04-15 18:49:38 -04:00
7caa4d94d2 Rust - Add offsets trimming to RobertaProcessing 2020-04-15 18:34:12 -04:00
6058f7576e Rust - ByteLevel also trims overflowing encodings 2020-04-15 17:24:15 -04:00
690a0dfb6d Rust - Fix ByteLevel trimming original offsets 2020-04-15 17:07:24 -04:00
26d4aa3c79 Rust - Fix offsets when merging multiple sequences 2020-04-15 16:41:57 -04:00
    When the input sequence gets split into multiple sub-sequences,
    there may be changes in the offsets (original <=> normalized) that
    don't get reverted when merging back to one single sequence.
    So in order to avoid this, we have to convert back to original offsets
    before actually merging the various encodings and normalized strings back
    together.
c164baf539 Node - Version 0.6.2 2020-04-13 16:57:44 -04:00
38d53a7b84 Node - Expose more bindings 2020-04-13 16:48:32 -04:00
a42f3581ba Python - improve compatibility with sentencepiece in the conversion script (#229) 2020-04-13 10:48:07 -04:00
0865a9ad55 Python - improve compatibility with sentencepiece in the conversion script 2020-04-11 17:35:50 +02:00
09104afd07 Python - Bump to 0.7.0-rc5 for release 2020-04-09 11:41:10 -04:00
af66d6fc6f Rust - Bump to 0.10.1 for release 2020-04-09 11:30:59 -04:00
f9c76b6c82 Python - Use PyO3 0.9.2 (#227) 2020-04-09 11:26:36 -04:00
a6c33f5de8 Python - update some dependencies 2020-04-09 10:56:26 -04:00