Commit Graph

12 Commits

Author SHA1 Message Date
4b0dc6b947 Fix SPM conversions (#686)
* Fix SPM conversions

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:55:55 -04:00
e999a7b5f9 Revert "Fix SPM conversions"
This reverts commit e1ffe39764.
2021-04-21 18:09:58 -04:00
e1ffe39764 Fix SPM conversions 2021-04-21 18:09:49 -04:00
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
598ce61229 Removed now-wrong code in convert.py, fixed strange black magic. 2020-09-24 08:57:02 +02:00
8f8156fd2c Addressing first pass of comments. 2020-09-24 08:57:02 +02:00
9d3a93db5b Going back to not enabling fuse_unk by default for BPE, but adding a flag to
enable it.
2020-09-22 16:27:09 -04:00
033b98ce59 Updating convert scripts with Replace normalizer. 2020-09-22 08:21:38 +02:00
c59b216baa Fixing convert/check scripts. 2020-09-22 08:21:38 +02:00
b16406c900 Moving StripAccents within normalizer for Albert + XLNet, but it now crashes
in Precompiled. Are the offsets wrong?
2020-09-22 08:21:38 +02:00
275ee6d4c4 Making convert script machine agnostic. 2020-09-22 08:21:38 +02:00
2fd1d9cf06 Adding a new convert script that will convert all Python Tokenizer code
into a proper Rust Tokenizer format and check it on a file.

- Also fuse_unks by default in `tokenizers`'s BPE.
2020-09-22 08:21:38 +02:00