7 Commits

Author SHA1 Message Date
Nicolas Patry
ee3860c029 Enabling training parity check for tokenizers.UnigramTrainer 2020-09-02 13:38:14 -04:00
Nicolas Patry
dd91739ba0 Now spm_parity_check succeeds because we have the correct pre_tokenizer. 2020-09-02 16:32:50 +02:00
Nicolas Patry
e974cfb1c9 Formatting after rebase. 2020-09-02 16:32:50 +02:00
Nicolas Patry
439305eea0 Failing test for compatibility for SentencePieceUnigramTokenizer.
- We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A).
  Could be linked to float precision and hard or impossible to fix
  (should not hinder model performance).

- We are now fusing_unk by default, as is the case with spm_train.

- We are still failing on at least space deduplication. This should probably
  be handled by a pre-tokenizer.
2020-09-02 16:32:50 +02:00
Nicolas Patry
6887c0f04d Black pass. 2020-08-31 14:05:39 -04:00
Sławomir Dadas
0865a9ad55 Python - improve compatibility with sentencepiece in the conversion script 2020-04-11 17:35:50 +02:00
Morgan Funtowicz
be10f542ce Added SentencePiece and YouTokenToMe model extractors.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-08 22:55:00 +01:00