tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-06 04:38:23 +00:00

Author	SHA1	Message	Date
Nicolas Patry	ee3860c029	Enabling training parity check for tokenizers.UnigramTrainer	2020-09-02 13:38:14 -04:00
Nicolas Patry	dd91739ba0	Now spm_parity_check succeeds because we have the correct pre_tokenizer.	2020-09-02 16:32:50 +02:00
Nicolas Patry	e974cfb1c9	Formatting after rebase.	2020-09-02 16:32:50 +02:00
Nicolas Patry	439305eea0	Failing test for compatibility for SentencePieceUnigramTokenizer. - We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A). Could be linked to float precision and hard or impossible to fix (should not hinder model performance) - We are now fusing_unk by default, as is the case with spm_train - We are still failing on at least space deduplication. Probably should be handled by a pre-tokenizer.	2020-09-02 16:32:50 +02:00
Nicolas Patry	6887c0f04d	Black pass.	2020-08-31 14:05:39 -04:00
Sławomir Dadas	0865a9ad55	Python - improve compatibility with sentencepiece in the conversion script	2020-04-11 17:35:50 +02:00
Morgan Funtowicz	be10f542ce	Added SentencePiece and YouTokenToMe model extractors. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-08 22:55:00 +01:00
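
The ambiguity mentioned in commit 439305eea0 (AAA -> A + AA vs AA + A) can be reproduced with a toy unigram model. The sketch below is illustrative only: the vocabulary, the log-probabilities, and the `viterbi_segment` helper with its `prefer_later_on_tie` flag are hypothetical and are not taken from tokenizers or SentencePiece. In this toy case both segmentations receive exactly the same score, so the result depends purely on how ties are broken. The commit's suggestion that float precision may play a role in the real implementations is the author's hypothesis and is not tested here.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (illustrative values only).
LOG_PROBS = {"A": -1.0, "AA": -1.5}


def viterbi_segment(text, log_probs, prefer_later_on_tie=False):
    """Return (pieces, score) for the best unigram segmentation of `text`.

    `prefer_later_on_tie` is an illustrative knob: when two paths reach the
    same position with equal scores, it keeps the one whose last piece
    starts later instead of the one found first.
    """
    n = len(text)
    best_score = [-math.inf] * (n + 1)
    best_start = [0] * (n + 1)
    best_score[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs or best_score[start] == -math.inf:
                continue
            score = best_score[start] + log_probs[piece]
            is_better = score > best_score[end]
            is_tie = prefer_later_on_tie and score == best_score[end]
            if is_better or is_tie:
                best_score[end] = score
                best_start[end] = start

    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]


first, score_first = viterbi_segment("AAA", LOG_PROBS)
later, score_later = viterbi_segment("AAA", LOG_PROBS, prefer_later_on_tie=True)
print(first, score_first)
print(later, score_later)
```

Two implementations that agree on every score can therefore still disagree on the output for such inputs, which is consistent with the commit's remark that the mismatch should not hinder model performance.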