- We are failing on ambiguous tokenizations (e.g. `AAA` -> `A + AA` vs. `AA + A`): both segmentations carry the same total score, so the winner comes down to tie-breaking and float precision. This may be hard or impossible to fix, but it should not hurt model performance.
- We now enable `fuse_unk` by default, matching the behavior of `spm_train`.
- We are still failing on at least space deduplication; this should probably be handled by a pre-tokenizer (see the sketch below).
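A minimal sketch, through the Python bindings, of how the last two points could be configured. It uses a BPE model purely for illustration because its bindings expose a `fuse_unk` flag directly; the `Replace` normalizer is one hypothetical way to collapse repeated spaces, not the repo's actual fix.

```python
from tokenizers import Tokenizer, Regex, normalizers, pre_tokenizers
from tokenizers.models import BPE

# fuse_unk=True collapses runs of consecutive unknown tokens into a single
# <unk>, matching what spm_train does.
tokenizer = Tokenizer(BPE(unk_token="<unk>", fuse_unk=True))

# One possible way to handle space deduplication: collapse runs of spaces
# during normalization, so "a  b" and "a b" tokenize identically.
tokenizer.normalizer = normalizers.Replace(Regex(r" {2,}"), " ")

# SentencePiece-style pre-tokenization: mark word boundaries with "▁".
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
```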