- We are failing on ambiguous tokenization (`AAA -> A + AA` vs `AA + A`).
  This may be linked to float precision and could be hard or impossible
  to fix (it should not hinder model performance); see the sketch below.
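
A minimal sketch of why such ties can flip (the scores are hypothetical
stand-ins for per-token log-probabilities): IEEE-754 addition is not
associative, so two segmentations whose mathematical scores are equal can
compare unequal depending on summation order.

```python
# Hypothetical per-token log-probability stand-ins.
scores = [0.1, 0.2, 0.3]

# Accumulate the same terms in two different orders, as two Viterbi
# paths (e.g. A + AA vs AA + A) might do internally.
path_one = (scores[0] + scores[1]) + scores[2]  # 0.6000000000000001
path_two = (scores[2] + scores[1]) + scores[0]  # 0.6

# Mathematically equal, but unequal in floating point, so an argmax
# over the two paths can break the "tie" either way.
print(path_one == path_two)  # False
```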
- We now enable `fuse_unk` by default, matching `spm_train`'s behavior
  (see the sketch below).
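
A sketch of what this looks like from the Python bindings, assuming a BPE
model with `<unk>` as the unknown token; the exact place where the default
is applied is not shown here.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# With fuse_unk=True, a run of unknown characters is emitted as a single
# <unk> token rather than one <unk> per character, as spm_train does.
tokenizer = Tokenizer(BPE(unk_token="<unk>", fuse_unk=True))
```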
- We are still failing on at least space deduplication, which should
  probably be handled by a pre-tokenizer (see the sketch below).
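
One possible shape for that fix, sketched here with a `Replace` normalizer
that collapses runs of spaces; a pre-tokenizer could host the same regex,
and the exact stage and pattern are assumptions.

```python
from tokenizers import Regex, normalizers

# Collapse any run of two or more spaces into a single space before
# the model sees the text.
dedup_spaces = normalizers.Replace(Regex(" {2,}"), " ")

print(dedup_spaces.normalize_str("hello    world"))  # "hello world"
```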