- We are failing on ambiguous tokenizations (AAA -> A + AA vs AA + A).
This could be linked to float precision and may be hard or impossible
to fix (it should not hinder model performance).
- We now enable fusing_unk by default, as is the case with spm_train.
- We are still failing on at least space deduplication. This should
probably be handled by a pre-tokenizer.
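
To illustrate the ambiguous case from the first item: a minimal sketch, assuming a hypothetical two-piece unigram vocabulary (the probabilities are made up, not taken from any real model). It enumerates every segmentation of "AAA" and shows that A + AA and AA + A reach the same best score, so which one a decoder returns depends only on tie-breaking or rounding.

```python
import math

# Hypothetical unigram vocabulary with log-probabilities (illustrative values only).
VOCAB = {"A": math.log(0.5), "AA": math.log(0.3)}


def segmentations(text, vocab):
    """Enumerate every way to split `text` into pieces found in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def score(pieces, vocab):
    """Unigram score of a segmentation: the sum of piece log-probabilities."""
    return sum(vocab[p] for p in pieces)


candidates = segmentations("AAA", VOCAB)
scored = [(score(c, VOCAB), c) for c in candidates]
best = max(s for s, _ in scored)

# Keep every segmentation whose score matches the best one within a tolerance.
ties = [c for s, c in scored if math.isclose(s, best, abs_tol=1e-9)]
print(ties)
```

Both tied segmentations carry the same likelihood, which is consistent with the note that the mismatch should not hinder model performance.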
- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pre-tokenizers iteratively
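
The behaviour of the three pre-tokenizers listed above can be sketched as follows. This is a plain-Python illustration, not the library's API: the function names `deduplication`, `punctuation` and `sequence` are hypothetical, punctuation is assumed to mean ASCII punctuation, and each pre-tokenizer is modelled as a function from a list of strings to a list of strings.

```python
import re
import string


def deduplication(pieces):
    """Collapse runs of two or more spaces into a single space in each piece."""
    return [re.sub(r" {2,}", " ", p) for p in pieces]


def punctuation(pieces):
    """Split every ASCII punctuation character out as its own piece."""
    pattern = "([" + re.escape(string.punctuation) + "])"
    out = []
    for p in pieces:
        # The capturing group keeps the punctuation itself in the split result;
        # empty strings produced at the boundaries are dropped.
        out.extend(part for part in re.split(pattern, p) if part)
    return out


def sequence(pretokenizers):
    """Build a pre-tokenizer that applies each given pre-tokenizer in order."""
    def run(pieces):
        for pretokenizer in pretokenizers:
            pieces = pretokenizer(pieces)
        return pieces
    return run


pre_tokenize = sequence([deduplication, punctuation])
result = pre_tokenize(["Hello,  world!"])
print(result)
```

Order matters in the sequence: running deduplication first means later steps never see repeated spaces.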
This allows testing versions not built in-place. Otherwise
importing (or testing) in the package root fails without develop
builds.
Replace maturin with setuptools_rust since maturin fails with a
proper project structure.