* Removing all pre_tokenizer logic from Unigram algorithm.
* Improving *a lot* the parity check.
- We can now detect a lot more errors
- Special cases have been added temporarily.
* Adding 2 new normalizers that mimick spm defaut's behavior.
* Adding `encoding_optimized` version of the `encode` algorithm.
- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
allocation of the full results.
* Trie<char> -> Trie<u8> Another improvement on speed.
* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.
* Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`.
- It will be used for direct compatiblity with `Spm` and replace all
their custom rules by using directly the normalizer spec embedded
within spm files, removing all need for any rules for us.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.
* Fixing a bug with Precompiled normalizer.
* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.
It seems a very handy crafted trie does not prevent from shooting
oneself in the foot. Sorry future reader.
* Keep API stable for this PR (change of the API should come later #409).
- Removed sentencepiece_model_pb2 from binding and add instructions to
make `from_spm` work.
* Adding model check in `from_spm`.
* Adressing @n1t0's comments.
* Adding a check to make sure alignments stay correct.
Also added a bit more documentation on how Precompiled works.
* Extracting `Precompiled` into it's own `spm_precompiled` crate.
* Using ranges in `do_nmt`.
- We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A).
Could be linked to float precision and hard or impossible to fix
(should not hinder model performance)
- We are now fusing_unk by default as it's the case with spm_train
- We are still failing on at least space deduplication. Probably should
be handlded by a pre-tokenizer.