- We are failing on ambiguous tokenizations (e.g. AAA -> A + AA vs AA + A).
This is likely linked to float precision, and may be hard or impossible to fix
(it should not hinder model performance).
- We now fuse unknown tokens (fuse_unk) by default, as spm_train does
(see the fusing sketch after this list).
- We are still failing on at least space deduplication. This should probably
be handled by a pre-tokenizer (see the composition sketch after this list):
- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pretokenizers iteratively
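For context, fuse_unk collapses a run of consecutive unknown tokens into a
single unk token, which is what spm_train does. A minimal standalone sketch of
the fusing step (illustrative code, not the actual tokenizers internals; the
ids are made up):

```rust
/// Collapse runs of consecutive unknown-token ids into a single unk id,
/// mirroring the fuse_unk behavior described above. `unk_id` is illustrative.
fn fuse_unk(ids: &[u32], unk_id: u32) -> Vec<u32> {
    let mut fused = Vec::with_capacity(ids.len());
    for &id in ids {
        // Drop an unk that directly follows another unk.
        if id == unk_id && fused.last() == Some(&unk_id) {
            continue;
        }
        fused.push(id);
    }
    fused
}

fn main() {
    // Two adjacent unknown pieces (id 0) become one; isolated unks are kept.
    assert_eq!(fuse_unk(&[5, 0, 0, 7, 0], 0), vec![5, 0, 7, 0]);
}
```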
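And a minimal sketch of how such pre-tokenizers can compose (illustrative
trait and string-to-string interface, not the actual tokenizers API, which
tracks splits and offsets; Punctuation is omitted for brevity):

```rust
// Illustrative trait, not the actual tokenizers PreTokenizer trait.
trait PreTokenizer {
    fn pre_tokenize(&self, s: String) -> String;
}

/// Deduplication: removes duplicate spaces within strings.
struct Deduplication;
impl PreTokenizer for Deduplication {
    fn pre_tokenize(&self, s: String) -> String {
        let mut out = String::with_capacity(s.len());
        let mut prev_space = false;
        for c in s.chars() {
            if c == ' ' && prev_space {
                continue; // skip the duplicate space
            }
            prev_space = c == ' ';
            out.push(c);
        }
        out
    }
}

/// Sequence: applies a list of pre-tokenizers iteratively.
struct Sequence(Vec<Box<dyn PreTokenizer>>);
impl PreTokenizer for Sequence {
    fn pre_tokenize(&self, s: String) -> String {
        self.0.iter().fold(s, |acc, p| p.pre_tokenize(acc))
    }
}

fn main() {
    let seq = Sequence(vec![Box::new(Deduplication)]);
    assert_eq!(seq.pre_tokenize("a  b   c".into()), "a b c");
}
```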
* Derive Clone on Tokenizer and AddedVocabulary.
* Replace Container with Arc wrapper for Decoders.
* Prefix Rust Decoder types with Py.
* Rename PyDecoder to CustomDecoder.
* Change panic in serializing custom decoder to exception.
* Re-enable training with cloneable Tokenizer.
* Remove unsound Container, use Arc wrappers instead (see the sketch below).
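The old Container held raw pointers into Python-owned objects and was unsound;
an Arc shares ownership safely, and cloning it only bumps a reference count,
so Clone can be derived all the way up to Tokenizer, which training needs. A
minimal sketch of the pattern (illustrative types, not the actual binding
code):

```rust
use std::sync::Arc;

// Illustrative stand-in for the Decoder trait in tokenizers.
trait Decoder: Send + Sync {
    fn decode(&self, tokens: Vec<String>) -> String;
}

// The Py-prefixed wrapper exposed to Python holds the decoder behind an Arc
// instead of the old Container, so the wrapper (and anything containing it)
// can simply derive Clone.
#[derive(Clone)]
struct PyDecoder {
    decoder: Arc<dyn Decoder>,
}

#[derive(Clone)]
struct Tokenizer {
    decoder: Option<PyDecoder>,
}

struct Join;
impl Decoder for Join {
    fn decode(&self, tokens: Vec<String>) -> String {
        tokens.join(" ")
    }
}

fn main() {
    let t = Tokenizer {
        decoder: Some(PyDecoder { decoder: Arc::new(Join) }),
    };
    // Cloning is cheap and safe: both tokenizers share one decoder.
    let t2 = t.clone();
    let tokens = vec!["hello".to_string(), "world".to_string()];
    assert_eq!(t2.decoder.unwrap().decoder.decode(tokens), "hello world");
}
```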
* Prefix the Python types in Rust with Py; rename PyPretokenizer
to CustomPretokenizer.
* Remove unsound Container wrappers, replace with Arc.
* Change panic on trying to (de-)serialize a custom pretokenizer to an
exception (see the sketch below).
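A minimal sketch of the panic-to-exception change (assuming serde; the type
name is illustrative). Returning a serde error instead of panicking lets the
Python bindings surface it as a catchable exception:

```rust
use serde::ser::{Error as _, Serialize, Serializer};

// Stand-in for a user-defined Python pretokenizer wrapped on the Rust side.
struct CustomPretokenizer;

impl Serialize for CustomPretokenizer {
    fn serialize<S: Serializer>(&self, _serializer: S) -> Result<S::Ok, S::Error> {
        // Before: panic!("Custom PreTokenizer cannot be serialized")
        // After: a serde error, which reaches Python as an exception.
        Err(S::Error::custom("Custom PreTokenizer cannot be serialized"))
    }
}
```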
* Implement the changes required by making Model generic in Tokenizer
(see the sketch below).
* Temporarily disable training in Python since Clone can't be
derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
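The constraint is mechanical: with Tokenizer generic over Model, the derived
Clone only exists when the Model itself is Clone, and training needs to clone
the tokenizer. An illustrative sketch (not the actual tokenizers types):

```rust
// Illustrative trait and types, not the actual tokenizers API.
trait Model {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

// The derive expands to roughly `impl<M: Model + Clone> Clone for ...`,
// so Tokenizer<M> is Clone only when M is.
#[derive(Clone)]
struct Tokenizer<M: Model> {
    model: M,
}

// Training clones the tokenizer, hence the `M: Clone` bound; until every
// component is Clone, this path stays disabled.
fn train<M: Model + Clone>(tokenizer: &Tokenizer<M>) -> Tokenizer<M> {
    tokenizer.clone()
}

#[derive(Clone)]
struct Whitespace;
impl Model for Whitespace {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(String::from).collect()
    }
}

fn main() {
    let t = Tokenizer { model: Whitespace };
    let trained = train(&t); // compiles only because Whitespace is Clone
    assert_eq!(trained.model.tokenize("a b"), vec!["a", "b"]);
}
```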
This allows testing versions that are not built in place; otherwise,
importing (or testing) from the package root fails without a develop
build.
Replace maturin with setuptools_rust, since maturin fails with the proper
project structure.
If a vocab file isn't provided, the supplied unk token (one different from
[UNK]) is ignored, and trying to encode an input string containing an unknown
token later throws:

Exception: WordPiece error: Missing [UNK] token from the vocabulary

(A hypothetical reduction of the behavior is sketched below.)
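A hypothetical reduction of the reported behavior (illustrative builder and
names, not the actual tokenizers code): when no vocab is given, the builder
falls back to a default model and drops the user-supplied unk token, so
encoding later fails with the [UNK] error above.

```rust
use std::collections::HashMap;

// Illustrative model and builder, mimicking the reported bug.
struct WordPiece {
    vocab: HashMap<String, u32>,
    unk_token: String,
}

struct WordPieceBuilder {
    vocab: Option<HashMap<String, u32>>,
    unk_token: String,
}

impl WordPieceBuilder {
    fn build(self) -> WordPiece {
        match self.vocab {
            Some(vocab) => WordPiece { vocab, unk_token: self.unk_token },
            // Bug: the default model is returned wholesale, discarding the
            // supplied unk token in favor of the "[UNK]" default.
            None => WordPiece { vocab: HashMap::new(), unk_token: "[UNK]".into() },
        }
    }
}

fn main() {
    let model = WordPieceBuilder { vocab: None, unk_token: "<unk>".into() }.build();
    // The custom token was ignored, and "[UNK]" is not in the (empty) vocab,
    // so encoding an unknown token later reports:
    //   WordPiece error: Missing [UNK] token from the vocabulary
    assert_eq!(model.unk_token, "[UNK]");
    assert!(!model.vocab.contains_key(&model.unk_token));
}
```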