* New version.
The actual release will happen *before* PyO3 0.17.2 because
the tests were run before then.
* Manylinux2014 is now necessary with Rust 1.64.
* Removing dead file.
* Checking that we can distribute with static python embedding for
manylinux
* Manylinux embedded interpreter.
* Building wheels manylinux with static embedding
* Better script.
* typo.
* Using a dummy feature?
* default features ?
* Back into order.
* Fixing manylinux ??.
* Local dir.
* Missing star.
* Makedir ?
* Monkey coding this.
* extension module ?
* Building with default features `RustExtension`.
* bdist_wheel + rustextension any better ?
* update rust-py version.
* Forcing extension module.
* No default features.
* Remove py37 out of spite
* Revert "Remove py37 out of spite"
This reverts commit 6ab7facd792b59c2e30be82fe42816d24c32cf0d.
* Really extraneous feature.
* Fix build wheels.
* Putting things back in place.
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
* add test that verifies that an exception is raised for the missing unknown token
* style
* add test tokens
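The missing-unknown-token check described above can be sketched with a toy stand-in. `ToyUnigramTokenizer` is a hypothetical name for illustration only, not the real `SentencePieceUnigramTokenizer` API: it simply validates that the chosen unknown token is present in the vocab and raises otherwise.

```python
# Illustrative sketch (hypothetical class, not the real library API):
# specifying an unknown token that is absent from the vocab should raise.
class ToyUnigramTokenizer:
    def __init__(self, vocab, unk_token):
        # vocab maps token -> log-probability
        if unk_token not in vocab:
            raise ValueError(
                f"unknown token {unk_token!r} must be present in the vocab"
            )
        self.vocab = vocab
        self.unk_token = unk_token
```

A test for this behavior simply constructs the tokenizer with a vocab that lacks the unknown token and asserts that `ValueError` is raised.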
* Draft functionality of visualization
* Added comments to make code more intelligible
* polish the styles
* Ensure colors are stable and comment the css
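One way to keep colors stable across runs, sketched here as an assumption about the approach (the `stable_color` helper is hypothetical, not the visualizer's actual code), is to derive the color deterministically from the token text rather than assigning colors in encounter order:

```python
import hashlib

def stable_color(token: str) -> str:
    """Derive a deterministic CSS hue from the token text, so the same
    token always gets the same color across runs and reloads."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    hue = digest[0] * 360 // 256  # map first digest byte to 0..359
    return f"hsl({hue}, 70%, 80%)"
```

Because the hue depends only on the token's bytes, re-rendering the same text never reshuffles the palette.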
* Code clean up
* Made visualizer importable and added some docs
* Fix styling
* implement comments from PR
* Fixed the regex for UNK tokens and examples in notebook
* Converted docs to google format
* Added a notebook showing multiple languages and tokenizers
* Added visual indication of chars that are tokenized with >1 token
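Detecting characters covered by more than one token can be done from the token offsets alone. A minimal sketch, assuming offsets are `(start, end)` character spans per token (the `multi_token_chars` helper is illustrative, not the visualizer's actual implementation):

```python
def multi_token_chars(text, offsets):
    """Return the set of character positions covered by more than one
    token, e.g. a single character split across several byte-level tokens.

    `offsets` is one (start, end) span per token, in character positions.
    """
    counts = [0] * len(text)
    for start, end in offsets:
        for i in range(start, end):
            counts[i] += 1
    return {i for i, c in enumerate(counts) if c > 1}
```

For example, a multi-byte character emitted as two byte-level tokens yields two spans over the same position, which the counter flags.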
* Reorganize things a bit and fix import
* Update docs
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
* Removing all pre_tokenizer logic from Unigram algorithm.
* Improving the parity check *a lot*.
- We can now detect a lot more errors
- Special cases have been added temporarily.
* Adding 2 new normalizers that mimic spm's default behavior.
* Adding `encoding_optimized` version of the `encode` algorithm.
- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
allocation of the full results.
* Trie<char> -> Trie<u8> Another improvement on speed.
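The two trie optimizations above can be sketched together in a minimal way: keys stored as bytes rather than chars, and `common_prefix_search` implemented as a lazy iterator so callers that stop early never pay for the full result list. `ByteTrie` is an illustrative toy, not the crate's actual data structure:

```python
# Hedged sketch of the described optimizations: byte-keyed trie nodes,
# and a generator-based common_prefix_search that avoids allocating
# the complete list of matches up front.
class ByteTrie:
    def __init__(self):
        self.root = {}

    def insert(self, word: str):
        node = self.root
        for b in word.encode("utf-8"):
            node = node.setdefault(b, {})
        node[None] = word  # sentinel marking the end of a stored key

    def common_prefix_search(self, text: str):
        """Lazily yield every stored key that is a prefix of `text`."""
        node = self.root
        for b in text.encode("utf-8"):
            if None in node:
                yield node[None]
            node = node.get(b)
            if node is None:
                return
        if None in node:
            yield node[None]
```

A caller that only needs the first match can use `next(trie.common_prefix_search(text), None)` and stop traversal immediately.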
* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.
* Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`.
- It will be used for direct compatibility with `Spm` and replace all
their custom rules by using directly the normalizer spec embedded
within spm files, removing all need for any rules for us.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.
* Fixing a bug with Precompiled normalizer.
* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.
It seems even a very hand-crafted trie does not prevent one from shooting
oneself in the foot. Sorry, future reader.
* Keep API stable for this PR (change of the API should come later #409).
- Removed sentencepiece_model_pb2 from binding and add instructions to
make `from_spm` work.
* Adding model check in `from_spm`.
* Addressing @n1t0's comments.
* Adding a check to make sure alignments stay correct.
Also added a bit more documentation on how Precompiled works.
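The alignment invariant being checked can be sketched as follows. This is a toy version of the idea, assuming alignments are one `(start, end)` span into the original string per normalized character (the real check lives in the Rust normalizer; `check_alignments` is a hypothetical helper):

```python
def check_alignments(original: str, normalized: str, alignments):
    """Sanity-check that each normalized character maps back to a valid
    (start, end) span of the original string."""
    assert len(alignments) == len(normalized), "one span per normalized char"
    for start, end in alignments:
        assert 0 <= start <= end <= len(original), "span out of bounds"
    return True
```

For instance, normalizing `" Hi"` to `"Hi"` keeps spans `(1, 2)` and `(2, 3)` pointing into the original, so offset-based highlighting still lines up after normalization.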
* Extracting `Precompiled` into its own `spm_precompiled` crate.
* Using ranges in `do_nmt`.
This allows testing versions not built in-place. Otherwise
importing (or testing) in the package root fails without develop
builds.
Replace maturin with setuptools_rust since maturin fails with
proper project structure.