* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into a single string
- Strip will remove n characters from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing (see the sketch below).
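A minimal sketch of how these compose through the Python bindings. The `▁` Metaspace marker and the exact `Strip` keyword arguments are assumptions here, not taken from the lines above:

```python
from tokenizers import decoders

# Replace the Metaspace marker with spaces, fuse all tokens into one
# string, then strip one leading character.
decoder = decoders.Sequence(
    [
        decoders.Replace("▁", " "),
        decoders.Fuse(),
        decoders.Strip(content=" ", left=1, right=0),
    ]
)

print(decoder.decode(["▁Hello", "▁world"]))  # "Hello world"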
- Note: Added a new dependency for better parsing of decoders.
This is because untagged enums can match anything; the `MustBe` marker
ensures there's no confusion between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.
* Fixing pickling/unpickling (using default args).
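As a quick illustration of the intent, assuming the new decoders become picklable once default args are wired up (a sketch, not the actual test):

```python
import pickle

from tokenizers import decoders

# Round-trip through pickle; this is what broke without default args.
dec = decoders.Strip(content=" ", left=1, right=0)
restored = pickle.loads(pickle.dumps(dec))
```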
* Stub.
* Black.
* Fixing node.
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using byte tokens like `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into a string whenever possible, showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`);
see the sketch after this list.
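A short sketch of both pieces together; the parameter name `byte_fallback` is from the description above, the rest is assumed for illustration:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE

# `byte_fallback=True` makes BPE emit byte tokens such as <0x61>
# instead of the unk token (the <0xXX> tokens must exist in the vocab).
# `vocab` and `merges` here are hypothetical placeholders:
# tokenizer = Tokenizer(BPE(vocab, merges, byte_fallback=True))

decoder = decoders.ByteFallback()
print(decoder.decode(["<0x61>"]))  # "a" (0x61 is valid UTF-8)
print(decoder.decode(["<0xE2>"]))  # "�" (a lone 0xE2 is not valid UTF-8)
```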
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order matters because of untagged enums.
* Remove prints.
* Fixing byte fallback for bytes < 16.
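Presumably the issue was formatting bytes below 16 without a leading zero; byte tokens need two hex digits. This interpretation is an assumption based on the `<0x61>` format above:

```python
# Bytes below 16 need a leading zero to match the <0xXX> token format.
b = 0x0A
bad = f"<0x{b:X}>"     # "<0xA>"  -> would never match the vocab entry
good = f"<0x{b:02X}>"  # "<0x0A>"
```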
* [fix] Use unk_token.
In SentencePieceBPETokenizer, when vocab or merges is None, unk_token cannot be used.
* [fix] Also handle the case where unk_token is None (sketch below).
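A hedged sketch of the resulting guard in `SentencePieceBPETokenizer.__init__` (the helper name and exact arguments are assumptions):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

def build(vocab=None, merges=None, unk_token="<unk>"):
    # Only pass vocab/merges when both are provided; otherwise build an
    # empty BPE that still knows its unk_token.
    if vocab is not None and merges is not None:
        tokenizer = Tokenizer(BPE(vocab, merges, unk_token=str(unk_token)))
    else:
        tokenizer = Tokenizer(BPE(unk_token=str(unk_token)))

    # Only register unk_token as special if it exists in the vocab,
    # which also covers the unk_token=None case.
    if tokenizer.token_to_id(str(unk_token)) is not None:
        tokenizer.add_special_tokens([str(unk_token)])
    return tokenizer
```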
* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* [FIX] In CharBPETokenizer, use unk_token.
In CharBPETokenizer, when vocab or merges is None, unk_token cannot be used.
* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Fixing conda build?
* Reduce the scope to speedup testing.
* Reduce more.
* Trying to link to conda lib.
* Trying to enable `pkg-config` in the conda env.
* Really publish.
* Update conda builds.
* Remove 3.11
* Putting releases back onto release track.
* Include license file in Rust crate
* Ignore security warning.
* Also for python.
* Upgrading Ubuntu version.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Adding rust audit.
* Update clap version + derive_builder (they clashed).
* Ignoring a specific CVE that can safely be ignored:
https://github.com/Azure/iot-identity-service/issues/481
* Updating python lock.
* Revert `derive-builder` update.
* Adding back help msg.
* New version.
The actual release will happen *before* the PyO3 0.17.2 upgrade because
the tests were run before it.
* Manylinux2014 necessary now with Rust 1.64.
* Fixing roberta type ids (everything is zero).
* We need to fix type_ids for all sequences even when not changing anything else.
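Roughly the expectation being restored; the model name is used purely as an example:

```python
from tokenizers import Tokenizer

# RoBERTa does not use segment embeddings: every position should get
# type_id 0, even for the second sequence of a pair.
tok = Tokenizer.from_pretrained("roberta-base")
enc = tok.encode("first sequence", "second sequence")
assert all(t == 0 for t in enc.type_ids)
```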
* Fixing tests, hopefully better.
* Removing dead file.
* Checking that we can distribute with static Python embedding for manylinux.
* Manylinux embedded interpreter.
* Building wheels manylinux with static embedding
* Better script.
* Typo.
* Using a dummy feature?
* Default features?
* Back into order.
* Fixing manylinux??
* Local dir.
* Missing star.
* Makedir?
* Monkey coding this.
* Extension module?
* Building with default features `RustExtension`.
* bdist_wheel + RustExtension, any better?
* Update rust-py version.
* Forcing extension module.
* No default features.
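For context, the wiring under test is roughly the setuptools-rust setup below. This is a sketch: the extension target name and the feature flag are assumptions, not read from the commits above:

```python
from setuptools import setup
from setuptools_rust import Binding, RustExtension

setup(
    name="tokenizers",
    # Build the PyO3 extension; pyo3/extension-module avoids linking
    # against libpython, which matters for manylinux wheels.
    rust_extensions=[
        RustExtension(
            "tokenizers.tokenizers",
            binding=Binding.PyO3,
            features=["pyo3/extension-module"],
        )
    ],
    zip_safe=False,
)
```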
* Remove py37 out of spite
* Revert "Remove py37 out of spite"
This reverts commit 6ab7facd792b59c2e30be82fe42816d24c32cf0d.
* Really extraneous feature.
* Fix build wheels.
* Putting things back in place.