* CD backports
Follows huggingface/safetensors#317.
* fix node bindings?
`cargo check` doesn't work on my local configuration from `tokenizers/bindings/node/native`.
I don't think it will be a problem, but I have difficulty telling.
* backport #315
* safetensors#317 backports
* Makes `decode` and `decode_batch` work on borrowed content.
* Make `decode_batch` work with borrowed content.
* Fix lint.
* Attempt to map it into Node.
* Second attempt.
* Step by step.
* One more step.
* Fix lint.
* Please ...
* Removing collect.
* Revert "Removing collect."
This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string.
- Strip will remove n chars from the left or right.
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.
- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no issue between Fuse and ByteFallback.
Since both are new, the chance of backward incompatibility is low.
* Fixing pickling/unpickling (using default args).
* Stub.
* Black.
* Fixing node.
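
A minimal pure-Python sketch of the decoder chain described above, to show what `Sequence(Replace("_", " "), Fuse(), Strip(1, 0))` is meant to do. It does not use the `tokenizers` bindings; the helper names (`replace`, `fuse`, `strip`, `sequence`) are hypothetical, and `strip(left, right)` follows the "remove n chars from left or right" wording above rather than any actual library signature.

```python
from typing import Callable, List

# A decoder step maps a list of token strings to a new list of token strings.
Decoder = Callable[[List[str]], List[str]]


def replace(pattern: str, content: str) -> Decoder:
    """Replace every occurrence of `pattern` inside each token."""
    return lambda tokens: [t.replace(pattern, content) for t in tokens]


def fuse() -> Decoder:
    """Concatenate all tokens into a single string (a one-element list)."""
    return lambda tokens: ["".join(tokens)]


def strip(left: int, right: int) -> Decoder:
    """Drop `left` chars from the start and `right` chars from the end of each token."""
    return lambda tokens: [t[left:len(t) - right] for t in tokens]


def sequence(*decoders: Decoder) -> Decoder:
    """Apply the decoder steps in order."""
    def apply(tokens: List[str]) -> List[str]:
        for decoder in decoders:
            tokens = decoder(tokens)
        return tokens
    return apply


# Assumed stand-in for the `Metaspace` behaviour mentioned above.
metaspace_like = sequence(replace("_", " "), fuse(), strip(1, 0))
decoded = metaspace_like(["_Hello", "_world"])
```

Fusing before stripping matters here: the leading space only needs to be removed once, from the start of the whole string, not from every token.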
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into a string whenever possible, showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`).
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
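
A rough pure-Python sketch of the two ByteFallback halves described above: spelling an unknown piece as `<0xXX>` byte tokens, and turning runs of such tokens back into text. The function names are hypothetical and this is not the Rust implementation. Emitting one � per undecodable byte token is an assumption about the exact failure behavior. The two-digit zero padding is how this sketch handles bytes below 16.

```python
from typing import List, Optional

HEX_DIGITS = "0123456789abcdefABCDEF"


def byte_fallback_tokens(piece: str) -> List[str]:
    """Spell a piece as one `<0xXX>` token per UTF-8 byte (always two hex digits)."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


def parse_byte_token(token: str) -> Optional[int]:
    """Return the byte value of a `<0xXX>` token, or None for any other token."""
    if len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
        digits = token[3:5]
        if all(c in HEX_DIGITS for c in digits):
            return int(digits, 16)
    return None


def byte_fallback_decode(tokens: List[str]) -> List[str]:
    """Merge runs of byte tokens back into strings; other tokens pass through."""
    out: List[str] = []
    pending: List[int] = []  # bytes collected from the current run of byte tokens

    def flush() -> None:
        if not pending:
            return
        try:
            out.append(bytes(pending).decode("utf-8"))
        except UnicodeDecodeError:
            # Assumption: one replacement character per undecodable byte token.
            out.extend("\ufffd" for _ in pending)
        pending.clear()

    for token in tokens:
        value = parse_byte_token(token)
        if value is None:
            flush()
            out.append(token)
        else:
            pending.append(value)
    flush()
    return out
```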
* Adding rust audit.
* Update clap version + derive_builder (they clashed).
* Ignoring specific CVE which can be ignored
https://github.com/Azure/iot-identity-service/issues/481
* Updating python lock.
* Revert `derive-builder` update.
* Adding back help msg.
* New version.
The actual release will happen *before* PyO3 0.17.2 because
the tests were run before then.
* Manylinux2014 necessary now with Rust 1.64.
* Update README.md
Add reference to normalizer blog post
* Update lib.rs
* Fixing PR + clippy on node.
* Update readme to match docstring.
* Other clippy warning.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* TMP.
* Adding support for pickling Python trainers.
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
* tokenizer.save has the wrong arguments compared to documentation
* Fixing doc of `save` function.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
* feat(tokenizers): add truncate test case
* !feat(tokenizer): truncate right
* refacto(tokenizers): clippy
* feat(bindings): update bindings for truncate()
* fix(tokenizers): remove unsafe code
* refacto(tokenizers): truncate direction
* truncate direction enum
* compute parts ranges beforehand
* 2n space because encoding is dropped at the end of procedure
* update bindings
* add pip install in python bindings' make test
* fix(node): clippy asks to use unwrap_or_else
* fix(node): lint
* refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)>
* refacto(bindings): add match syntax
* refacto(tokenizers): use mem::replace instead of mem::swap
* refacto(tokenizers): assign value the normal way
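
A small pure-Python sketch of the idea behind the truncate changes above: a direction enum, with the `(start, stop)` ranges of every part computed beforehand. The names (`TruncationDirection`, `part_ranges`, `truncate`) and the stride handling are assumptions for illustration, not the crate's actual code.

```python
from enum import Enum
from typing import List, Tuple


class TruncationDirection(Enum):
    LEFT = "left"
    RIGHT = "right"


def part_ranges(length: int, max_len: int, stride: int,
                direction: TruncationDirection) -> List[Tuple[int, int]]:
    """Compute the (start, stop) range of the kept part, then of each overflowing part."""
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    if length <= max_len:
        return [(0, length)]
    step = max_len - stride  # consecutive parts overlap by `stride` items
    ranges: List[Tuple[int, int]] = []
    if direction is TruncationDirection.RIGHT:
        start = 0
        while True:
            stop = min(start + max_len, length)
            ranges.append((start, stop))
            if stop == length:
                break
            start += step
    else:
        stop = length
        while True:
            start = max(stop - max_len, 0)
            ranges.append((start, stop))
            if start == 0:
                break
            stop -= step
    return ranges


def truncate(ids: List[int], max_len: int, stride: int = 0,
             direction: TruncationDirection = TruncationDirection.RIGHT):
    """Return (kept, overflowing): the first range is kept, the rest overflow."""
    parts = [ids[a:b] for a, b in part_ranges(len(ids), max_len, stride, direction)]
    return parts[0], parts[1:]
```

Truncating right keeps the beginning of the sequence and truncating left keeps the end; in both cases the overflowing parts walk away from the kept part.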
* Switch git dependencies in Cargo.toml back to regular versions
The rayon-cond issue turned out to be a rustc bug that has been fixed for a while
(see cuviper/rayon-cond#2), so we can revert the git dependency.
numpy has released the commit in question as part of 0.12.
* Also update Cargo.lock files
Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
* Rust - add a CTCDecoder as a separate mod
* Adding bindings to Node + Python.
* Clippy update.
* Stub.
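
For context, a minimal sketch of greedy CTC-style decoding in plain Python: collapse consecutive repeats, drop the padding token, and turn the word delimiter into a space. The function name and the default `pad_token` and `word_delimiter` values are assumptions for illustration; this is not the `CTCDecoder` added in Rust.

```python
from itertools import groupby
from typing import List


def ctc_decode(tokens: List[str], pad_token: str = "<pad>",
               word_delimiter: str = "|") -> str:
    """Greedy CTC-style cleanup of a per-frame token sequence."""
    # 1. Collapse runs of identical consecutive tokens into one.
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Drop the padding (blank) token, which only separates true repeats.
    kept = [token for token in collapsed if token != pad_token]
    # 3. Map the word delimiter to a space and join the rest.
    text = "".join(" " if token == word_delimiter else token for token in kept)
    return text.strip()
```

The padding token has to be removed after collapsing, otherwise a genuine double letter such as the `l l` in "hello" would be merged into one.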
* Fixing roberta.json URLs.
* Moving test files to hf.co.
* Update cargo check and clippy to 1.52.
* Inner ':' is actually used for domains in Sphinx.
Making `domain` work correctly was just too much work so I went the easy
way and have global roles for the custom rust extension.
* Update struct naming and docs
* Update changelog
Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
* start playing around
* make a first version
* refactor
* apply make format
* add python bindings
* add some python binding tests
* correct pre-tokenizers
* update auto-generated bindings
* lint python bindings
* add node code
* add split to docs
* refactor python binding a bit
* cargo fmt
* clippy and fmt in node
* quick updates and fixes
* Oops
* Update node typings
* Update changelog
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>