* Update docs-check.yml
Bump actions/setup-python to v5
Bump python-version to 3.12 (default on ubuntu-latest)
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
* Update node-release.yml
Bump actions/setup-python to v5
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4
Bump actions/upload-artifact to v4
Bump actions/download-artifact to v4
* Update node.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4
* Update python-release-conda.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump conda-incubator/setup-miniconda to v3
* Update python-release.yml
Bump actions/setup-python to v5
Bump actions/download-artifact to v4
* Update rust-release.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
* Update stale.yml
Bump actions/stale to v9
* Update python.yml
Bump actions/setup-python to v5
* refactor: lift cloning to caller
* refactor: do not elide lifetimes as in Rust 2018
* fix: unsound use of env::set_var, which a breaking stdlib change now marks as unsafe
It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
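
A minimal sketch of the pattern described in the fix above, assuming a boolean setting: the override is stored inside the library and the environment is only ever read. The variable name `EXAMPLE_PARALLELISM` and both function names are hypothetical and are not taken from the commit.

```rust
use std::env;
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical environment variable name, used only for illustration.
const ENV_VAR: &str = "EXAMPLE_PARALLELISM";

// Library-internal override: 0 = not set, 1 = forced off, 2 = forced on.
static OVERRIDE: AtomicU8 = AtomicU8::new(0);

/// Record an override inside the library instead of calling `env::set_var`.
pub fn set_parallelism(enabled: bool) {
    OVERRIDE.store(if enabled { 2 } else { 1 }, Ordering::SeqCst);
}

/// Resolve the setting: the internal override wins, otherwise the
/// environment is read (never written), defaulting to `true`.
pub fn get_parallelism() -> bool {
    match OVERRIDE.load(Ordering::SeqCst) {
        1 => false,
        2 => true,
        _ => match env::var(ENV_VAR) {
            // Treat common "falsy" spellings as disabled.
            Ok(v) => !matches!(
                v.to_lowercase().as_str(),
                "" | "off" | "false" | "f" | "no" | "n" | "0"
            ),
            Err(_) => true,
        },
    }
}
```

Because the override lives in an atomic, callers can change it from any thread without touching process-global environment state.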
* [Breaking Change] Perf improvement 16% by removing offsets.
Offsets are always calculated in Python land.
By skipping that calculation, we save 16% of the runtime.
This is not the full extent of the possible gain, because offsets are
still calculated in bytes.
* Required features.
* Remove clippy error.
* Make it non breaking and still show perf improvement.
* Even faster without offsets.
* Update doc.
* Fmt.
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fmt.
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
* Merges cannot handle tokens containing spaces.
This fixes this while keeping backward support.
We don't want to merge that blindly.
* Update the tests.
* Fixing clippy.
* Add a test with spaces in the token/merge.
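
One way to picture the backward-compatible handling described above is to accept two merge representations: a legacy single string split on a space, and an explicit pair whose tokens may contain spaces. This is only a sketch under that assumption. `MergeRepr` and `to_pair` are hypothetical names and do not reflect the actual change.

```rust
/// Hypothetical representation of one BPE merge rule as loaded from a file.
#[derive(Debug, Clone, PartialEq)]
pub enum MergeRepr {
    /// Legacy form: both tokens in one string, separated by a single space.
    Legacy(String),
    /// Explicit form: tokens stored separately, so either may contain spaces.
    Pair(String, String),
}

/// Convert either representation into a `(left, right)` token pair.
/// A legacy string is rejected unless it splits into exactly two parts,
/// so an ambiguous entry is never merged blindly.
pub fn to_pair(merge: &MergeRepr) -> Result<(String, String), String> {
    match merge {
        MergeRepr::Pair(a, b) => Ok((a.clone(), b.clone())),
        MergeRepr::Legacy(s) => {
            let parts: Vec<&str> = s.split(' ').collect();
            if parts.len() != 2 {
                return Err(format!("ambiguous legacy merge: {:?}", s));
            }
            Ok((parts[0].to_string(), parts[1].to_string()))
        }
    }
}
```

With this shape, old files that use the space-separated form keep loading, while tokens containing spaces are only expressible through the explicit pair.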
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, as it is too specific to live on its own.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"
This reverts commit 25aee8b88c.
* don't remove audit
* deprecate id_to_token
* use simple id to token
* don't break id_to_token since we are deprecating anyways?