b4d8dfc3b2
Use ApiBuilder::from_env() in from_pretrained function ( #1737 )
...
Use ApiBuilder::from_env() for builder initialization
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
2025-05-27 12:20:17 +02:00
e5d781d5b9
update pyo3 and rust-numpy depends for no-gil/free-threading compat ( #1774 )
...
Signed-off-by: root <root@gpu-xl.lxd >
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
2025-05-27 11:31:58 +02:00
01f8bc834c
clippy ( #1781 )
...
* clippy
* fmtr
* rutc?
* fix onig issue
* up
* decode stream default
* jump a release for cargo audit ...
* more clippy stuff
* clippy?
* proper style
* fmt
2025-05-27 11:30:32 +02:00
23e7e42adf
Fix data path in test_continuing_prefix_trainer_mismatch ( #1747 )
2025-05-27 08:48:27 +02:00
fd1b361b76
Bump http-proxy-middleware in /tokenizers/examples/unstable_wasm/www ( #1762 )
...
Bumps [http-proxy-middleware](https://github.com/chimurai/http-proxy-middleware ) from 2.0.6 to 2.0.9.
- [Release notes](https://github.com/chimurai/http-proxy-middleware/releases )
- [Changelog](https://github.com/chimurai/http-proxy-middleware/blob/v2.0.9/CHANGELOG.md )
- [Commits](https://github.com/chimurai/http-proxy-middleware/compare/v2.0.6...v2.0.9 )
---
updated-dependencies:
- dependency-name: http-proxy-middleware
dependency-version: 2.0.9
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-27 08:29:50 +02:00
cc01186fd7
Fix type notation of merges in BPE Python binding ( #1766 )
2025-05-27 08:23:58 +02:00
f1faec1756
Fix typos in strings and comments ( #1770 )
2025-05-27 08:17:36 +02:00
67db0cd1dd
Fix no-onig no-wasm builds ( #1772 )
...
Now, you can build with:
```
--no-default-features --features=fancy-regex
```
Which previously didn't work. You had to enable the `unstable_wasm`
flag.
I think using `fancy_regex` without wasm is a valid use-case, as I've
seen extremely slow build times using `onig`.
See: https://github.com/huggingface/tokenizers/issues/1730
Onig also breaks, sometimes, with compiler updates.
See: https://github.com/huggingface/tokenizers/pull/1771
Closes https://github.com/huggingface/tokenizers/issues/1729
2025-05-27 07:44:20 +02:00
759d7aa77a
replace lazy_static with stabilized std::sync::LazyLock in 1.80 ( #1739 )
2025-03-18 17:33:44 +01:00
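A minimal sketch of the kind of change the commit title above describes, assuming a lazily built global table. The names `BYTE_NAMES` and `byte_name` are hypothetical and not taken from the crate. `std::sync::LazyLock` (stable since Rust 1.80) runs the initializer once, on first access, which is the role `lazy_static!` played before.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Hypothetical lookup table, for illustration only (not from the crate).
// With lazy_static this would have been written roughly as:
//   lazy_static! { static ref BYTE_NAMES: HashMap<u8, &'static str> = { ... }; }
static BYTE_NAMES: LazyLock<HashMap<u8, &'static str>> = LazyLock::new(|| {
    // The closure runs at most once, the first time BYTE_NAMES is dereferenced.
    let mut m = HashMap::new();
    m.insert(b'\n', "newline");
    m.insert(b'\t', "tab");
    m
});

/// Returns a readable name for a few control bytes, if one is known.
pub fn byte_name(b: u8) -> Option<&'static str> {
    BYTE_NAMES.get(&b).copied()
}
```

Call sites do not change: the static dereferences to the inner value just as a `lazy_static` item did, so the migration mostly amounts to rewriting declarations and dropping the dependency.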
4383a25787
Update the release builds following 0.21.1. ( #1746 )
...
* Update the release builds following 0.21.1.
* Clippy fix.
2025-03-13 13:01:41 +01:00
4f1a810aa2
Add rustls-tls feature ( #1732 )
2025-02-11 10:57:05 +01:00
fbe3365a13
Update metadata as Python3.7 and Python3.8 support was dropped ( #1724 )
...
* Update metadata as python3.7 and python3.8 support was dropped
* Format pyproject.toml: unify quotes and indentation
2025-02-11 10:52:59 +01:00
c45aebd102
🚨 Support updating template processors ( #1652 )
...
* current updates
* simplify
* set_item works, but `tokenizer._tokenizer.post_processor[1].single = ["$0", "</s>"]` does not !
* fix: `normalizers` deserialization and other refactoring
* fix: `pre_tokenizer` deserialization
* feat: add `__len__` implementation for `normalizer::PySequence`
* feat: add `__setitem__` impl for `normalizers::PySequence`
* feat: add `__setitem__` impl to `pre_tokenizer::PySequence`
* feat: add `__setitem__` impl to `post_processor::PySequence`
* test: add normalizer sequence setter check
* refactor: allow unused `processors::setter` macro
* test: add `__setitem__` test for processors & pretok
* refactor: `unwrap` -> `PyException::new_err()?`
* refactor: fmt
* refactor: remove unnecessary `pub`
* feat(bindings): add missing getters & setters for pretoks
* feat(bindings): add missing getters & setters for processors
* refactor(bindings): rewrite RwLock poison error msg
* refactor: remove debug print
* feat(bindings): add description as to why custom deser is needed
* feat: make post proc sequence elements mutable
* fix(binding): serialization
---------
Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com >
2025-01-28 14:58:35 +01:00
e7ed39de3c
Fixing NormalizedString append when normalized is empty. ( #1717 )
...
Co-authored-by: Anantha Kandrapu <anantkan@amazon.com >
2025-01-09 17:41:32 +01:00
0ff2ab0f64
Fixing the stream by removing the read_index altogether. ( #1716 )
...
* Fixing the stream by removing the read_index altogether.
* Moving the test location because.. Windows.
* Ok whatever.
* Rust 1.84
* Fmt.
2025-01-09 17:41:15 +01:00
862d1a346a
Fix panic in DecodeStream::step due to incorrect index usage ( #1699 )
...
* Add a failing test for step_decode_stream
* Improve test case for test_decode_stream_step_no_panic
* Fix subtract with overflow issue in step_decode_stream
2025-01-09 13:24:04 +01:00
c04b97aab1
Update documentation of Rust feature ( #1711 )
...
* Update documentation of Rust feature
* Synchronize README.md and src/lib.rs
2025-01-09 12:08:45 +01:00
bdfc38b78d
Fix typos ( #1715 )
...
* Fix typos
Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com >
* Update docs/source/quicktour.rst
* Update docs/source-doc-builder/quicktour.mdx
---------
Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com >
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2025-01-09 11:53:20 +01:00
6945933829
update Split pretokenizer docstrings ( #1701 )
2025-01-08 12:35:52 +01:00
166edd87c8
Fixing the README. ( #1714 )
2025-01-08 12:31:17 +01:00
3a6504d274
Upgrade to PyO3 0.23 ( #1708 )
...
* Upgrade to PyO3 0.23
* Macos-12 deprecated?
* Clippy.
* Clippy auto elision.
2024-12-31 18:36:01 +01:00
555d44c47a
Add feature flag hint to README.md, fixes #1633 ( #1709 )
2024-12-30 17:01:53 +01:00
24d29f498d
Update dev version and pyproject.toml ( #1693 )
...
* update pyproject.toml
* update py dev version
2024-11-27 16:01:48 +01:00
1bf2a66b80
v0.20.4-dev0
2024-11-27 10:07:49 +01:00
eb4cc86d4e
Bump cross-spawn from 6.0.5 to 6.0.6 in /bindings/node ( #1687 )
...
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn ) from 6.0.5 to 6.0.6.
- [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/v6.0.6/CHANGELOG.md )
- [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v6.0.5...v6.0.6 )
---
updated-dependencies:
- dependency-name: cross-spawn
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-25 10:04:06 +01:00
ac34660e44
Fix encode_batch and encode_batch_fast to accept ndarrays again ( #1679 )
...
* Fix encode_batch and encode_batch_fast to accept ndarrays again
* Fix clippy
---------
Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com >
2024-11-21 11:55:11 +01:00
f0c48bd89a
Update README.md with install from source
2024-11-15 21:51:39 +01:00
cc5fb01a2f
Decode stream python ( #1678 )
...
* Python binding for decode stream
Different API because Python cannot handle lifetimes properly.
* Clippy.
2024-11-15 12:06:22 +01:00
500db282a8
Adding an API for decode streaming. ( #1677 )
...
* Adding an API for decode streaming.
* Add another missing test case (proving the effect of state.)
* Elide lifetime.
* Elide bis.
* Fixing the streaming implementation.
* Adding more docs.
* End of list.
* Fix internal link.
* Skip doctest on Windows (no tokenizer file because no make)
2024-11-15 06:02:38 +01:00
f4c9fd7f40
Testing ABI3 wheels to reduce number of wheels ( #1674 )
...
* Testing ABI3 wheels to reduce number of wheels
* No need for py-clone anymore.
* Upgrade python versions.
* Remove those flakes.
* Promoting new CI + Fixing secret.
2024-11-15 06:02:22 +01:00
5aa9f6cff0
Disable caching for long strings. ( #1676 )
2024-11-07 14:36:27 +01:00
c6b5c3eab7
More cache options. ( #1675 )
...
* More cache options.
* Fixing error messages.
2024-11-06 11:12:09 +01:00
1740bff7a6
Revert "Upgrade python versions."
...
This reverts commit b81ec467a6.
2024-11-06 13:18:03 +08:00
b81ec467a6
Upgrade python versions.
2024-11-06 13:17:22 +08:00
57884ebaa2
[MINOR:TYPO] Fix docstrings ( #1653 )
...
* [MINOR:TYPO] Update pre_tokenizers.rs
* [MINOR:TYPO] Update __init__.pyi
2024-11-05 16:25:06 +01:00
5e223ceb48
fix pylist ( #1673 )
...
* fix pylist
* add comment about why we use PySequence
* style
* fix encode batch fast as well
* Update bindings/python/src/tokenizer.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* fix with capacity
* stub :)
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-11-05 16:24:23 +01:00
0f3a3f957e
update workflow
2024-11-04 18:38:32 +01:00
7c36735389
v0.20.2-dev.0 version
2024-11-04 18:36:40 +01:00
6c15458868
Bump actions versions ( #1669 )
...
* Update docs-check.yml
Bump actions/setup-python to v5
Bump python-version to 3.12 (default on ubuntu-latest)
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
* Update node-release.yml
Bump actions/setup-python to v5
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4
Bump actions/upload-artifact to v4
Bump actions/download-artifact to v4
* Update node.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4
* Update python-release-conda.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump conda-incubator/setup-miniconda to v3
* Update python-release.yml
Bump actions/setup-python to v5
Bump actions/download-artifact to v4
* Update rust-release.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
* Update stale.yml
Bump actions/stale to v9
* Update python.yml
Bump actions/setup-python to v5
2024-11-01 10:19:35 +01:00
6ade8c2d21
PyO3 0.22 ( #1665 )
...
* PyO3 0.22
* Fix python stubs
* Remove name arg from PyModel::save Python signature
---------
Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com >
2024-11-01 10:17:23 +01:00
41e0eaa561
Bump actions/checkout to v4 ( #1667 )
...
Signed-off-by: tinyboxvk <tinyboxvk@users.noreply.github.com >
2024-10-29 14:32:07 +01:00
5512a424bf
Add safety comments ( #1651 )
...
* Unsafe comment for from_u32_unchecked
* Add safety comments and type assertion for HashSet parallel iteration
* Add safety comment for String splice
* fixes
* fmt
* pos
2024-10-29 09:44:06 +01:00
6ea758872d
Unsound call of set_var ( #1664 )
...
* refactor: lift cloning to caller
* refactor: do not elide lifetimes as in Rust 2018
* fix: unsound use of env::set_var, was breaking stdlib change to make unsafe
It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
2024-10-25 15:44:30 +02:00
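The approach described in the commit message above (hold a copy inside the library and only read from the environment) could look roughly like the sketch below. The variable name `EXAMPLE_PARALLELISM` and the functions `set_parallelism` and `parallelism_enabled` are hypothetical, chosen for illustration; this is not the crate's actual code.

```rust
use std::env;
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical names throughout; a sketch of the pattern, not the crate's API.
const UNSET: u8 = 0;
const OFF: u8 = 1;
const ON: u8 = 2;

// The override lives inside the library, so nothing ever calls env::set_var.
static PARALLELISM_OVERRIDE: AtomicU8 = AtomicU8::new(UNSET);

/// Records an explicit override without touching the process environment.
pub fn set_parallelism(enabled: bool) {
    PARALLELISM_OVERRIDE.store(if enabled { ON } else { OFF }, Ordering::SeqCst);
}

/// Prefers the in-process override; otherwise only *reads* the environment.
pub fn parallelism_enabled() -> bool {
    match PARALLELISM_OVERRIDE.load(Ordering::SeqCst) {
        ON => true,
        OFF => false,
        // No override set: fall back to the environment, defaulting to enabled.
        _ => match env::var("EXAMPLE_PARALLELISM") {
            Ok(v) => !matches!(
                v.to_lowercase().as_str(),
                "" | "off" | "false" | "f" | "no" | "n" | "0"
            ),
            Err(_) => true,
        },
    }
}
```

Because the override is an atomic owned by the library, it can be changed from any thread without the data race that makes mutating the environment unsound in multi-threaded programs.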
a8738a95d1
Arg name correction: auth_token -> token ( #1621 )
...
* Arg name correction: auth_token -> token
* Arg name correction in .rs: auth_token -> token
* update from_pretrained.rs file as well
---------
Co-authored-by: Rene Ravenel <rene@Renes-MacBook-Pro.local >
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com >
2024-10-24 16:32:09 +02:00
9b77c054ef
Fix off-by-one error in tokenizer::normalizer::Range::len ( #1638 )
2024-10-14 08:40:17 +02:00
bce68a60cb
Bump cookie and express in /tokenizers/examples/unstable_wasm/www ( #1648 )
...
Bumps [cookie](https://github.com/jshttp/cookie ) and [express](https://github.com/expressjs/express ). These dependencies needed to be updated together.
Updates `cookie` from 0.6.0 to 0.7.1
- [Release notes](https://github.com/jshttp/cookie/releases )
- [Commits](https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.1 )
Updates `express` from 4.21.0 to 4.21.1
- [Release notes](https://github.com/expressjs/express/releases )
- [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md )
- [Commits](https://github.com/expressjs/express/compare/4.21.0...4.21.1 )
---
updated-dependencies:
- dependency-name: cookie
dependency-type: indirect
- dependency-name: express
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 15:30:24 +02:00
51826532d4
push new dev version
2024-10-10 12:00:16 +02:00
557fde76d8
style: simplify string formatting for readability ( #1632 )
2024-10-04 13:11:50 +02:00
3d51a1695f
Fix documentation build ( #1642 )
...
* use v4
* fix ruff
* style
2024-10-01 14:48:02 +02:00
294ab86fe0
Bump webpack in /tokenizers/examples/unstable_wasm/www ( #1641 )
...
Bumps [webpack](https://github.com/webpack/webpack ) from 5.76.0 to 5.95.0.
- [Release notes](https://github.com/webpack/webpack/releases )
- [Commits](https://github.com/webpack/webpack/compare/v5.76.0...v5.95.0 )
---
updated-dependencies:
- dependency-name: webpack
dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:17:23 +02:00