b81ec467a6
Upgrade python versions.
2024-11-06 13:17:22 +08:00
57884ebaa2
[MINOR:TYPO] Fix docstrings ( #1653 )
...
* [MINOR:TYPO] Update pre_tokenizers.rs
* [MINOR:TYPO] Update __init__.pyi
2024-11-05 16:25:06 +01:00
5e223ceb48
fix pylist ( #1673 )
...
* fix pylist
* add comment about why we use PySequence
* style
* fix encode batch fast as well
* Update bindings/python/src/tokenizer.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* fix with capacity
* stub :)
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-11-05 16:24:23 +01:00
0f3a3f957e
update workflow
2024-11-04 18:38:32 +01:00
7c36735389
v0.20.2-dev.0 version
2024-11-04 18:36:40 +01:00
6c15458868
Bump actions versions ( #1669 )
...
* Update docs-check.yml
Bump actions/setup-python to v5
Bump python-version to 3.12 (default on ubuntu-latest)
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
* Update node-release.yml
Bump actions/setup-python to v5
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4
Bump actions/upload-artifact to v4
Bump actions/download-artifact to v4
* Update node.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4
* Update python-release-conda.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump conda-incubator/setup-miniconda to v3
* Update python-release.yml
Bump actions/setup-python to v5
Bump actions/download-artifact to v4
* Update rust-release.yml
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
* Update stale.yml
Bump actions/stale to v9
* Update python.yml
Bump actions/setup-python to v5
2024-11-01 10:19:35 +01:00
6ade8c2d21
PyO3 0.22 ( #1665 )
...
* PyO3 0.22
* Fix python stubs
* Remove name arg from PyModel::save Python signature
---------
Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com >
2024-11-01 10:17:23 +01:00
41e0eaa561
Bump actions/checkout to v4 ( #1667 )
...
Signed-off-by: tinyboxvk <tinyboxvk@users.noreply.github.com >
2024-10-29 14:32:07 +01:00
5512a424bf
Add safety comments ( #1651 )
...
* Unsafe comment for from_u32_unchecked
* Add safety comments and type assertion for HashSet parallel iteration
* Add safety comment for String splice
* fixes
* fmt
* pos
2024-10-29 09:44:06 +01:00
6ea758872d
Unsound call of set_var ( #1664 )
...
* refactor: lift cloning to caller
* refactor: do not elide lifetimes as in Rust 2018
* fix: unsound use of env::set_var, was breaking stdlib change to make unsafe
It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
2024-10-25 15:44:30 +02:00
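The pattern described in the message of 6ea758872d (keep an internal copy of the overridable value and only ever read from the environment) can be sketched as follows. This is a minimal illustration in Python, not the Rust code from the commit; the variable name `EXAMPLE_PARALLELISM` and the functions `set_parallelism` / `get_parallelism` are hypothetical names chosen for the example.

```python
import os
import threading
from typing import Optional

# Hypothetical names: ENV_VAR, set_parallelism and get_parallelism are
# illustrative only and are not taken from the commit.
ENV_VAR = "EXAMPLE_PARALLELISM"

_lock = threading.Lock()
_override: Optional[bool] = None  # library-internal copy of the setting


def set_parallelism(value: Optional[bool]) -> None:
    """Store the override inside the library instead of writing to os.environ."""
    global _override
    with _lock:
        _override = value


def get_parallelism(default: bool = True) -> bool:
    """Return the internal override if set; otherwise read (never write) the environment."""
    with _lock:
        if _override is not None:
            return _override
    raw = os.environ.get(ENV_VAR)
    if raw is None:
        return default
    # Treat a few common "falsy" spellings as False; anything else is True.
    return raw.strip().lower() not in {"", "0", "false", "off", "no"}
```

With this layout the process environment is never mutated by the library, so callers that need to change the setting at runtime go through `set_parallelism` rather than through the environment.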
a8738a95d1
Arg name correction: auth_token -> token ( #1621 )
...
* Arg name correction: auth_token -> token
* Arg name correction in .rs: auth_token -> token
* update from_pretrained.rs file as well
---------
Co-authored-by: Rene Ravenel <rene@Renes-MacBook-Pro.local >
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com >
2024-10-24 16:32:09 +02:00
9b77c054ef
Fix off-by-one error in tokenizer::normalizer::Range::len ( #1638 )
2024-10-14 08:40:17 +02:00
bce68a60cb
Bump cookie and express in /tokenizers/examples/unstable_wasm/www ( #1648 )
...
Bumps [cookie](https://github.com/jshttp/cookie ) and [express](https://github.com/expressjs/express ). These dependencies needed to be updated together.
Updates `cookie` from 0.6.0 to 0.7.1
- [Release notes](https://github.com/jshttp/cookie/releases )
- [Commits](https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.1 )
Updates `express` from 4.21.0 to 4.21.1
- [Release notes](https://github.com/expressjs/express/releases )
- [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md )
- [Commits](https://github.com/expressjs/express/compare/4.21.0...4.21.1 )
---
updated-dependencies:
- dependency-name: cookie
dependency-type: indirect
- dependency-name: express
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 15:30:24 +02:00
51826532d4
push new dev version
2024-10-10 12:00:16 +02:00
557fde76d8
style: simplify string formatting for readability ( #1632 )
2024-10-04 13:11:50 +02:00
3d51a1695f
Fix documentation build ( #1642 )
...
* use v4
* fix ruff
* style
2024-10-01 14:48:02 +02:00
294ab86fe0
Bump webpack in /tokenizers/examples/unstable_wasm/www ( #1641 )
...
Bumps [webpack](https://github.com/webpack/webpack ) from 5.76.0 to 5.95.0.
- [Release notes](https://github.com/webpack/webpack/releases )
- [Commits](https://github.com/webpack/webpack/compare/v5.76.0...v5.95.0 )
---
updated-dependencies:
- dependency-name: webpack
dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:17:23 +02:00
2b97a5e49e
Bump send and express in /tokenizers/examples/unstable_wasm/www ( #1631 )
...
Bumps [send](https://github.com/pillarjs/send ) and [express](https://github.com/expressjs/express ). These dependencies needed to be updated together.
Updates `send` from 0.18.0 to 0.19.0
- [Release notes](https://github.com/pillarjs/send/releases )
- [Changelog](https://github.com/pillarjs/send/blob/master/HISTORY.md )
- [Commits](https://github.com/pillarjs/send/compare/0.18.0...0.19.0 )
Updates `express` from 4.18.1 to 4.21.0
- [Release notes](https://github.com/expressjs/express/releases )
- [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md )
- [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0 )
---
updated-dependencies:
- dependency-name: send
dependency-type: indirect
- dependency-name: express
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:17:09 +02:00
077678d1d1
Bump serve-static and express in /tokenizers/examples/unstable_wasm/www ( #1630 )
...
Bumps [serve-static](https://github.com/expressjs/serve-static ) and [express](https://github.com/expressjs/express ). These dependencies needed to be updated together.
Updates `serve-static` from 1.15.0 to 1.16.2
- [Release notes](https://github.com/expressjs/serve-static/releases )
- [Changelog](https://github.com/expressjs/serve-static/blob/v1.16.2/HISTORY.md )
- [Commits](https://github.com/expressjs/serve-static/compare/v1.15.0...v1.16.2 )
Updates `express` from 4.18.1 to 4.21.0
- [Release notes](https://github.com/expressjs/express/releases )
- [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md )
- [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0 )
---
updated-dependencies:
- dependency-name: serve-static
dependency-type: indirect
- dependency-name: express
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:16:53 +02:00
2204066e78
Bump body-parser and express in /tokenizers/examples/unstable_wasm/www ( #1629 )
...
Bumps [body-parser](https://github.com/expressjs/body-parser ) and [express](https://github.com/expressjs/express ). These dependencies needed to be updated together.
Updates `body-parser` from 1.20.0 to 1.20.3
- [Release notes](https://github.com/expressjs/body-parser/releases )
- [Changelog](https://github.com/expressjs/body-parser/blob/master/HISTORY.md )
- [Commits](https://github.com/expressjs/body-parser/compare/1.20.0...1.20.3 )
Updates `express` from 4.18.1 to 4.21.0
- [Release notes](https://github.com/expressjs/express/releases )
- [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md )
- [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0 )
---
updated-dependencies:
- dependency-name: body-parser
dependency-type: indirect
- dependency-name: express
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:16:41 +02:00
3fb1371c1c
[ignore_merges] Fix offsets ( #1640 )
...
* Fix the default offset create
* update the tests
* clippy
2024-10-01 09:22:20 +02:00
b4a38c4f63
Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows ( #1626 )
...
Bumps [actions/download-artifact](https://github.com/actions/download-artifact ) from 3 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases )
- [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7 )
---
updated-dependencies:
- dependency-name: actions/download-artifact
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-30 16:38:28 +02:00
14a07b06e4
fix filelink ( #1610 )
2024-08-12 07:35:33 +02:00
75aef5b75b
Update README.md ( #1608 )
2024-08-09 10:40:21 +02:00
81c471cf17
update dev version 0.20.0
2024-08-08 18:11:50 +02:00
85cc05a32f
Fix CI ( #1607 )
2024-08-08 17:09:30 +02:00
bfd9cdeefb
Perf improvement 16% by removing offsets. ( #1587 )
...
* [Breaking Change] Perf improvement 16% by removing offsets.
Offsets are always calculated in Python land.
By not calculating them, we save 16% of the runtime.
This is not the total extent of it because offsets are
still calculated in bytes.
* Required features.
* Remove clippy error.
* Make it non breaking and still show perf improvement.
* Even faster without offsets.
* Update doc.
* Fmt.
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
* fmt.
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
2024-08-08 14:56:13 +02:00
bd27fa56d6
add deserialize for pre tokenizers ( #1603 )
...
* add deserialize
* copy from the decoder
* fmt
* clippy
* fix rust tests
* fmt
* don't change the test
2024-08-08 08:38:09 +02:00
56c9c70440
Tests + Deserialization improvement for normalizers. ( #1604 )
2024-08-08 08:38:02 +02:00
49dafd707e
Fix strip python type ( #1602 )
...
* update
* the fix
* Revert "update"
This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.
* add a test and rebase
* style
* oups
2024-08-07 15:36:28 +02:00
bded212356
Support None to reset pre_tokenizers and normalizers, and index sequences ( #1590 )
...
* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
2024-08-07 12:52:35 +02:00
eea8e1ae6f
Fix doc about split ( #1591 )
...
* update doc
* add example
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* stub
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-08-07 12:35:01 +02:00
6a5fce9fa0
Merges cannot handle tokens containing spaces. ( #909 )
...
* Merges cannot handle tokens containing spaces.
This fixes this while keeping backward support.
We don't want to merge that blindly.
* Update the tests.
* Fixing clippy.
* Add a test with spaces in the token/merge.
2024-08-07 12:34:53 +02:00
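The problem named in 6a5fce9fa0 can be illustrated with a small sketch. It assumes, hypothetically, that a legacy merge entry is a single string of the form "left right" and that a newer entry is an explicit two-element pair; the commit message does not spell out the actual serialization, so `parse_merges` and both formats here are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple, Union

Merge = Tuple[str, str]


def parse_merges(raw: Sequence[Union[str, Sequence[str]]]) -> List[Merge]:
    """Parse merge rules given either as "left right" strings (assumed legacy
    form) or as explicit [left, right] pairs (assumed newer form)."""
    merges: List[Merge] = []
    for entry in raw:
        if isinstance(entry, str):
            # The string form is only unambiguous when it contains exactly one
            # space, which is why tokens that contain spaces cannot use it.
            parts = entry.split(" ")
            if len(parts) != 2:
                raise ValueError(f"ambiguous legacy merge: {entry!r}")
            merges.append((parts[0], parts[1]))
        else:
            if len(entry) != 2:
                raise ValueError(f"merge pair must have two items: {entry!r}")
            left, right = entry
            merges.append((left, right))
    return merges
```

Accepting both shapes is one way to keep old files loadable while allowing tokens with spaces in new ones, for example `parse_merges(["a b", ["hello world", "!"]])`.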
ab9c7ded8b
Using serde (serde_pyo3) to get __str__ and __repr__ easily. ( #1588 )
...
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, it needs to be too specific.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589 )
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com >
2024-08-07 12:08:29 +02:00
7a30bca2f3
Updating error messages. ( #1599 )
2024-08-06 16:42:56 +02:00
8f2cc90249
Add test normalizers ( #1600 )
...
* update
* update tests, they pass
* fmt
2024-08-06 16:08:18 +02:00
fe41687ca8
Better serialization error ( #1595 )
...
* Updating the deserialization error for models.
* Update tokenizers/src/models/mod.rs
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
2024-08-06 13:39:11 +02:00
2d27761f60
Adding a few tests for decoder deserialization.
2024-08-06 13:36:36 +02:00
adc82cb49a
Add-legacy-tests ( #1597 )
...
* add tests
* decoder as well
* check error
* propagate
* lint
* refine the test
* lint
* revert decoder changes
* one more?
* fmt
* Update tokenizers/src/pre_tokenizers/mod.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* fix commit
* simplify err
* fmt
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-08-06 13:08:12 +02:00
99a48dcb46
Clippy.
2024-08-06 10:48:39 +02:00
5fb8a2320c
Legacy test.
2024-08-06 10:48:39 +02:00
388014fd6b
Adding some serialization testing around the wrapper.
2024-08-06 10:48:39 +02:00
7b80359dd2
Fixing release CI strict (taken from safetensors).
2024-08-06 09:11:30 +02:00
a010f6b75c
Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
...
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc
Using serde (serde_pyo3) to get __str__ and __repr__ easily.
2024-08-02 18:41:54 +02:00
7415e28536
Enabling the option to use fancy_regex instead of onig.
2024-08-01 15:53:17 +02:00
9e0c791f2b
Small performance fixup (negligible but obviously better).
2024-08-01 15:52:39 +02:00
1df498a186
Fixing benchmark2.
2024-08-01 15:52:39 +02:00
c6f2c0b057
Fixing the benchmark. ( #1583 )
2024-08-01 10:36:53 +02:00
35f338a7b8
Add benchmark vs tiktoken ( #1582 )
...
* Adding a simple tiktoken benchmark.
* Adding 1 large fused document case.
2024-07-31 17:09:23 +02:00