Commit Graph

1844 Commits

Author SHA1 Message Date
eb4cc86d4e Bump cross-spawn from 6.0.5 to 6.0.6 in /bindings/node (#1687)
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 6.0.5 to 6.0.6.
- [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/v6.0.6/CHANGELOG.md)
- [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v6.0.5...v6.0.6)

---
updated-dependencies:
- dependency-name: cross-spawn
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-25 10:04:06 +01:00
ac34660e44 Fix encode_batch and encode_batch_fast to accept ndarrays again (#1679)
* Fix encode_batch and encode_batch_fast to accept ndarrays again

* Fix clippy

---------

Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com>
2024-11-21 11:55:11 +01:00
f0c48bd89a Update README.md with install from source 2024-11-15 21:51:39 +01:00
cc5fb01a2f Decode stream python (#1678)
* Python binding for decode stream

Different API because Python cannot handle lifetimes properly.

* Clippy.
2024-11-15 12:06:22 +01:00
500db282a8 Adding an API for decode streaming. (#1677)
* Adding an API for decode streaming.

* Add another missing test case (proving the effect of state.)

* Elide lifetime.

* Elide bis.

* Fixing the streaming implementation.

* Adding more docs.

* End of list.

* Fix internal link.

* Skip doctest on Windows (no tokenizer file because no make)
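
To illustrate why a streaming decoder has to carry state between calls (the "effect of state" mentioned above), here is a minimal, self-contained sketch. It is not the library's implementation: `ToyDecodeStream`, its `step` method, and the byte-level toy vocabulary are all hypothetical names made up for this example. The idea shown is only that a single token may hold part of a multi-byte UTF-8 character, so output has to be withheld until the buffered bytes form valid text.

```rust
/// Hypothetical toy decoder: each token id maps to a raw byte sequence.
pub struct ToyDecodeStream<'a> {
    vocab: &'a [&'a [u8]],
    /// State kept between calls: bytes that have not been emitted yet.
    pending: Vec<u8>,
}

impl<'a> ToyDecodeStream<'a> {
    pub fn new(vocab: &'a [&'a [u8]]) -> Self {
        Self { vocab, pending: Vec::new() }
    }

    /// Feed one token id. Returns `Some(text)` once the buffered bytes are
    /// valid UTF-8, and `None` while a character is still incomplete
    /// (or when the id is not in the toy vocabulary).
    pub fn step(&mut self, id: usize) -> Option<String> {
        let bytes = self.vocab.get(id)?;
        self.pending.extend_from_slice(bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(text) => {
                let out = text.to_owned();
                self.pending.clear();
                Some(out)
            }
            // `error_len() == None` means the input ended in the middle of a
            // character, so more bytes may still complete it.
            Err(e) if e.error_len().is_none() => None,
            // Truly invalid bytes: emit a lossy rendering and reset.
            Err(_) => {
                let out = String::from_utf8_lossy(&self.pending).into_owned();
                self.pending.clear();
                Some(out)
            }
        }
    }
}
```

With a toy vocabulary where ids 1 and 2 hold the two bytes of one accented character, the first call yields nothing and the second yields the full character, which a stateless per-token decode could not do.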
2024-11-15 06:02:38 +01:00
f4c9fd7f40 Testing ABI3 wheels to reduce number of wheels (#1674)
* Testing ABI3 wheels to reduce number of wheels

* No need for py-clone anymore.

* Upgrade python versions.

* Remove those flakes.

* Promoting new CI + Fixing secret.
2024-11-15 06:02:22 +01:00
5aa9f6cff0 Disable caching for long strings. (#1676) 2024-11-07 14:36:27 +01:00
c6b5c3eab7 More cache options. (#1675)
* More cache options.

* Fixing error messages.
2024-11-06 11:12:09 +01:00
1740bff7a6 Revert "Upgrade python versions."
This reverts commit b81ec467a6.
2024-11-06 13:18:03 +08:00
b81ec467a6 Upgrade python versions. 2024-11-06 13:17:22 +08:00
57884ebaa2 [MINOR:TYPO] Fix docstrings (#1653)
* [MINOR:TYPO] Update pre_tokenizers.rs

* [MINOR:TYPO] Update __init__.pyi
2024-11-05 16:25:06 +01:00
5e223ceb48 fix pylist (#1673)
* fix pylist

* add comment about why we use PySequence

* style

* fix encode batch fast as well

* Update bindings/python/src/tokenizer.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* fix with capacity

* stub :)

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-11-05 16:24:23 +01:00
0f3a3f957e update workflow 2024-11-04 18:38:32 +01:00
7c36735389 v0.20.2-dev.0 version 2024-11-04 18:36:40 +01:00
6c15458868 Bump actions versions (#1669)
* Update docs-check.yml

Bump actions/setup-python to v5
Bump python-version to 3.12 (default on ubuntu-latest)
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained

* Update node-release.yml

Bump actions/setup-python to v5
Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4
Bump actions/upload-artifact to v4
Bump actions/download-artifact to v4

* Update node.yml

Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4
Bump actions/setup-node to v4

* Update python-release-conda.yml

Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump conda-incubator/setup-miniconda to v3

* Update python-release.yml

Bump actions/setup-python to v5
Bump actions/download-artifact to v4

* Update rust-release.yml

Switch actions-rs/toolchain to dtolnay/rust-toolchain as the former one is no longer maintained
Bump actions/cache to v4

* Update stale.yml

Bump actions/stale to v9

* Update python.yml

Bump actions/setup-python to v5
2024-11-01 10:19:35 +01:00
6ade8c2d21 PyO3 0.22 (#1665)
* PyO3 0.22

* Fix python stubs

* Remove name arg from PyModel::save Python signature

---------

Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com>
2024-11-01 10:17:23 +01:00
41e0eaa561 Bump actions/checkout to v4 (#1667)
Signed-off-by: tinyboxvk <tinyboxvk@users.noreply.github.com>
2024-10-29 14:32:07 +01:00
5512a424bf Add safety comments (#1651)
* Unsafe comment for from_u32_unchecked

* Add safety comments and type assertion for HashSet parallel iteration

* Add safety comment for String splice

* fixes

* fmt

* pos
2024-10-29 09:44:06 +01:00
6ea758872d Unsound call of set_var (#1664)
* refactor: lift cloning to caller

* refactor: do not elide lifetimes as in Rust 2018

* fix: unsound use of env::set_var, was breaking stdlib change to make unsafe

It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
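
The pattern described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from this commit: the variable name `EXAMPLE_PARALLELISM`, the functions `set_parallelism` and `parallelism_enabled`, and the default of `true` are all hypothetical. The point is that the override lives in library-internal state, and the environment is only ever read.

```rust
use std::env;
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical environment variable name, used only for this illustration.
const ENV_VAR: &str = "EXAMPLE_PARALLELISM";

// Library-internal copy of the setting.
// 0 = no override set, 1 = forced off, 2 = forced on.
static OVERRIDE: AtomicU8 = AtomicU8::new(0);

/// Record an override inside the library instead of calling `env::set_var`.
pub fn set_parallelism(enabled: bool) {
    OVERRIDE.store(if enabled { 2 } else { 1 }, Ordering::SeqCst);
}

/// Resolve the effective value: the internal override wins, then the
/// environment (read-only), then an assumed default of `true`.
pub fn parallelism_enabled() -> bool {
    match OVERRIDE.load(Ordering::SeqCst) {
        1 => false,
        2 => true,
        _ => match env::var(ENV_VAR) {
            Ok(value) => !matches!(
                value.to_lowercase().as_str(),
                "" | "off" | "false" | "f" | "no" | "n" | "0"
            ),
            Err(_) => true,
        },
    }
}
```

Because nothing writes to the process environment, other threads reading environment variables at the same time are not affected by a call to `set_parallelism`.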
2024-10-25 15:44:30 +02:00
a8738a95d1 Arg name correction: auth_token -> token (#1621)
* Arg name correction: auth_token -> token

* Arg name correction in .rs: auth_token -> token

* update from_pretrained.rs file as well

---------

Co-authored-by: Rene Ravenel <rene@Renes-MacBook-Pro.local>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-10-24 16:32:09 +02:00
9b77c054ef Fix off-by-one error in tokenizer::normalizer::Range::len (#1638) 2024-10-14 08:40:17 +02:00
bce68a60cb Bump cookie and express in /tokenizers/examples/unstable_wasm/www (#1648)
Bumps [cookie](https://github.com/jshttp/cookie) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `cookie` from 0.6.0 to 0.7.1
- [Release notes](https://github.com/jshttp/cookie/releases)
- [Commits](https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.1)

Updates `express` from 4.21.0 to 4.21.1
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.21.0...4.21.1)

---
updated-dependencies:
- dependency-name: cookie
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 15:30:24 +02:00
51826532d4 push new dev version 2024-10-10 12:00:16 +02:00
557fde76d8 style: simplify string formatting for readability (#1632) 2024-10-04 13:11:50 +02:00
3d51a1695f Fix documentation build (#1642)
* use v4

* fix ruff

* style
2024-10-01 14:48:02 +02:00
294ab86fe0 Bump webpack in /tokenizers/examples/unstable_wasm/www (#1641)
Bumps [webpack](https://github.com/webpack/webpack) from 5.76.0 to 5.95.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.76.0...v5.95.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:17:23 +02:00
2b97a5e49e Bump send and express in /tokenizers/examples/unstable_wasm/www (#1631)
Bumps [send](https://github.com/pillarjs/send) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `send` from 0.18.0 to 0.19.0
- [Release notes](https://github.com/pillarjs/send/releases)
- [Changelog](https://github.com/pillarjs/send/blob/master/HISTORY.md)
- [Commits](https://github.com/pillarjs/send/compare/0.18.0...0.19.0)

Updates `express` from 4.18.1 to 4.21.0
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0)

---
updated-dependencies:
- dependency-name: send
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:17:09 +02:00
077678d1d1 Bump serve-static and express in /tokenizers/examples/unstable_wasm/www (#1630)
Bumps [serve-static](https://github.com/expressjs/serve-static) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `serve-static` from 1.15.0 to 1.16.2
- [Release notes](https://github.com/expressjs/serve-static/releases)
- [Changelog](https://github.com/expressjs/serve-static/blob/v1.16.2/HISTORY.md)
- [Commits](https://github.com/expressjs/serve-static/compare/v1.15.0...v1.16.2)

Updates `express` from 4.18.1 to 4.21.0
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0)

---
updated-dependencies:
- dependency-name: serve-static
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:16:53 +02:00
2204066e78 Bump body-parser and express in /tokenizers/examples/unstable_wasm/www (#1629)
Bumps [body-parser](https://github.com/expressjs/body-parser) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `body-parser` from 1.20.0 to 1.20.3
- [Release notes](https://github.com/expressjs/body-parser/releases)
- [Changelog](https://github.com/expressjs/body-parser/blob/master/HISTORY.md)
- [Commits](https://github.com/expressjs/body-parser/compare/1.20.0...1.20.3)

Updates `express` from 4.18.1 to 4.21.0
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0)

---
updated-dependencies:
- dependency-name: body-parser
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:16:41 +02:00
3fb1371c1c [ignore_merges] Fix offsets (#1640)
* Fix the default offset create

* update the tests

* clippy
2024-10-01 09:22:20 +02:00
b4a38c4f63 Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows (#1626)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-30 16:38:28 +02:00
14a07b06e4 fix filelink (#1610) 2024-08-12 07:35:33 +02:00
75aef5b75b Update README.md (#1608) 2024-08-09 10:40:21 +02:00
81c471cf17 update dev version 0.20.0 2024-08-08 18:11:50 +02:00
85cc05a32f Fix CI (#1607) 2024-08-08 17:09:30 +02:00
bfd9cdeefb Perf improvement 16% by removing offsets. (#1587)
* [Breaking Change] Perf improvement 16% by removing offsets.

Offsets are always calculated in Python land.
By skipping that calculation, we save 16% of the runtime.

This is not the total extent of it because offsets are
still calculated in bytes.
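
As a rough illustration of why offset bookkeeping is not free, the sketch below converts a byte-based span into a character-based one. It is an assumption-laden example, not the library's code: `byte_span_to_char_span` is a hypothetical helper, and it only shows that the conversion has to walk the text, which is work that can be skipped when offsets are not needed.

```rust
/// Hypothetical helper: convert a byte-offset span into a char-offset span.
/// Returns `None` if the span is out of range, reversed, or does not fall
/// on UTF-8 character boundaries.
pub fn byte_span_to_char_span(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end
        || end > text.len()
        || !text.is_char_boundary(start)
        || !text.is_char_boundary(end)
    {
        return None;
    }
    // Counting chars is linear in the length of the prefix and of the span.
    let char_start = text[..start].chars().count();
    let char_end = char_start + text[start..end].chars().count();
    Some((char_start, char_end))
}
```

For a string such as `"héllo"`, the accented letter occupies two bytes, so byte span `(1, 3)` corresponds to char span `(1, 2)`.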

* Required features.

* Remove clippy error.

* Make it non breaking and still show perf improvement.

* Even faster without offsets.

* Update doc.

* Fmt.

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fmt.

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-08 14:56:13 +02:00
bd27fa56d6 add deserialize for pre tokenizers (#1603)
* add deserialize

* copy from the decoder

* fmt

* clippy

* fix rust tests

* fmt

* don't change the test
2024-08-08 08:38:09 +02:00
56c9c70440 Tests + Deserialization improvement for normalizers. (#1604) 2024-08-08 08:38:02 +02:00
49dafd707e Fix strip python type (#1602)
* update

* the fix

* Revert "update"

This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.

* add a test and rebase

* style

* oops
2024-08-07 15:36:28 +02:00
bded212356 Support None to reset pre_tokenizers and normalizers, and index sequences (#1590)
* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* add a small test

* styling

* style merge

* fix merge test

* fmt

* nits

* update test
2024-08-07 12:52:35 +02:00
eea8e1ae6f Fix doc about split (#1591)
* update doc

* add example

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* stub

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-07 12:35:01 +02:00
6a5fce9fa0 Merges cannot handle tokens containing spaces. (#909)
* Merges cannot handle tokens containing spaces.

This fixes this while keeping backward support.
We don't want to merge that blindly.
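
The problem and the backward-compatible fix can be sketched as below. This is a simplified illustration under assumptions, not the code from this commit: `MergeEntry`, `parse_legacy_merge`, and `resolve` are hypothetical names. It assumes a legacy representation where a merge is one string with the two tokens separated by a single space, and a newer representation where the two tokens are stored separately.

```rust
/// Hypothetical representation of one merge rule in either format.
pub enum MergeEntry {
    /// Legacy form: both tokens in one string, separated by a space.
    Legacy(String),
    /// Explicit pair: unambiguous even when a token contains spaces.
    Pair(String, String),
}

/// Parse a legacy line of the form "left right".
/// Returns `None` when the line does not split into exactly two parts,
/// which is what happens when either token itself contains a space.
pub fn parse_legacy_merge(line: &str) -> Option<(String, String)> {
    let mut parts = line.split(' ');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(left), Some(right), None) => Some((left.to_owned(), right.to_owned())),
        _ => None,
    }
}

/// Accept both formats, so data in the legacy form keeps loading.
pub fn resolve(entry: &MergeEntry) -> Option<(String, String)> {
    match entry {
        MergeEntry::Legacy(line) => parse_legacy_merge(line),
        MergeEntry::Pair(left, right) => Some((left.clone(), right.clone())),
    }
}
```

In this sketch the legacy string `"a b c"` cannot be resolved because it is ambiguous, while the pair `("a b", "c")` carries the same tokens without ambiguity.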

* Update the tests.

* Fixing clippy.

* Add a test with spaces in the token/merge.
2024-08-07 12:34:53 +02:00
ab9c7ded8b Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588)
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.

* Putting it within tokenizers, it needs to be too specific.

* Clippy is our friend.

* Ruff.

* Update the tests.

* Pretty sure this is wrong (#1589)

* Adding support for ellipsis.

* Fmt.

* Ruff.

* Fixing tokenizer.

---------

Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
2024-08-07 12:08:29 +02:00
7a30bca2f3 Updating error messages. (#1599) 2024-08-06 16:42:56 +02:00
8f2cc90249 Add test normalizers (#1600)
* update

* update tests, they pass

* fmt
2024-08-06 16:08:18 +02:00
fe41687ca8 Better serialization error (#1595)
* Updating the deserialization error for models.

* Update tokenizers/src/models/mod.rs

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-06 13:39:11 +02:00
2d27761f60 Adding a few tests for decoder deserialization. 2024-08-06 13:36:36 +02:00
adc82cb49a Add-legacy-tests (#1597)
* add tests

* decoder as well

* check error

* propagate

* lint

* refine the test

* lint

* revert decoder changes

* one more?

* fmt

* Update tokenizers/src/pre_tokenizers/mod.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* fix commit

* simplify err

* fmt

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-06 13:08:12 +02:00
99a48dcb46 Clippy. 2024-08-06 10:48:39 +02:00
5fb8a2320c Legacy test. 2024-08-06 10:48:39 +02:00