797 Commits

e5d781d5b9 update pyo3 and rust-numpy depends for no-gil/free-threading compat (#1774)
Signed-off-by: root <root@gpu-xl.lxd>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-05-27 11:31:58 +02:00
01f8bc834c clippy (#1781)
* clippy

* fmt

* rustc?

* fix onig issue

* up

* decode stream default

* jump a release for cargo audit ...

* more clippy stuff

* clippy?

* proper style

* fmt
2025-05-27 11:30:32 +02:00
23e7e42adf Fix data path in test_continuing_prefix_trainer_mismatch (#1747) 2025-05-27 08:48:27 +02:00
cc01186fd7 Fix type notation of merges in BPE Python binding (#1766) 2025-05-27 08:23:58 +02:00
f1faec1756 Fix typos in strings and comments (#1770) 2025-05-27 08:17:36 +02:00
4383a25787 Update the release builds following 0.21.1. (#1746)
* Update the release builds following 0.21.1.

* Clippy fix.
2025-03-13 13:01:41 +01:00
fbe3365a13 Update metadata as Python3.7 and Python3.8 support was dropped (#1724)
* Update metadata as python3.7 and python3.8 support was dropped

* Format pyproject.toml: unify quotes and indentation
2025-02-11 10:52:59 +01:00
c45aebd102 🚨 Support updating template processors (#1652)
* current updates

* simplify

* set_item works, but `tokenizer._tokenizer.post_processor[1].single = ["$0", "</s>"]` does not !

* fix: `normalizers` deserialization and other refactoring

* fix: `pre_tokenizer` deserialization

* feat: add `__len__` implementation for `normalizer::PySequence`

* feat: add `__setitem__` impl for `normalizers::PySequence`

* feat: add `__setitem__` impl to `pre_tokenizer::PySequence`

* feat: add `__setitem__` impl to `post_processor::PySequence`

* test: add normalizer sequence setter check

* refactor: allow unused `processors::setter` macro

* test: add `__setitem__` test for processors & pretok

* refactor: `unwrap` -> `PyException::new_err()?`

* refactor: fmt

* refactor: remove unnecessary `pub`

* feat(bindings): add missing getters & setters for pretoks

* feat(bindings): add missing getters & setters for processors

* refactor(bindings): rewrite RwLock poison error msg

* refactor: remove debug print

* feat(bindings): add description as to why custom deser is needed

* feat: make post proc sequence elements mutable

* fix(binding): serialization

---------

Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
2025-01-28 14:58:35 +01:00
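The `__len__`/`__setitem__` work in this PR follows the standard Python sequence protocol so that pipeline components can be replaced in place. A minimal pure-Python sketch of the pattern (hypothetical `ComponentSequence` name, not the actual binding code):

```python
class ComponentSequence:
    """Toy stand-in for the PySequence wrappers: a list of pipeline
    components that supports len(), indexing, and item assignment."""

    def __init__(self, components):
        self._components = list(components)

    def __len__(self):
        return len(self._components)

    def __getitem__(self, index):
        return self._components[index]

    def __setitem__(self, index, component):
        # Replacing an element mutates the underlying pipeline in place,
        # which is what the PR's setters enable on the Rust side.
        self._components[index] = component


seq = ComponentSequence(["lowercase", "nfkc"])
seq[1] = "nfc"  # item assignment, the behaviour the PR's tests check
assert len(seq) == 2 and seq[1] == "nfc"
```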
0ff2ab0f64 Fixing the stream by removing the read_index altogether. (#1716)
* Fixing the stream by removing the read_index altogether.

* Moving the test location because... Windows.

* Ok whatever.

* Rust 1.84

* Fmt.
2025-01-09 17:41:15 +01:00
bdfc38b78d Fix typos (#1715)
* Fix typos

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>

* Update docs/source/quicktour.rst

* Update docs/source-doc-builder/quicktour.mdx

---------

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-01-09 11:53:20 +01:00
6945933829 update Split pretokenizer docstrings (#1701) 2025-01-08 12:35:52 +01:00
3a6504d274 Upgrade to PyO3 0.23 (#1708)
* Upgrade to PyO3 0.23

* Macos-12 deprecated?

* Clippy.

* Clippy auto elision.
2024-12-31 18:36:01 +01:00
24d29f498d Update dev version and pyproject.toml (#1693)
* update pyproject.toml

* update py dev version
2024-11-27 16:01:48 +01:00
1bf2a66b80 v0.20.4-dev0 2024-11-27 10:07:49 +01:00
ac34660e44 Fix encode_batch and encode_batch_fast to accept ndarrays again (#1679)
* Fix encode_batch and encode_batch_fast to accept ndarrays again

* Fix clippy

---------

Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com>
2024-11-21 11:55:11 +01:00
cc5fb01a2f Decode stream python (#1678)
* Python binding for decode stream

Different API because Python cannot handle lifetimes properly.

* Clippy.
2024-11-15 12:06:22 +01:00
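Conceptually, a decode stream keeps the ids seen so far and yields only the newly decoded suffix on each step. A rough pure-Python sketch of that idea (toy vocabulary and `ToyDecodeStream` name are illustrative, not the actual binding):

```python
class ToyDecodeStream:
    """Toy illustration of streaming decode: keep all ids, re-decode,
    and return only the text added since the last step."""

    def __init__(self, id_to_token):
        self.id_to_token = id_to_token
        self.ids = []
        self.prefix = ""

    def step(self, token_id):
        self.ids.append(token_id)
        text = "".join(self.id_to_token[i] for i in self.ids)
        new_text = text[len(self.prefix):]
        self.prefix = text
        # Return None when this id adds no visible text yet.
        return new_text or None


vocab = {0: "Hel", 1: "lo", 2: " world"}
stream = ToyDecodeStream(vocab)
chunks = [stream.step(i) for i in (0, 1, 2)]
assert chunks == ["Hel", "lo", " world"]
```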
f4c9fd7f40 Testing ABI3 wheels to reduce number of wheels (#1674)
* Testing ABI3 wheels to reduce number of wheels

* No need for py-clone  anymore.

* Upgrade python versions.

* Remove those flakes.

* Promoting new CI + Fixing secret.
2024-11-15 06:02:22 +01:00
c6b5c3eab7 More cache options. (#1675)
* More cache options.

* Fixing error messages.
2024-11-06 11:12:09 +01:00
57884ebaa2 [MINOR:TYPO] Fix docstrings (#1653)
* [MINOR:TYPO] Update pre_tokenizers.rs

* [MINOR:TYPO] Update __init__.pyi
2024-11-05 16:25:06 +01:00
5e223ceb48 fix pylist (#1673)
* fix pylist

* add comment about why we use PySequence

* style

* fix encode batch fast as well

* Update bindings/python/src/tokenizer.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* fix with capacity

* stub :)

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-11-05 16:24:23 +01:00
7c36735389 v0.20.2-dev.0 version 2024-11-04 18:36:40 +01:00
6ade8c2d21 PyO3 0.22 (#1665)
* PyO3 0.22

* Fix python stubs

* Remove name arg from PyModel::save Python signature

---------

Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com>
2024-11-01 10:17:23 +01:00
6ea758872d Unsound call of set_var (#1664)
* refactor: lift cloning to caller

* refactor: do not elide lifetimes as in Rust 2018

* fix: unsound use of env::set_var, was breaking stdlib change to make unsafe

It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
2024-10-25 15:44:30 +02:00
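The safe pattern described in the commit — read the environment, never write it, and keep a library-internal override — can be sketched in Python (the `ParallelismConfig` class is a hypothetical illustration, not the crate's code):

```python
import os


class ParallelismConfig:
    """Hold a library-internal override instead of mutating the process
    environment, which is unsound in multithreaded programs."""

    def __init__(self, env_var="TOKENIZERS_PARALLELISM"):
        self.env_var = env_var
        self._override = None  # set by the library, wins over the env

    def set(self, value):
        self._override = bool(value)

    def get(self):
        if self._override is not None:
            return self._override
        # Fall back to reading (never writing) the environment.
        return os.environ.get(self.env_var, "true").lower() in ("1", "true")


cfg = ParallelismConfig()
cfg.set(False)  # library-side override; no os.environ mutation
assert cfg.get() is False
```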
a8738a95d1 Arg name correction: auth_token -> token (#1621)
* Arg name correction: auth_token -> token

* Arg name correction in .rs: auth_token -> token

* update from_pretrained.rs file as well

---------

Co-authored-by: Rene Ravenel <rene@Renes-MacBook-Pro.local>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-10-24 16:32:09 +02:00
51826532d4 push new dev version 2024-10-10 12:00:16 +02:00
3d51a1695f Fix documentation build (#1642)
* use v4

* fix ruff

* style
2024-10-01 14:48:02 +02:00
81c471cf17 update dev version 0.20.0 2024-08-08 18:11:50 +02:00
bfd9cdeefb Perf improvement 16% by removing offsets. (#1587)
* [Breaking Change] Perf improvement 16% by removing offsets.

Offset calculations were always performed in Python land.
By skipping that step, we win 16% of the runtime.

This is not the total extent of it because offsets are
still calculated in bytes.

* Required features.

* Remove clippy error.

* Make it non breaking and still show perf improvement.

* Even faster without offsets.

* Update doc.

* Fmt.

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fmt.

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-08 14:56:13 +02:00
49dafd707e Fix strip python type (#1602)
* update

* the fix

* Revert "update"

This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.

* add a test and rebase

* style

* oups
2024-08-07 15:36:28 +02:00
bded212356 Support None to reset pre_tokenizers and normalizers, and index sequences (#1590)
* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* add a small test

* styling

* style merge

* fix merge test

* fmt

* nits

* update test
2024-08-07 12:52:35 +02:00
eea8e1ae6f Fix doc about split (#1591)
* update doc

* add example

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* stub

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-07 12:35:01 +02:00
ab9c7ded8b Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588)
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.

* Putting it within tokenizers, it needs to be too specific.

* Clippy is our friend.

* Ruff.

* Update the tests.

* Pretty sure this is wrong (#1589)

* Adding support for ellipsis.

* Fmt.

* Ruff.

* Fixing tokenizer.

---------

Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
2024-08-07 12:08:29 +02:00
a010f6b75c Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc Using serde (serde_pyo3) to get __str__ and __repr__ easily. 2024-08-02 18:41:54 +02:00
7415e28536 Enabling the option to use fancy_regex instead of onig. 2024-08-01 15:53:17 +02:00
1df498a186 Fixing benchmark2. 2024-08-01 15:52:39 +02:00
c6f2c0b057 Fixing the benchmark. (#1583) 2024-08-01 10:36:53 +02:00
35f338a7b8 Add benchmark vs tiktoken (#1582)
* Adding a simple tiktoken benchmark.

* Adding 1 large fused document case.
2024-07-31 17:09:23 +02:00
4ea2f235b0 Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555)
* feature dependent test

* nit about 嗎

* update

* actuallyfix it

* update the test

add it

fix

* stub

* Update tokenizers/src/pre_tokenizers/byte_level.rs

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* skip failing test

* add normalizer to init

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2024-07-15 12:12:03 +02:00
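The byte-level trick at issue here maps every byte to a printable unicode character so that decode can round-trip arbitrary added tokens. A sketch of the classic GPT-2-style mapping (the standard published algorithm, not this PR's exact code):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode char,
    keeping visible ASCII as-is and shifting the rest past U+0100."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # remap unprintable bytes past U+0100
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


table = bytes_to_unicode()
assert table[ord(" ")] == "\u0120"  # space becomes Ġ, as seen in BPE vocabs
assert len(table) == 256
```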
fdd26ba9a3 Enable dropout = 0.0 as an equivalent to none in BPE (#1550)
* enable dropout = 0.0

* typo

* lint

* formatter

* enable dropout = 0.0

* formatter
2024-06-24 12:36:11 +02:00
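The change treats `dropout = 0.0` as equivalent to no dropout at all. The validation logic can be sketched as (hypothetical helper name, not the crate's code):

```python
def normalize_dropout(dropout):
    """Treat 0.0 (or None) as 'no dropout'; reject out-of-range values."""
    if dropout is None or dropout == 0.0:
        return None  # no-op path: skip the RNG entirely
    if not 0.0 < dropout <= 1.0:
        raise ValueError("dropout must be in (0.0, 1.0]")
    return dropout


assert normalize_dropout(0.0) is None  # now accepted, same as None
assert normalize_dropout(None) is None
assert normalize_dropout(0.1) == 0.1
```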
1ff56c0c70 Fix 'dictionnary' typo (#1511) 2024-06-11 15:43:47 +02:00
88f51fe7d2 Switch from cached_download to hf_hub_download in tests (#1547) 2024-06-11 15:26:58 +02:00
f2ec3b239b remove enforcement of non special when adding tokens (#1521)
* remove enforcement of non special when adding tokens

* mut no longer needed

* add a small test

* nit

* style

* audit

* ignore cargo audit's own vulnerability

* update

* revert

* remove CVE
2024-04-30 15:53:47 +02:00
71c2a8d01a update dev version so 0.19.1 2024-04-17 23:17:12 +02:00
91393ef75e Fixing doc. (#1499)
* Fixing doc.

* SentencePieceUnigram and Convert.py still used sentencepiece

* stub

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-04-17 09:32:40 +02:00
949d9e3e0e Bumping all versions 3 times (ty transformers :) ) (#1498) 2024-04-16 15:58:36 +02:00
d5a8cc7a49 PyO3 0.21. (#1494)
* PyO3 0.21.

* Upgraded everything.

* Rustfmt.
2024-04-16 13:49:52 +02:00
914576f7ed Add more support for tiktoken based tokenizers (#1493)
* first commit

* update

* clippy

* lint

* clippy and lint

* fmt

* revert print

* 😈

* style

* add a test

* more fmt

* Use ignore_merges

* stub

* fix

* update

* Update tokenizers/src/models/bpe/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update

* rust lint

* don't repeat yourself

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-04-15 17:26:36 +02:00
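The `ignore_merges` option short-circuits BPE for inputs that already exist in the vocabulary, which tiktoken-style tokenizers rely on. A toy sketch of the idea (hypothetical mini-vocab and greedy merge loop, not the model's real algorithm):

```python
def bpe_tokenize(word, vocab, merges, ignore_merges=False):
    """Toy BPE: with ignore_merges, a word already in the vocab is
    returned whole instead of being re-merged from characters."""
    if ignore_merges and word in vocab:
        return [word]
    parts = list(word)
    # Greedily apply merge rules in priority order.
    for a, b in merges:
        i = 0
        while i < len(parts) - 1:
            if parts[i] == a and parts[i + 1] == b:
                parts[i:i + 2] = [a + b]
            else:
                i += 1
    return parts


vocab = {"hello"}
merges = [("h", "e"), ("he", "l")]
assert bpe_tokenize("hello", vocab, merges, ignore_merges=True) == ["hello"]
assert bpe_tokenize("hello", vocab, merges) == ["hel", "l", "o"]
```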
6e58f838b3 version = "0.16.0-dev.0" 2024-04-02 09:51:14 +02:00
09069717e9 Refactor metaspace (#1476)
* version = "0.15.3-dev-0"

Improve performances of meta space, but also just fix it.

(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]

* well what do we have

* nit

* be BC with non legacy

* unrelated change for clippy

* fix test

* splitting is a must for word_ids

* fmt and lint

* Fixing everything (hopefully better).

* Fixing node.

* Including yarn.lock

* Lint.

* Stubs.

* revert to use split

* fix merge issues

* fix tests

* finish fixing tests

* ruff

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-30 10:27:24 +01:00
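The Metaspace pre-tokenizer exercised by the timings above replaces spaces with the ▁ marker (U+2581) and optionally prepends one, then splits so that each piece carries its marker. A minimal, simplified sketch of that behaviour (assumptions: prepend only at the start, split before each marker; not the crate's actual implementation):

```python
def metaspace_pretokenize(text, replacement="\u2581", prepend=True):
    """Replace spaces with the ▁ marker, optionally prepend one at the
    start, then split so each piece starts with the marker."""
    s = text.replace(" ", replacement)
    if prepend and not s.startswith(replacement):
        s = replacement + s
    pieces, current = [], ""
    for ch in s:
        if ch == replacement and current:
            pieces.append(current)  # start a new piece at each marker
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


assert metaspace_pretokenize("Hey there") == ["\u2581Hey", "\u2581there"]
```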