797 Commits

e5d781d5b9 update pyo3 and rust-numpy depends for no-gil/free-threading compat (#1774)
Signed-off-by: root <root@gpu-xl.lxd>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-05-27 11:31:58 +02:00
01f8bc834c clippy (#1781)
* clippy

* fmt

* rustc?

* fix onig issue

* up

* decode stream default

* jump a release for cargo audit ...

* more clippy stuff

* clippy?

* proper style

* fmt
2025-05-27 11:30:32 +02:00
23e7e42adf Fix data path in test_continuing_prefix_trainer_mismatch (#1747) 2025-05-27 08:48:27 +02:00
cc01186fd7 Fix type notation of merges in BPE Python binding (#1766) 2025-05-27 08:23:58 +02:00
f1faec1756 Fix typos in strings and comments (#1770) 2025-05-27 08:17:36 +02:00
4383a25787 Update the release builds following 0.21.1. (#1746)
* Update the release builds following 0.21.1.

* Clippy fix.
2025-03-13 13:01:41 +01:00
fbe3365a13 Update metadata as Python3.7 and Python3.8 support was dropped (#1724)
* Update metadata as python3.7 and python3.8 support was dropped

* Format pyproject.toml: unify quotes and indentation
2025-02-11 10:52:59 +01:00
c45aebd102 🚨 Support updating template processors (#1652)
* current updates

* simplify

* set_item works, but `tokenizer._tokenizer.post_processor[1].single = ["$0", "</s>"]` does not !

* fix: `normalizers` deserialization and other refactoring

* fix: `pre_tokenizer` deserialization

* feat: add `__len__` implementation for `normalizer::PySequence`

* feat: add `__setitem__` impl for `normalizers::PySequence`

* feat: add `__setitem__` impl to `pre_tokenizer::PySequence`

* feat: add `__setitem__` impl to `post_processor::PySequence`

* test: add normalizer sequence setter check

* refactor: allow unused `processors::setter` macro

* test: add `__setitem__` test for processors & pretok

* refactor: `unwrap` -> `PyException::new_err()?`

* refactor: fmt

* refactor: remove unnecessary `pub`

* feat(bindings): add missing getters & setters for pretoks

* feat(bindings): add missing getters & setters for processors

* refactor(bindings): rewrite RwLock poison error msg

* refactor: remove debug print

* feat(bindings): add description as to why custom deser is needed

* feat: make post proc sequence elements mutable

* fix(binding): serialization

---------

Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
2025-01-28 14:58:35 +01:00
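The `__len__`/`__setitem__` work in this PR follows the standard Python sequence protocol so that pipeline components can be replaced in place. A minimal pure-Python sketch of the pattern (hypothetical `ComponentSequence` name, not the actual binding code):

```python
class ComponentSequence:
    """Toy stand-in for the PySequence wrappers: a list of pipeline
    components that supports len(), indexing, and item assignment."""

    def __init__(self, components):
        self._components = list(components)

    def __len__(self):
        return len(self._components)

    def __getitem__(self, index):
        return self._components[index]

    def __setitem__(self, index, component):
        # Replacing an element mutates the underlying pipeline in place,
        # which is what the PR's setters enable on the Rust side.
        self._components[index] = component


seq = ComponentSequence(["lowercase", "nfkc"])
seq[1] = "nfc"  # item assignment, the behaviour the PR's tests check
assert len(seq) == 2 and seq[1] == "nfc"
```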
0ff2ab0f64 Fixing the stream by removing the read_index altogether. (#1716)
* Fixing the stream by removing the read_index altogether.

* Moving the test location because... Windows.

* Ok whatever.

* Rust 1.84

* Fmt.
2025-01-09 17:41:15 +01:00
bdfc38b78d Fix typos (#1715)
* Fix typos

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>

* Update docs/source/quicktour.rst

* Update docs/source-doc-builder/quicktour.mdx

---------

Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-01-09 11:53:20 +01:00
6945933829 update Split pretokenizer docstrings (#1701) 2025-01-08 12:35:52 +01:00
3a6504d274 Upgrade to PyO3 0.23 (#1708)
* Upgrade to PyO3 0.23

* Macos-12 deprecated?

* Clippy.

* Clippy auto elision.
2024-12-31 18:36:01 +01:00
24d29f498d Update dev version and pyproject.toml (#1693)
* update pyproject.toml

* update py dev version
2024-11-27 16:01:48 +01:00
1bf2a66b80 v0.20.4-dev0 2024-11-27 10:07:49 +01:00
ac34660e44 Fix encode_batch and encode_batch_fast to accept ndarrays again (#1679)
* Fix encode_batch and encode_batch_fast to accept ndarrays again

* Fix clippy

---------

Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com>
2024-11-21 11:55:11 +01:00
cc5fb01a2f Decode stream python (#1678)
* Python binding for decode stream

Different API because Python cannot handle lifetimes properly.

* Clippy.
2024-11-15 12:06:22 +01:00
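Conceptually, a decode stream keeps the ids seen so far and yields only the newly decoded suffix on each step. A rough pure-Python sketch of that idea (toy vocabulary and `ToyDecodeStream` name are illustrative, not the actual binding):

```python
class ToyDecodeStream:
    """Toy illustration of streaming decode: keep all ids, re-decode,
    and return only the text added since the last step."""

    def __init__(self, id_to_token):
        self.id_to_token = id_to_token
        self.ids = []
        self.prefix = ""

    def step(self, token_id):
        self.ids.append(token_id)
        text = "".join(self.id_to_token[i] for i in self.ids)
        new_text = text[len(self.prefix):]
        self.prefix = text
        # Return None when this id adds no visible text yet.
        return new_text or None


vocab = {0: "Hel", 1: "lo", 2: " world"}
stream = ToyDecodeStream(vocab)
chunks = [stream.step(i) for i in (0, 1, 2)]
assert chunks == ["Hel", "lo", " world"]
```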
f4c9fd7f40 Testing ABI3 wheels to reduce number of wheels (#1674)
* Testing ABI3 wheels to reduce number of wheels

* No need for py-clone  anymore.

* Upgrade python versions.

* Remove those flakes.

* Promoting new CI + Fixing secret.
2024-11-15 06:02:22 +01:00
c6b5c3eab7 More cache options. (#1675)
* More cache options.

* Fixing error messages.
2024-11-06 11:12:09 +01:00
57884ebaa2 [MINOR:TYPO] Fix docstrings (#1653)
* [MINOR:TYPO] Update pre_tokenizers.rs

* [MINOR:TYPO] Update __init__.pyi
2024-11-05 16:25:06 +01:00
5e223ceb48 fix pylist (#1673)
* fix pylist

* add comment about why we use PySequence

* style

* fix encode batch fast as well

* Update bindings/python/src/tokenizer.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* fix with capacity

* stub :)

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-11-05 16:24:23 +01:00
7c36735389 v0.20.2-dev.0 version 2024-11-04 18:36:40 +01:00
6ade8c2d21 PyO3 0.22 (#1665)
* PyO3 0.22

* Fix python stubs

* Remove name arg from PyModel::save Python signature

---------

Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com>
2024-11-01 10:17:23 +01:00
6ea758872d Unsound call of set_var (#1664)
* refactor: lift cloning to caller

* refactor: do not elide lifetimes as in Rust 2018

* fix: unsound use of env::set_var, was breaking stdlib change to make unsafe

It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
2024-10-25 15:44:30 +02:00
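The safe pattern described in the commit — read the environment, never write it, and keep a library-internal override — can be sketched in Python (the `ParallelismConfig` class is a hypothetical illustration, not the crate's code):

```python
import os


class ParallelismConfig:
    """Hold a library-internal override instead of mutating the process
    environment, which is unsound in multithreaded programs."""

    def __init__(self, env_var="TOKENIZERS_PARALLELISM"):
        self.env_var = env_var
        self._override = None  # set by the library, wins over the env

    def set(self, value):
        self._override = bool(value)

    def get(self):
        if self._override is not None:
            return self._override
        # Fall back to reading (never writing) the environment.
        return os.environ.get(self.env_var, "true").lower() in ("1", "true")


cfg = ParallelismConfig()
cfg.set(False)  # library-side override; no os.environ mutation
assert cfg.get() is False
```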
a8738a95d1 Arg name correction: auth_token -> token (#1621)
* Arg name correction: auth_token -> token

* Arg name correction in .rs: auth_token -> token

* update from_pretrained.rs file as well

---------

Co-authored-by: Rene Ravenel <rene@Renes-MacBook-Pro.local>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-10-24 16:32:09 +02:00
51826532d4 push new dev version 2024-10-10 12:00:16 +02:00
3d51a1695f Fix documentation build (#1642)
* use v4

* fix ruff

* style
2024-10-01 14:48:02 +02:00
81c471cf17 update dev version 0.20.0 2024-08-08 18:11:50 +02:00
bfd9cdeefb Perf improvement 16% by removing offsets. (#1587)
* [Breaking Change] Perf improvement 16% by removing offsets.

Offset calculations were always performed in Python land.
By skipping that step, we win 16% of the runtime.

This is not the total extent of it because offsets are
still calculated in bytes.

* Required features.

* Remove clippy error.

* Make it non breaking and still show perf improvement.

* Even faster without offsets.

* Update doc.

* Fmt.

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fmt.

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-08 14:56:13 +02:00
49dafd707e Fix strip python type (#1602)
* update

* the fix

* Revert "update"

This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.

* add a test and rebase

* style

* oups
2024-08-07 15:36:28 +02:00
bded212356 Support None to reset pre_tokenizers and normalizers, and index sequences (#1590)
* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* add a small test

* styling

* style merge

* fix merge test

* fmt

* nits

* update test
2024-08-07 12:52:35 +02:00
eea8e1ae6f Fix doc about split (#1591)
* update doc

* add example

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* stub

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-07 12:35:01 +02:00
ab9c7ded8b Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588)
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.

* Putting it within tokenizers, it needs to be too specific.

* Clippy is our friend.

* Ruff.

* Update the tests.

* Pretty sure this is wrong (#1589)

* Adding support for ellipsis.

* Fmt.

* Ruff.

* Fixing tokenizer.

---------

Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
2024-08-07 12:08:29 +02:00
a010f6b75c Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc Using serde (serde_pyo3) to get __str__ and __repr__ easily. 2024-08-02 18:41:54 +02:00
7415e28536 Enabling the option to use fancy_regex instead of onig. 2024-08-01 15:53:17 +02:00
1df498a186 Fixing benchmark2. 2024-08-01 15:52:39 +02:00
c6f2c0b057 Fixing the benchmark. (#1583) 2024-08-01 10:36:53 +02:00
35f338a7b8 Add benchmark vs tiktoken (#1582)
* Adding a simple tiktoken benchmark.

* Adding 1 large fused document case.
2024-07-31 17:09:23 +02:00
4ea2f235b0 Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555)
* feature dependent test

* nit about 嗎

* update

* actuallyfix it

* update the test

add it

fix

* stub

* Update tokenizers/src/pre_tokenizers/byte_level.rs

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* skip failing test

* add normalizer to init

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2024-07-15 12:12:03 +02:00
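The byte-level trick at issue here maps every byte to a printable unicode character so that decode can round-trip arbitrary added tokens. A sketch of the classic GPT-2-style mapping (the standard published algorithm, not this PR's exact code):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode char,
    keeping visible ASCII as-is and shifting the rest past U+0100."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # remap unprintable bytes past U+0100
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


table = bytes_to_unicode()
assert table[ord(" ")] == "\u0120"  # space becomes Ġ, as seen in BPE vocabs
assert len(table) == 256
```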
fdd26ba9a3 Enable dropout = 0.0 as an equivalent to none in BPE (#1550)
* enable dropout = 0.0

* typo

* lint

* formatter

* enable dropout = 0.0

* formatter
2024-06-24 12:36:11 +02:00
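The change treats `dropout = 0.0` as equivalent to no dropout at all. The validation logic can be sketched as (hypothetical helper name, not the crate's code):

```python
def normalize_dropout(dropout):
    """Treat 0.0 (or None) as 'no dropout'; reject out-of-range values."""
    if dropout is None or dropout == 0.0:
        return None  # no-op path: skip the RNG entirely
    if not 0.0 < dropout <= 1.0:
        raise ValueError("dropout must be in (0.0, 1.0]")
    return dropout


assert normalize_dropout(0.0) is None  # now accepted, same as None
assert normalize_dropout(None) is None
assert normalize_dropout(0.1) == 0.1
```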
1ff56c0c70 Fix 'dictionnary' typo (#1511) 2024-06-11 15:43:47 +02:00
88f51fe7d2 Switch from cached_download to hf_hub_download in tests (#1547) 2024-06-11 15:26:58 +02:00
f2ec3b239b remove enforcement of non special when adding tokens (#1521)
* remove enforcement of non special when adding tokens

* mut no longer needed

* add a small test

* nit

* style

* audit

* ignore cargo audit's own vulnerability

* update

* revert

* remove CVE
2024-04-30 15:53:47 +02:00
71c2a8d01a update dev version so 0.19.1 2024-04-17 23:17:12 +02:00
91393ef75e Fixing doc. (#1499)
* Fixing doc.

* SentencePieceUnigram and Convert.py still used sentencepiece

* stub

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-04-17 09:32:40 +02:00
949d9e3e0e Bumping all versions 3 times (ty transformers :) ) (#1498) 2024-04-16 15:58:36 +02:00
d5a8cc7a49 PyO3 0.21. (#1494)
* PyO3 0.21.

* Upgraded everything.

* Rustfmt.
2024-04-16 13:49:52 +02:00
914576f7ed Add more support for tiktoken based tokenizers (#1493)
* first commit

* update

* clippy

* lint

* clippy and lint

* fmt

* revert print

* 😈

* style

* add a test

* more fmt

* Use ignore_merges

* stub

* fix

* update

* Update tokenizers/src/models/bpe/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update

* rust lint

* don't repeat yourself

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-04-15 17:26:36 +02:00
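The `ignore_merges` option short-circuits BPE for inputs that already exist in the vocabulary, which tiktoken-style tokenizers rely on. A toy sketch of the idea (hypothetical mini-vocab and greedy merge loop, not the model's real algorithm):

```python
def bpe_tokenize(word, vocab, merges, ignore_merges=False):
    """Toy BPE: with ignore_merges, a word already in the vocab is
    returned whole instead of being re-merged from characters."""
    if ignore_merges and word in vocab:
        return [word]
    parts = list(word)
    # Greedily apply merge rules in priority order.
    for a, b in merges:
        i = 0
        while i < len(parts) - 1:
            if parts[i] == a and parts[i + 1] == b:
                parts[i:i + 2] = [a + b]
            else:
                i += 1
    return parts


vocab = {"hello"}
merges = [("h", "e"), ("he", "l")]
assert bpe_tokenize("hello", vocab, merges, ignore_merges=True) == ["hello"]
assert bpe_tokenize("hello", vocab, merges) == ["hel", "l", "o"]
```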
6e58f838b3 version = "0.16.0-dev.0" 2024-04-02 09:51:14 +02:00
09069717e9 Refactor metaspace (#1476)
* version = "0.15.3-dev-0"

Improve performances of meta space, but also just fix it.

(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]

* well what do we have

* nit

* be BC with non legacy

* unrelated change for clippy

* fix test

* splitting is a must for word_ids

* fmt and lint

* Fixing everything (hopefully better).

* Fixing node.

* Including yarn.lock

* Lint.

* Stubs.

* revert to use split

* fix merge issues

* fix tests

* finish fixing tests

* ruff

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-30 10:27:24 +01:00
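The Metaspace pre-tokenizer exercised by the timings above replaces spaces with the ▁ marker (U+2581) and optionally prepends one, then splits so that each piece carries its marker. A minimal, simplified sketch of that behaviour (assumptions: prepend only at the start, split before each marker; not the crate's actual implementation):

```python
def metaspace_pretokenize(text, replacement="\u2581", prepend=True):
    """Replace spaces with the ▁ marker, optionally prepend one at the
    start, then split so each piece starts with the marker."""
    s = text.replace(" ", replacement)
    if prepend and not s.startswith(replacement):
        s = replacement + s
    pieces, current = [], ""
    for ch in s:
        if ch == replacement and current:
            pieces.append(current)  # start a new piece at each marker
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


assert metaspace_pretokenize("Hey there") == ["\u2581Hey", "\u2581there"]
```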