tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-22 16:25:30 +00:00

Author	SHA1	Message	Date
Arthur Zucker	7c36735389	v0.20.2-dev.0 version	2024-11-04 18:36:40 +01:00
Manish Goregaokar	5512a424bf	Add safety comments (#1651 ) * Unsafe comment for from_u32_unchecked * Add safety comments and type assertion for HashSet parallel iteration * Add safety comment for String splice * fixes * fmt * pos	2024-10-29 09:44:06 +01:00
sftse	6ea758872d	Unsound call of `set_var` (#1664 ) * refactor: lift cloning to caller * refactor: do not elide lifetimes as in Rust 2018 * fix: unsound use of env::set_var, was breaking stdlib change to make unsafe It is generally not safe to set env variables. The correct way to set a config value that needs to be overridden is to hold a copy internal to the library and only read from the environment.	2024-10-25 15:44:30 +02:00
rravenel	a8738a95d1	Arg name correction: auth_token -> token (#1621 ) * Arg name correction: auth_token -> token * Arg name correction in .rs: auth_token -> token * update from_pretrained.rs file as well --------- Co-authored-by: Rene Ravenel <rene@Renes-MacBook-Pro.local> Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>	2024-10-24 16:32:09 +02:00
Ryan Landay	9b77c054ef	Fix off-by-one error in tokenizer::normalizer::Range::len (#1638 )	2024-10-14 08:40:17 +02:00
dependabot[bot]	bce68a60cb	Bump cookie and express in /tokenizers/examples/unstable_wasm/www (#1648 ) Bumps [cookie](https://github.com/jshttp/cookie) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together. Updates `cookie` from 0.6.0 to 0.7.1 - [Release notes](https://github.com/jshttp/cookie/releases) - [Commits](https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.1) Updates `express` from 4.21.0 to 4.21.1 - [Release notes](https://github.com/expressjs/express/releases) - [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md) - [Commits](https://github.com/expressjs/express/compare/4.21.0...4.21.1) --- updated-dependencies: - dependency-name: cookie dependency-type: indirect - dependency-name: express dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-10 15:30:24 +02:00
Arthur Zucker	51826532d4	push new dev version	2024-10-10 12:00:16 +02:00
Hamir Mahal	557fde76d8	style: simplify string formatting for readability (#1632 )	2024-10-04 13:11:50 +02:00
dependabot[bot]	294ab86fe0	Bump webpack in /tokenizers/examples/unstable_wasm/www (#1641 ) Bumps [webpack](https://github.com/webpack/webpack) from 5.76.0 to 5.95.0. - [Release notes](https://github.com/webpack/webpack/releases) - [Commits](https://github.com/webpack/webpack/compare/v5.76.0...v5.95.0) --- updated-dependencies: - dependency-name: webpack dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-01 14:17:23 +02:00
dependabot[bot]	2204066e78	Bump body-parser and express in /tokenizers/examples/unstable_wasm/www (#1629 ) Bumps [body-parser](https://github.com/expressjs/body-parser) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together. Updates `body-parser` from 1.20.0 to 1.20.3 - [Release notes](https://github.com/expressjs/body-parser/releases) - [Changelog](https://github.com/expressjs/body-parser/blob/master/HISTORY.md) - [Commits](https://github.com/expressjs/body-parser/compare/1.20.0...1.20.3) Updates `express` from 4.18.1 to 4.21.0 - [Release notes](https://github.com/expressjs/express/releases) - [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md) - [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0) --- updated-dependencies: - dependency-name: body-parser dependency-type: indirect - dependency-name: express dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-01 14:16:41 +02:00
Arthur	3fb1371c1c	[`ignore_merges`] Fix offsets (#1640 ) * Fix the default offset create * update the tests * clippy	2024-10-01 09:22:20 +02:00
Arthur Zucker	81c471cf17	update dev version 0.20.0	2024-08-08 18:11:50 +02:00
Nicolas Patry	bfd9cdeefb	Perf improvement 16% by removing offsets. (#1587 ) * [Breaking Change] Perf improvement 16% by removing offsets. Offsets calculation are always calculated in Python land. By changing it to not being calculated, we win 16% of the runtime. This is not the total extent of it because offsets are still calculated in bytes. * Required features. * Remove clippy error. * Make it non breaking and still show perf improvement. * Even faster without offsets. * Update doc. * Fmt. * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fmt. --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>	2024-08-08 14:56:13 +02:00
Arthur	bd27fa56d6	add deserialize for pre tokenizers (#1603 ) * add deserialize * copy from the decoder * fmt * clippy * fix rust tests * fmt * don't change the test	2024-08-08 08:38:09 +02:00
Nicolas Patry	56c9c70440	Tests + Deserialization improvement for normalizers. (#1604 )	2024-08-08 08:38:02 +02:00
Arthur	bded212356	Support `None` to reset pre_tokenizers and normalizers, and index sequences (#1590 ) * initial commit * support None * fix clippy * cleanup * clean? * propagate to pre_tokenizer * fix test * fix rust tests * fix node * propagate to decoder and post processor * fix calls * lint * fmt * node be happy I am fixing you * add a small test * styling * style merge * fix merge test * fmt * nits * update tset	2024-08-07 12:52:35 +02:00
Nicolas Patry	6a5fce9fa0	Merges cannot handle tokens containing spaces. (#909 ) * Merges cannot handle tokens containing spaces. This fixes this while keeping backward support. We don't want to merge that blindly. * Update the tests. * Fixing clippy. * Add a test with spaces in the token/merge.	2024-08-07 12:34:53 +02:00
Nicolas Patry	7a30bca2f3	Updating error messages. (#1599 )	2024-08-06 16:42:56 +02:00
Arthur	8f2cc90249	Add test normalizers (#1600 ) * update * update test they passs * fmt	2024-08-06 16:08:18 +02:00
Nicolas Patry	fe41687ca8	Better serialization error (#1595 ) * Updating the deserialization error for models. * Update tokenizers/src/models/mod.rs Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>	2024-08-06 13:39:11 +02:00
Nicolas Patry	2d27761f60	Adding a few tests for decoder deserialization.	2024-08-06 13:36:36 +02:00
Arthur	adc82cb49a	Add-legacy-tests (#1597 ) * add tests * decoder as well * check error * propagate * lint * rafiune the test * lint * revert decoder changes * on more? * fmt * Update tokenizers/src/pre_tokenizers/mod.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * fix commit * simplify err * fmt --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-06 13:08:12 +02:00
Nicolas Patry	99a48dcb46	Clippy.	2024-08-06 10:48:39 +02:00
Nicolas Patry	5fb8a2320c	Legacy test.	2024-08-06 10:48:39 +02:00
Nicolas Patry	388014fd6b	Adding some serialization testing around the wrapper.	2024-08-06 10:48:39 +02:00
Nicolas Patry	7415e28536	Enabling the option to use fancy_regex instead of `onig`.	2024-08-01 15:53:17 +02:00
Nicolas Patry	9e0c791f2b	Small performance fixup (negligible but obviously better).	2024-08-01 15:52:39 +02:00
Mike	aface7a968	dump spm_precompiled to 0.1.3 (#1571 )	2024-07-31 15:38:04 +02:00
Nicolas Patry	a3ad85b3e8	Fix clippy + feature test management. (#1580 ) * Fix clippy + feature test management. * That example was local oops. * CLippy fix. * Readme indentation. * README update.	2024-07-26 12:16:30 +02:00
Arthur	4ea2f235b0	Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555 ) * feature dependent test * nit about 嗎 * update * actuallyfix it * update the test add it fix * stub * Update tokenizers/src/pre_tokenizers/byte_level.rs Co-authored-by: Luc Georges <McPatate@users.noreply.github.com> * skip failing test * add normalizer to init --------- Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>	2024-07-15 12:12:03 +02:00
Arthur	f2a44dc5d1	Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … (#1569 ) * Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)" This reverts commit `25aee8b88c`. * don't remove audit * deprecate id_to_token * use simple id to token * don't break id_to_token since we are deprecating anyways?	2024-07-12 07:29:40 +02:00
Marco	fdd26ba9a3	Enable `dropout = 0.0` as an equivalent to `none` in BPE (#1550 ) * enable dropout = 0.0 * typo * lint * formatter * enable dropout = 0.0 * formatter	2024-06-24 12:36:11 +02:00
Arthur	9441f7e8f7	make sure we don't warn on empty tokens (#1554 ) * make sure we don't warn on empty tokens * Testing the log is actually hard 😓 * mpty	2024-06-20 14:33:21 +02:00
Arthur Zucker	3e736bbccb	Fix clippy	2024-06-20 09:39:19 +02:00
Nicolas Patry	8d28dbefd1	Fixing for clippy 1.78 (#1548 )	2024-06-06 13:18:59 +02:00
nathaniel-daniel	bfefcf676d	Make USED_PARALLELISM atomic (#1532 )	2024-06-06 13:02:26 +02:00
Nicolas Patry	25aee8b88c	[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513 ) * [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder Causes issues with `ByteLevel` messing up some `AddedTokens` with some utf-8 range used in the bytelevel mapping. This commit tests the extend of the damage of ignoring the decoder for those tokens. * Format. * Installing cargo audit. * Minor fix. * Fixing "bug" in node/python. * Autoformat. * Clippy. * Only prefix space when there's no decoder.	2024-05-06 11:49:38 +02:00
Arthur Zucker	71c2a8d01a	update dev version so 0.19.1	2024-04-17 23:17:12 +02:00
Arthur	7733bc25d6	add serialization for `ignore_merges` (#1504 ) * add serialization for `ignore_merges` * add serialization tests * deserialize without `ignore_merges`	2024-04-17 21:56:48 +02:00
Nicolas Patry	949d9e3e0e	Bumping all versions 3 times (ty transformers :) ) (#1498 )	2024-04-16 15:58:36 +02:00
Nicolas Patry	d5a8cc7a49	PyO3 0.21. (#1494 ) * PyO3 0.21. * Upgraded everything. * Rustfmt.	2024-04-16 13:49:52 +02:00
Arthur	914576f7ed	Add more support for tiktoken based tokenizers (#1493 ) * first commit * update * clippy * lint * clippy and lint * fmt * revert print * 😈 * style * add a test * more fmt * Use ignore_merges * stub * fix * update * Update tokenizers/src/models/bpe/model.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * update * rust lint * dob; t repeat yourself --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-04-15 17:26:36 +02:00
Arthur Zucker	6e58f838b3	version = "0.16.0-dev.0"	2024-04-02 09:51:14 +02:00
Arthur	09069717e9	Refactor metaspace (#1476 ) * version = "0.15.3-dev-0” Improve performances of meta space, but also just fix it. (transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors ['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.'] ['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.'] [0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848] (transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors ['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.'] ['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.'] [0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416] * well what do we have * nit * be BC with non legacy * unrelated change for clippy * fix test * splitting is a must for word_ids * fmt and lint * Fixing everything (hopefully better). * Fixing node. * Including yarn.lock * Lint. * Stubs. * revert to use split * fix merge issues * fix tests * finish fixing tests * ruff --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-03-30 10:27:24 +01:00
Anthony Platanios	6153126b22	Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. (#1443 ) * Fixes. * Fixes.	2024-03-30 00:29:54 +01:00
Bryant Biggs	72a1973cd1	chore: Remove CLI - this was originally intended for local development (#1442 )	2024-02-13 04:05:43 +01:00
Arthur Zucker	7f49f20ab0	version = "0.15.3-dev-0”	2024-02-12 09:48:00 +09:00
Rasmus Larsen	c893204c45	Efficient Replace normalizer (#1413 ) * new Replace work * clean up * clean up * typo * cargo fmt * Clippy. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-02-06 14:36:44 +01:00
Stephen Roller	4a8105c366	Convert word counts to u64 (#1433 ) * Convert word counts to u64 * More spots needed to compile	2024-02-06 03:39:12 +01:00
Bryant Biggs	67fe59c88d	chore: Update dependencies to latest supported versions (#1441 )	2024-01-22 17:54:37 +01:00

804 Commits