tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-22 16:25:30 +00:00

Author	SHA1	Message	Date
Nicolas Patry	6a5fce9fa0	Merges cannot handle tokens containing spaces. (#909 ) * Merges cannot handle tokens containing spaces. This fixes this while keeping backward support. We don't want to merge that blindly. * Update the tests. * Fixing clippy. * Add a test with spaces in the token/merge.	2024-08-07 12:34:53 +02:00
Nicolas Patry	ab9c7ded8b	Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588 ) * Using serde (serde_pyo3) to get __str__ and __repr__ easily. * Putting it within tokenizers, it needs to be too specific. * Clippy is our friend. * Ruff. * Update the tests. * Pretty sure this is wrong (#1589) * Adding support for ellipsis. * Fmt. * Ruff. * Fixing tokenizer. --------- Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>	2024-08-07 12:08:29 +02:00
Nicolas Patry	7a30bca2f3	Updating error messages. (#1599 )	2024-08-06 16:42:56 +02:00
Arthur	8f2cc90249	Add test normalizers (#1600 ) * update * update test they passs * fmt	2024-08-06 16:08:18 +02:00
Nicolas Patry	fe41687ca8	Better serialization error (#1595 ) * Updating the deserialization error for models. * Update tokenizers/src/models/mod.rs Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>	2024-08-06 13:39:11 +02:00
Nicolas Patry	2d27761f60	Adding a few tests for decoder deserialization.	2024-08-06 13:36:36 +02:00
Arthur	adc82cb49a	Add-legacy-tests (#1597 ) * add tests * decoder as well * check error * propagate * lint * rafiune the test * lint * revert decoder changes * on more? * fmt * Update tokenizers/src/pre_tokenizers/mod.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * fix commit * simplify err * fmt --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-06 13:08:12 +02:00
Nicolas Patry	99a48dcb46	Clippy.	2024-08-06 10:48:39 +02:00
Nicolas Patry	5fb8a2320c	Legacy test.	2024-08-06 10:48:39 +02:00
Nicolas Patry	388014fd6b	Adding some serialization testing around the wrapper.	2024-08-06 10:48:39 +02:00
Nicolas Patry	7b80359dd2	Fixing release CI strict (taken from safetensors).	2024-08-06 09:11:30 +02:00
Nicolas Patry	a010f6b75c	Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily." This reverts commit `86138337fc`.	2024-08-02 18:42:57 +02:00
Nicolas Patry	86138337fc	Using serde (serde_pyo3) to get __str__ and __repr__ easily.	2024-08-02 18:41:54 +02:00
Nicolas Patry	7415e28536	Enabling the option to use fancy_regex instead of `onig`.	2024-08-01 15:53:17 +02:00
Nicolas Patry	9e0c791f2b	Small performance fixup (negligible but obviously better).	2024-08-01 15:52:39 +02:00
Nicolas Patry	1df498a186	Fixing benchmark2.	2024-08-01 15:52:39 +02:00
Nicolas Patry	c6f2c0b057	Fixing the benchmark. (#1583 )	2024-08-01 10:36:53 +02:00
Nicolas Patry	35f338a7b8	Add benchmark vs tiktoken (#1582 ) * Adding a simple tiktoken benchmark. * Adding 1 large fused document case.	2024-07-31 17:09:23 +02:00
Mike	aface7a968	dump spm_precompiled to 0.1.3 (#1571 )	2024-07-31 15:38:04 +02:00
Nicolas Patry	a3ad85b3e8	Fix clippy + feature test management. (#1580 ) * Fix clippy + feature test management. * That example was local oops. * CLippy fix. * Readme indentation. * README update.	2024-07-26 12:16:30 +02:00
Arthur	4ea2f235b0	Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555 ) * feature dependent test * nit about 嗎 * update * actuallyfix it * update the test add it fix * stub * Update tokenizers/src/pre_tokenizers/byte_level.rs Co-authored-by: Luc Georges <McPatate@users.noreply.github.com> * skip failing test * add normalizer to init --------- Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>	2024-07-15 12:12:03 +02:00
Arthur	f2a44dc5d1	Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … (#1569 ) * Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)" This reverts commit `25aee8b88c`. * don't remove audit * deprecate id_to_token * use simple id to token * don't break id_to_token since we are deprecating anyways?	2024-07-12 07:29:40 +02:00
Marco	fdd26ba9a3	Enable `dropout = 0.0` as an equivalent to `none` in BPE (#1550 ) * enable dropout = 0.0 * typo * lint * formatter * enable dropout = 0.0 * formatter	2024-06-24 12:36:11 +02:00
Arthur	9441f7e8f7	make sure we don't warn on empty tokens (#1554 ) * make sure we don't warn on empty tokens * Testing the log is actually hard 😓 * mpty	2024-06-20 14:33:21 +02:00
Arthur Zucker	3e736bbccb	Fix clippy	2024-06-20 09:39:19 +02:00
Nathan	1ff56c0c70	Fix 'dictionnary' typo (#1511 )	2024-06-11 15:43:47 +02:00
Lucain	88f51fe7d2	Switch from cached_download to hf_hub_download in tests (#1547 )	2024-06-11 15:26:58 +02:00
Luc Georges	418c35c09e	feat(ci): add trufflehog secrets detection (#1551 ) * feat(ci): add trufflehog secrets detection * fix(ci): remove unnecessary permissions	2024-06-10 16:10:23 +02:00
Nicolas Patry	8d28dbefd1	Fixing for clippy 1.78 (#1548 )	2024-06-06 13:18:59 +02:00
nathaniel-daniel	bfefcf676d	Make USED_PARALLELISM atomic (#1532 )	2024-06-06 13:02:26 +02:00
Nicolas Patry	25aee8b88c	[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513 ) * [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder Causes issues with `ByteLevel` messing up some `AddedTokens` with some utf-8 range used in the bytelevel mapping. This commit tests the extend of the damage of ignoring the decoder for those tokens. * Format. * Installing cargo audit. * Minor fix. * Fixing "bug" in node/python. * Autoformat. * Clippy. * Only prefix space when there's no decoder.	2024-05-06 11:49:38 +02:00
Arthur	f2ec3b239b	remove enforcement of non special when adding tokens (#1521 ) * remove enforcement of non special when adding tokens * mut no longer needed * add a small test * nit * style * audit * ignore cargo audit's own vulnerability * update * revert * remove CVE	2024-04-30 15:53:47 +02:00
Arthur Zucker	71c2a8d01a	update dev version so 0.19.1	2024-04-17 23:17:12 +02:00
Arthur	7733bc25d6	add serialization for `ignore_merges` (#1504 ) * add serialization for `ignore_merges` * add serialization tests * deserialize without `ignore_merges`	2024-04-17 21:56:48 +02:00
Nicolas Patry	91393ef75e	Fixing doc. (#1499 ) * Fixing doc. * SentencePieceUnigram and Convert.py still used sentencepiece * stub --------- Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>	2024-04-17 09:32:40 +02:00
Nicolas Patry	949d9e3e0e	Bumping all versions 3 times (ty transformers :) ) (#1498 )	2024-04-16 15:58:36 +02:00
Nicolas Patry	e0defa7355	Remove 3.13 (potential undefined behavior.) (#1497 )	2024-04-16 15:56:47 +02:00
Nicolas Patry	d5a8cc7a49	PyO3 0.21. (#1494 ) * PyO3 0.21. * Upgraded everything. * Rustfmt.	2024-04-16 13:49:52 +02:00
Arthur	914576f7ed	Add more support for tiktoken based tokenizers (#1493 ) * first commit * update * clippy * lint * clippy and lint * fmt * revert print * 😈 * style * add a test * more fmt * Use ignore_merges * stub * fix * update * Update tokenizers/src/models/bpe/model.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * update * rust lint * dob; t repeat yourself --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-04-15 17:26:36 +02:00
Arthur Zucker	6e58f838b3	version = "0.16.0-dev.0"	2024-04-02 09:51:14 +02:00
Arthur	09069717e9	Refactor metaspace (#1476 ) * version = "0.15.3-dev-0” Improve performances of meta space, but also just fix it. (transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors ['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.'] ['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.'] [0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848] (transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors ['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.'] ['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.'] [0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416] * well what do we have * nit * be BC with non legacy * unrelated change for clippy * fix test * splitting is a must for word_ids * fmt and lint * Fixing everything (hopefully better). * Fixing node. * Including yarn.lock * Lint. * Stubs. * revert to use split * fix merge issues * fix tests * finish fixing tests * ruff --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-03-30 10:27:24 +01:00
Anthony Platanios	6153126b22	Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. (#1443 ) * Fixes. * Fixes.	2024-03-30 00:29:54 +01:00
dependabot[bot]	d8c4388166	Bump ip from 2.0.0 to 2.0.1 in /bindings/node (#1456 ) Bumps [ip](https://github.com/indutny/node-ip) from 2.0.0 to 2.0.1. - [Commits](https://github.com/indutny/node-ip/compare/v2.0.0...v2.0.1) --- updated-dependencies: - dependency-name: ip dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-03-25 11:29:36 +01:00
Arthur	29fef1e7aa	[`remove black`] And use ruff (#1436 ) * nits * Fixing deps. * Ruff update. * Import order matters. * Fix. * Revert ruff fix. * Visualizer. * Putting back the imports. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-03-12 11:24:21 +01:00
Bryant Biggs	72a1973cd1	chore: Remove CLI - this was originally intended for local development (#1442 )	2024-02-13 04:05:43 +01:00
Arthur Zucker	7f49f20ab0	version = "0.15.3-dev-0”	2024-02-12 09:48:00 +09:00
Rasmus Larsen	c893204c45	Efficient Replace normalizer (#1413 ) * new Replace work * clean up * clean up * typo * cargo fmt * Clippy. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-02-06 14:36:44 +01:00
Stephen Roller	4a8105c366	Convert word counts to u64 (#1433 ) * Convert word counts to u64 * More spots needed to compile	2024-02-06 03:39:12 +01:00
Bryant Biggs	67fe59c88d	chore: Update dependencies to latest supported versions (#1441 )	2024-01-22 17:54:37 +01:00
Arthur Zucker	8f73fe9515	update dev version to 0.15.2-dev.0	2024-01-22 15:34:57 +01:00

1853 Commits