* [Breaking Change] 16% perf improvement by removing offsets.
Offsets are always calculated in Python land.
By skipping that calculation, we win 16% of the runtime.
This is not the full extent of the gain, because offsets are
still calculated in bytes.
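A minimal sketch of the overhead in question (the helper and the offsets below are illustrative, not the library's code): converting byte offsets back to character offsets is per-token Python work, which is the kind of cost avoided by leaving offsets in bytes.

```python
text = "héllo wörld"

def byte_to_char(text, byte_idx):
    # count how many characters fit in the first byte_idx UTF-8 bytes
    return len(text.encode("utf-8")[:byte_idx].decode("utf-8", errors="ignore"))

# offsets in bytes for "héllo" and "wörld" ("é" and "ö" are 2 bytes each)
byte_offsets = [(0, 6), (7, 13)]

# the per-token conversion that costs Python runtime
char_offsets = [(byte_to_char(text, s), byte_to_char(text, e))
                for s, e in byte_offsets]
```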
* Required features.
* Remove clippy error.
* Make it non breaking and still show perf improvement.
* Even faster without offsets.
* Update doc.
* Fmt.
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fmt.
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
* Merges cannot handle tokens containing spaces.
This fixes that while keeping backward compatibility.
We don't want to merge this blindly.
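A sketch of the failure mode (the line format here is illustrative): merge rules are conventionally stored as a space-separated pair per line, so a token that itself contains a space makes a blind split ambiguous; representing merges as explicit pairs keeps them unambiguous.

```python
# A merges line is conventionally "left right", split on whitespace.
line = "hello world !"  # ("hello world", "!") or ("hello", "world !")?
blind = line.split(" ")  # ['hello', 'world', '!'] -- three parts, ambiguous

# Storing merges as explicit pairs keeps space-containing tokens intact.
merges = [("hello world", "!")]
left, right = merges[0]
merged = left + right
```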
* Update the tests.
* Fixing clippy.
* Add a test with spaces in the token/merge.
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
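The idea of deriving `__str__`/`__repr__` from serialization (as serde_pyo3 does for the Rust bindings) can be sketched in plain Python; the class below is hypothetical, standing in for a pyo3-wrapped struct.

```python
import json

class Normalizer:
    # hypothetical object standing in for a pyo3-wrapped struct
    def __init__(self, lowercase=True, strip_accents=None):
        self.lowercase = lowercase
        self.strip_accents = strip_accents

    def __repr__(self):
        # derive the repr from the serialized state, so it stays in
        # sync with the fields instead of being hand-written
        return f"{type(self).__name__}({json.dumps(self.__dict__)})"

print(repr(Normalizer()))
```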
* Putting it within tokenizers, as it needs to be too specific to live elsewhere.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"
This reverts commit 25aee8b88c.
* don't remove audit
* deprecate id_to_token
* use simple id to token
* don't break id_to_token since we are deprecating anyway?
* [BREAKING CHANGE] Ignore added_tokens (both special and not) in the
decoder
`ByteLevel` was causing issues by messing up some `AddedTokens` whose
characters fall in the utf-8 range used in the byte-level mapping.
This commit tests the extent of the damage of ignoring the decoder for
those tokens.
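A sketch of the failure mode being tested, using a reimplementation of the well-known GPT-2 byte-to-unicode table for illustration (not the library's decoder): an added token whose characters also appear in the byte-level alphabet gets re-interpreted as raw bytes and corrupted when routed through the byte-level decoder.

```python
def bytes_to_unicode():
    # GPT-2 style mapping from bytes to printable unicode characters
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))
          + list(range(0xAE, 0xFF + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_decoder = {v: k for k, v in bytes_to_unicode().items()}

# an added token; 'é' (0xE9) also lives in the byte-level alphabet,
# so the decoder re-reads it as a lone raw byte -- invalid UTF-8
token = "é"
decoded = bytes([byte_decoder[c] for c in token]).decode(
    "utf-8", errors="replace")
```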
* Format.
* Installing cargo audit.
* Minor fix.
* Fixing "bug" in node/python.
* Autoformat.
* Clippy.
* Only prefix space when there's no decoder.
* remove enforcement of non special when adding tokens
* mut no longer needed
* add a small test
* nit
* style
* audit
* ignore cargo audit's own vulnerability
* update
* revert
* remove CVE