* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
* version = "0.15.3-dev-0”
Improve performance of Metaspace, but also just fix it.
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
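For reference, a minimal sketch of timing the `Metaspace` pre-tokenizer alone, assuming the post-refactor Python binding that exposes `prepend_scheme` and `split` (the original `../scripts/gemma-dummy.py` is not part of this log):

```python
# Minimal timing sketch for Metaspace pre-tokenization; the `prepend_scheme`
# and `split` keyword arguments are assumed from the refactor described above.
import time
from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace(replacement="▁", prepend_scheme="first", split=True)
text = "inform. Hey.       ." * 10_000  # synthetic input, heavy on spaces

start = time.perf_counter()
pre.pre_tokenize_str(text)  # returns a list of (piece, (start, end)) pairs
print(time.perf_counter() - start)
```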
* well what do we have
* nit
* be BC with non-legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Move to maturin, mimicking the move for `safetensors`.
* Tmp.
* Fix sdist.
* Wat?
* Clippy 1.72
* Remove if.
* Conda sed.
* Fix doc check workflow.
* Moving to maturin AND removing the http + openssl mess (smoothing the transition to `huggingface_hub`)
* Fix dep
* Black.
* New node bindings.
* Fix docs + node cache ?
* Yarn.
* Working dir.
* Extension module.
* Put back interpreter.
* Remove cache.
* New attempt
* Multi python.
* Remove FromPretrained.
* Remove traces of `fromPretrained`.
* Drop 3.12 for windows?
* Typo.
* Put back the default feature for ignoring links during simple test.
* Fix ?
* x86_64 -> x64.
* Remove warning for windows bindings.
* Exclude aarch.
* Include/exclude.
* Put back workflows in correct states.
* CD backports
Following huggingface/safetensors#317
* fix node bindings?
`cargo check` doesn't work in my local configuration from `tokenizers/bindings/node/native`.
I don't think it will be a problem, but I have difficulty telling.
* backport #315
* safetensors#317 backports
* Makes `decode` and `decode_batch` work on borrowed content.
* Make `decode_batch` work with borrowed content.
* Fix lint.
* Attempt to map it into Node.
* Second attempt.
* Step by step.
* One more step.
* Fix lint.
* Please ...
* Removing collect.
* Revert "Removing collect."
This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
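From the Python side this change is invisible; for context, a hedged sketch of the two calls in question (the tokenizer file path is hypothetical):

```python
# Hedged usage sketch of `decode` / `decode_batch`; the borrowed-content
# change lives in the Rust core and leaves this Python-facing API unchanged.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical local file
print(tok.decode([1, 2, 3], skip_special_tokens=True))
print(tok.decode_batch([[1, 2, 3], [4, 5]], skip_special_tokens=True))
```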
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* implement a simple max_sentencepiece_length into BPE
Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.
If this code were to be actually integrated, some work remains to be done:
- Documentation describing the behavior and how it should be set.
- Keep the default == 0 so it doesn't act unless set.
- Provide ways in the Python bindings for the user to set the max token length.
I was trying to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it is very difficult, and regexes can be really slow when operating on the whole training corpus. (A sketch of the resulting option, via the Python bindings, follows this series of commits.)
* utilize Option<u16> for safer code.
* Other version.
* Update trainer.rs
clarify with type usize; propagate the max_length option
* change max_length into a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
* change variable name in trainer.rs
change max_merge_length into max_token_length
* Update trainer.rs
Add several max_token_length declarations that were missing in impl BpeTrainerBuilder and struct BpeTrainer.
Add an explanation for variable shadowing.
* Update trainer.rs
Move the default definition of max_token_length to the proper location. Adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
clarified test documentation
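As referenced above, a minimal sketch of the resulting option through the Python bindings, assuming `max_token_length` is exposed as a `BpeTrainer` keyword argument (the corpus path is hypothetical):

```python
# Hedged sketch: BPE training with the new max_token_length option, which
# skips merges that would produce tokens longer than the given length.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    special_tokens=["[UNK]"],
    max_token_length=16,  # mirrors SPM's default maximum piece length
)
tokenizer.train(["data/corpus.txt"], trainer)  # hypothetical corpus path
```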
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want for the `Metaspace` thing.
- Note: added a new dependency for better parsing of decoders. This is due to untagged enums, which can match anything; the `MustBe` ensures there's no issue between Fuse and ByteFallback. Since both are new, the chances of backward incompatibility are low.
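A hedged sketch of that sequence via the Python bindings; the `Strip` keyword names (`content`/`left`/`right`) are assumptions, and `▁` is used as the actual Metaspace replacement character:

```python
# Hedged sketch of Sequence(Replace(...), Fuse(), Strip(...)) from Python;
# the Strip keyword names (content/left/right) are assumptions.
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.Replace("▁", " "),                    # Metaspace char back to spaces
    decoders.Fuse(),                               # concatenate all tokens into 1 string
    decoders.Strip(content=" ", left=1, right=0),  # drop the single leading space
])
print(decoder.decode(["▁Hey", "▁there"]))  # expected: "Hey there"
```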
* Fixing pickling/unpickling (using default args).
* Stub.
* Black.
* Fixing node.
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything back into a string whenever possible, showing � when the byte decoding fails (behavior checked against LlamaTokenizer in `transformers`).
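A hedged illustration of the two pieces together, with a toy vocab (the byte-token entries and ids are made up for the example):

```python
# Hedged sketch: byte_fallback on the BPE model plus the ByteFallback decoder;
# the toy vocab and ids below are made up for the example.
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE

vocab = {"<unk>": 0, "<0x61>": 1, "<0x62>": 2}  # byte tokens for 'a' and 'b'
tokenizer = Tokenizer(BPE(vocab=vocab, merges=[], unk_token="<unk>", byte_fallback=True))
tokenizer.decoder = decoders.ByteFallback()

enc = tokenizer.encode("ab")      # falls back to <0x61>, <0x62> instead of <unk>
print(tokenizer.decode(enc.ids))  # bytes reassembled into "ab"; invalid UTF-8 shows �
```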
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
* Adding rust audit.
* Update clap version + derive_builder (they clashed).
* Ignoring a specific CVE which can safely be ignored
https://github.com/Azure/iot-identity-service/issues/481
* Updating python lock.
* Revert `derive-builder` update.
* Adding back help msg.
* New version.
The actual release will happen *before* PyO3 0.17.2 because the tests were run before then.
* Manylinux2014 necessary now with Rust 1.64.
* Update README.md
Add reference to normalizer blog post
* Update lib.rs
* Fixing PR + clippy on node.
* Update readme to match docstring.
* Other clippy warning.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>