* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into a single string
- Strip will remove n characters from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing (see the sketch below).
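A minimal sketch of how these compose through the Python bindings. The `▁` Metaspace marker and the exact `Strip` keyword arguments are assumptions here, not taken from the lines above:

```python
from tokenizers import decoders

# Replace the Metaspace marker with spaces, fuse all tokens into one
# string, then strip one leading character.
decoder = decoders.Sequence(
    [
        decoders.Replace("▁", " "),
        decoders.Fuse(),
        decoders.Strip(content=" ", left=1, right=0),
    ]
)

print(decoder.decode(["▁Hello", "▁world"]))  # "Hello world"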
- Note: Added a new dependency for better parsing of decoders.
This is because untagged enums can match anything; the `MustBe` marker
ensures there's no confusion between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.
* Fixing pickling/unpickling (using default args).
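As a quick illustration of the intent, assuming the new decoders become picklable once default args are wired up (a sketch, not the actual test):

```python
import pickle

from tokenizers import decoders

# Round-trip through pickle; this is what broke without default args.
dec = decoders.Strip(content=" ", left=1, right=0)
restored = pickle.loads(pickle.dumps(dec))
```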
* Stub.
* Black.
* Fixing node.
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using byte tokens like `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into a string whenever possible, showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`);
see the sketch after this list.
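A short sketch of both pieces together; the parameter name `byte_fallback` is from the description above, the rest is assumed for illustration:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE

# `byte_fallback=True` makes BPE emit byte tokens such as <0x61>
# instead of the unk token (the <0xXX> tokens must exist in the vocab).
# `vocab` and `merges` here are hypothetical placeholders:
# tokenizer = Tokenizer(BPE(vocab, merges, byte_fallback=True))

decoder = decoders.ByteFallback()
print(decoder.decode(["<0x61>"]))  # "a" (0x61 is valid UTF-8)
print(decoder.decode(["<0xE2>"]))  # "�" (a lone 0xE2 is not valid UTF-8)
```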
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order matters because of untagged enums.
* Remove prints.
* Fixing byte fallback for bytes < 16.
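Presumably the issue was formatting bytes below 16 without a leading zero; byte tokens need two hex digits. This interpretation is an assumption based on the `<0x61>` format above:

```python
# Bytes below 16 need a leading zero to match the <0xXX> token format.
b = 0x0A
bad = f"<0x{b:X}>"     # "<0xA>"  -> would never match the vocab entry
good = f"<0x{b:02X}>"  # "<0x0A>"
```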
* [fix] Use unk_token.
In SentencePieceBPETokenizer, when vocab or merges is None, unk_token cannot be used.
* [fix] Also handle the case where unk_token is None (sketch below).
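A hedged sketch of the resulting guard in `SentencePieceBPETokenizer.__init__` (the helper name and exact arguments are assumptions):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

def build(vocab=None, merges=None, unk_token="<unk>"):
    # Only pass vocab/merges when both are provided; otherwise build an
    # empty BPE that still knows its unk_token.
    if vocab is not None and merges is not None:
        tokenizer = Tokenizer(BPE(vocab, merges, unk_token=str(unk_token)))
    else:
        tokenizer = Tokenizer(BPE(unk_token=str(unk_token)))

    # Only register unk_token as special if it exists in the vocab,
    # which also covers the unk_token=None case.
    if tokenizer.token_to_id(str(unk_token)) is not None:
        tokenizer.add_special_tokens([str(unk_token)])
    return tokenizer
```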
* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* [FIX] In CharBPETokenizer, use unk_token.
In CharBPETokenizer, when vocab or merges is None, unk_token cannot be used.
* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Fixing conda build?
* Reduce the scope to speedup testing.
* Reduce more.
* Trying to link to conda lib.
* Trying to enable `pkg-config` in the conda env.
* Really publish.
* Update conda builds.
* Remove 3.11
* Putting releases back onto release track.
* Include license file in Rust crate
* Ignore security warning.
* Also for python.
* Upgrading Ubuntu version.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Adding rust audit.
* Update clap version + derive_builder (they clashed).
* Ignoring a specific CVE that can safely be ignored:
https://github.com/Azure/iot-identity-service/issues/481
* Updating python lock.
* Revert `derive-builder` update.
* Adding back help msg.
* New version.
The actual release will happen *before* the PyO3 0.17.2 upgrade because
the tests were run before it.
* Manylinux2014 necessary now with Rust 1.64.
* Fixing roberta type ids (everything is zero).
* We need to fix type_ids for all sequences even when not changing anything else.
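Roughly the expectation being restored; the model name is used purely as an example:

```python
from tokenizers import Tokenizer

# RoBERTa does not use segment embeddings: every position should get
# type_id 0, even for the second sequence of a pair.
tok = Tokenizer.from_pretrained("roberta-base")
enc = tok.encode("first sequence", "second sequence")
assert all(t == 0 for t in enc.type_ids)
```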
* Fixing tests, hopefully better.
* Removing dead file.
* Checking that we can distribute with static Python embedding for manylinux.
* Manylinux embedded interpreter.
* Building wheels manylinux with static embedding
* Better script.
* Typo.
* Using a dummy feature?
* Default features?
* Back into order.
* Fixing manylinux??
* Local dir.
* Missing star.
* Makedir?
* Monkey coding this.
* Extension module?
* Building with default features `RustExtension`.
* bdist_wheel + RustExtension, any better?
* Update rust-py version.
* Forcing extension module.
* No default features.
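For context, the wiring under test is roughly the setuptools-rust setup below. This is a sketch: the extension target name and the feature flag are assumptions, not read from the commits above:

```python
from setuptools import setup
from setuptools_rust import Binding, RustExtension

setup(
    name="tokenizers",
    # Build the PyO3 extension; pyo3/extension-module avoids linking
    # against libpython, which matters for manylinux wheels.
    rust_extensions=[
        RustExtension(
            "tokenizers.tokenizers",
            binding=Binding.PyO3,
            features=["pyo3/extension-module"],
        )
    ],
    zip_safe=False,
)
```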
* Remove py37 out of spite
* Revert "Remove py37 out of spite"
This reverts commit 6ab7facd792b59c2e30be82fe42816d24c32cf0d.
* Really extraneous feature.
* Fix build wheels.
* Putting things back in place.