Commit Graph

1672 Commits

SHA1 Message Date
348ed70e58 [doc build] Use secrets (#1273) 2023-06-09 12:58:27 +02:00
5d70f15bfb Update README.md - Broken link (#1272)
* Update README.md - Broken link

fixed "python documentation" link

* Update README.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-08 10:20:11 +02:00
f85e8467e4 Update Cargo.toml (#1266)
`cargo update` yields the following:
Updating regex v1.8.1 -> v1.8.3
Updating regex-syntax v0.7.1 -> v0.7.2
2023-06-07 09:57:18 +02:00
cb8d4de599 fix documentation regarding regex (#1264)
* fix documentation regarding regex

Split() in pre_tokenizers.rs and the normalizers take a regex that must be built with the tokenizer-specific regex module.
Clarify this in the documentation (a short example follows this commit's change list).

* Update __init__.pyi

fixed __init__.pyi

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 6e8bdfcddf67bcdd8e3b1a78685fd5ef8f6a153c.

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 897b0c0de471ad7cb6269b8456347c4e5cff2aaf.

* Revert "Update __init__.pyi"

This reverts commit fbe82310b7728ee7cdb6f8b38fbc2388f9d95771.

* add codeblocks the right way

* add codeblocks with stub.py

Ran `setup.py install` to build, and then ran `stub.py`.
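
To illustrate the requirement being documented, a minimal sketch against the Python binding (pattern and input are made up):

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# A plain str pattern is matched literally; an actual pattern must be
# wrapped in tokenizers.Regex -- a pattern compiled with Python's `re`
# module is not accepted.
splitter = Split(Regex(r"\s+"), behavior="removed")
print(splitter.pre_tokenize_str("hello   world"))
# [('hello', (0, 5)), ('world', (8, 13))]
```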

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-07 09:41:28 +02:00
c7102c4c0f Fixing broken link. (#1268) 2023-06-06 11:10:28 +02:00
cb819724ef Update trainer.rs (#1257)
Implement skipping of empty sentences, as sentencepiece does; refer to:
https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L373
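
A minimal sketch of the idea in Python (names are illustrative; the actual change is in the Rust unigram trainer):

```python
def load_sentences(corpus):
    # Skip empty sentences so they never reach frequency counting,
    # mirroring the linked sentencepiece check.
    return [(sentence, count) for sentence, count in corpus if sentence]
```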
2023-05-25 12:24:29 +02:00
fc76ad4f07 Parallelize unigram trainer (#976)
* Parallelize unigram trainer

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>

* Rm unused lifetime

---------

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
2023-05-22 15:36:03 +02:00
a03330607b Update all GH Actions with dependency on actions/checkout from v[1,2] to v3 to notably improve performance (retrieve only the commit being checked-out) (#1256) 2023-05-22 14:50:00 +02:00
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content (a usage sketch follows the change list).

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
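
The change itself is internal to the Rust crate and the Node binding (the title's "borrowed content"); for orientation, the user-facing pair looks like this through the Python binding (model name is only an example):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # example model
ids = tokenizer.encode("Hello, world!").ids
print(tokenizer.decode(ids))          # "hello, world!"
print(tokenizer.decode_batch([ids]))  # ["hello, world!"]
```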

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Documentation describing the behavior and how it should be set.
Set default == 0 so it doesn't act unless set.
Provide ways in the Python binding for the user to set the max token length (a usage sketch follows this change list).

I was trying to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

Clarify with type usize; propagate the max_length option.

* change max_length into a more descriptive name

In the documentation
https://huggingface.co/docs/tokenizers/api/trainers
UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

Add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer

Add an explanation for the variable shadowing.

* Update trainer.rs

Move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.
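
A usage sketch of the resulting option, assuming the Python binding created above exposes it as `max_token_length` (corpus and sizes are made up):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# max_token_length refuses merges that would produce tokens longer than
# the limit; 16 is SentencePiece's default for the analogous setting.
trainer = BpeTrainer(vocab_size=1000, max_token_length=16)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.train_from_iterator(["a tiny training corpus ..."], trainer=trainer)
assert all(len(token) <= 16 for token in tokenizer.get_vocab())
```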

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
daf3fcc976 Revert main hiccup. 2023-05-15 18:01:29 +02:00
b58227c7f1 Never gonna make you cry 2023-05-12 16:28:57 +02:00
02ad59edc1 Never gonna run around and desert you 2023-05-12 16:27:06 +02:00
8d07696c38 Never gonna let you down 2023-05-12 16:24:26 +02:00
4518b0f7f2 fix unigram.rs test_sample() (#1244)
87230bb59b/tokenizers/tests/unigram.rs (LL71C1-L71C53)

When running cargo test --release, the above line causes an error.

referring to 87230bb59b/tokenizers/src/models/unigram/lattice.rs (L138)

It seems that Lattice::from should take only 3 arguments.
If I had to guess, it should be Lattice::from("ABC", 0, 2).
This change makes `cargo test --release` pass without error.
2023-05-10 17:04:34 +02:00
87230bb59b use LTO for release and benchmark builds (#1157) 2023-05-09 16:15:57 +02:00
15085ef905 Fixing padding_left sequence_ids. (#1233) 2023-05-04 15:57:20 +02:00
ef5f50605d Printing warning to stderr. (#1222) 2023-04-19 14:55:24 +02:00
d19bc63c67 Merge pull request #1212 from huggingface/fix-node-release
Fix node release
2023-04-06 16:25:29 +02:00
a714aac6f6 revert changes 2023-04-06 14:07:46 +00:00
ceb73dbd29 publish npm 2023-04-06 13:35:29 +00:00
42b110587c Fix conda release (#1211)
* runs-on: ubuntu-latest

* update jobs

* remove uploads

* upload conda package, remove python-release CI

* revert changes

* revert whitespace removal
2023-04-06 12:30:14 +02:00
fbd8d6188e update for testing 2023-04-06 10:29:42 +00:00
37372b67fa Merge pull request #1207 from huggingface/v0.13.3
New release
2023-04-05 09:58:19 +02:00
ce244bd094 remove rc1 2023-04-04 16:19:42 +02:00
a05be6b8d1 Merge pull request #1205 from huggingface/new_version
New version 0.13.3
2023-04-04 15:03:38 +02:00
1cb44bd180 New version 0.13.3 2023-04-04 14:14:17 +02:00
3aaf4946b3 Add content to Strip decoder to allow decoding mid tokens. (#1199)
* Add `content` to Strip decoder to allow decoding mid tokens.

* Stub.

* Clippy.
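
With `content` exposed, Strip can trim an arbitrary character instead of only whitespace; a small sketch via the Python binding:

```python
from tokenizers import decoders

# Remove one leading "_" from each token; tokens are then joined as-is.
decoder = decoders.Strip(content="_", left=1, right=0)
print(decoder.decode(["_Hello", "_world"]))  # "Helloworld"
```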
2023-03-24 10:14:49 +01:00
8a6a8dc9d5 Fixing decoder strip because of char boundaries. (#1197) 2023-03-24 01:57:39 +01:00
e4aea890d5 Adding 2 new decoders: (#1196)
* Adding 2 new decoders:

- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right

Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing (a sketch follows the change list below).

- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no confusion between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.

* Fixing pickling/unpickling (using default args.).

* Stub.

* Black.

* Fixing node.
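
A sketch of the combination described above via the Python binding, using the actual Metaspace character "▁" rather than "_":

```python
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.Replace("▁", " "),  # undo the Metaspace replacement
    decoders.Fuse(),             # concatenate all tokens into one string
    decoders.Strip(content=" ", left=1, right=0),  # drop the leading space
])
print(decoder.decode(["▁Hello", "▁world"]))  # "Hello world"
```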
2023-03-24 00:50:54 +01:00
d2c8190a0f Creating normalizers.Prepend (To be used instead of Metaspace). (#1194)
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).

* Linting + stub.

* Fixing pickling/unpickling by setting a default.

* Black.
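
A minimal sketch of the new normalizer via the Python binding:

```python
from tokenizers import normalizers

# Prepend adds a fixed prefix during normalization, e.g. the "▁" that
# Metaspace would otherwise introduce.
normalizer = normalizers.Prepend("▁")
print(normalizer.normalize_str("Hello"))  # "▁Hello"
```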
2023-03-24 00:33:31 +01:00
250d46c676 Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). (#1195)
2023-03-23 23:43:47 +01:00
178e294a6a Merge pull request #1192 from huggingface/faster-datasets-train-example
Faster `datasets` train example
2023-03-23 17:19:05 +01:00
73637a0004 Adding ByteFallback support for tokenizers. (#1183)
* Adding ByteFallback support for `tokenizers`.

Two items added:

- A flag `byte_fallback` for the `BPE` model. This will be in charge
  of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
  back into a string whenever possible, showing � when the byte decoding
  fails (behavior checked against LlamaTokenizer in `transformers`).
  A decoder sketch follows the change list below.

* Update rustdoc.

* Clippy + Add BPE(byte_fallback) into bindings.

* Stupid file.

* Test artifacts removed.

* Update stub.

* Fix.

* Bad file.

* CRITICAL FIX: wrapper order because of untagged....

* Remove prints.

* Fixing <16 byte fallback.
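
A sketch of the decoder side via the Python binding:

```python
from tokenizers import decoders

# Model side, for reference: BPE(byte_fallback=True) emits <0xXX> tokens
# instead of unk. The decoder turns those byte tokens back into text.
decoder = decoders.ByteFallback()
print(decoder.decode(["<0x61>", "<0x62>", "c"]))  # "abc"
print(decoder.decode(["<0xE2>"]))                 # "�" (invalid UTF-8)
```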
2023-03-23 16:04:32 +01:00
e76f900bc0 Faster datasets train example
Using `.iter()` is much faster than accessing rows by id.
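
A sketch of the pattern, assuming a recent `datasets` with `Dataset.iter` (dataset name is only an example):

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, trainers

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    # .iter() streams batches; indexing dataset[i] row by row is far slower.
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

tokenizer = Tokenizer(models.BPE())
tokenizer.train_from_iterator(batch_iterator(), trainer=trainers.BpeTrainer())
```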
2023-03-23 11:24:30 +01:00
b8fbea00a9 Bump dirs from 3.0 to 4.0 (#1142) 2023-03-21 10:32:02 +01:00
5ecd329503 Fixing infinite loop in UnigramTrainer. (#1182)
* Fixing infinite loop in UnigramTrainer.

* Newer clippy.
2023-03-15 14:59:01 +01:00
9c0e700212 Bump webpack in /tokenizers/examples/unstable_wasm/www (#1181)
Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.76.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.75.0...v5.76.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-03-15 10:54:26 +01:00
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues reported by black
2023-03-08 11:27:47 +01:00
3138657565 Using clippy 1.67 (#1167) 2023-03-02 12:28:39 +01:00
ac552ff8b9 Update model.rs (#1166) 2023-02-28 17:35:57 +01:00
fa66caf0ab Improved version. (#1154)
* Improved version.

* Clippy.
2023-01-23 16:35:19 +01:00
d09241fba1 Prevent using from_pretrained on invalid ids (better error message). (#1153) 2023-01-23 15:38:14 +01:00
b861d48b06 Making Tokenizer clone. (#1152) 2023-01-23 10:12:35 +01:00
1fcd90b0b7 Update info on environment variable for threading (#1150)
* Update env var name for threading

* Update env var name for threading
2023-01-22 21:24:41 +01:00
33a57e6418 Made dirs optional (#1148) 2023-01-18 09:29:15 +01:00
daf8aebd76 Adding python 3.8 for M1 (#1147) 2023-01-16 16:40:46 +01:00
5a94a2b6e7 Add missing build targets (#1145)
* M1 3.11 was not out; neither was windows amd64.

* python@v4.

* Actually upload.

* Update needs.

* Preparing the actual PR.
2023-01-15 10:18:08 +01:00
fe4ae7dc38 Bump json5 from 2.2.0 to 2.2.3 in /bindings/node (#1140)
Bumps [json5](https://github.com/json5/json5) from 2.2.0 to 2.2.3.
- [Release notes](https://github.com/json5/json5/releases)
- [Changelog](https://github.com/json5/json5/blob/main/CHANGELOG.md)
- [Commits](https://github.com/json5/json5/compare/v2.2.0...v2.2.3)

---
updated-dependencies:
- dependency-name: json5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-03 11:50:51 +01:00
c3fedd96b3 Bump json5, copy-webpack-plugin, webpack and webpack-cli (#1139)
Removes [json5](https://github.com/json5/json5). It's no longer used after updating ancestor dependencies [json5](https://github.com/json5/json5), [copy-webpack-plugin](https://github.com/webpack-contrib/copy-webpack-plugin), [webpack](https://github.com/webpack/webpack) and [webpack-cli](https://github.com/webpack/webpack-cli). These dependencies need to be updated together.


Removes `json5`

Updates `copy-webpack-plugin` from 5.1.2 to 11.0.0
- [Release notes](https://github.com/webpack-contrib/copy-webpack-plugin/releases)
- [Changelog](https://github.com/webpack-contrib/copy-webpack-plugin/blob/master/CHANGELOG.md)
- [Commits](https://github.com/webpack-contrib/copy-webpack-plugin/compare/v5.1.2...v11.0.0)

Updates `webpack` from 4.46.0 to 5.75.0
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v4.46.0...v5.75.0)

Updates `webpack-cli` from 3.3.12 to 5.0.1
- [Release notes](https://github.com/webpack/webpack-cli/releases)
- [Changelog](https://github.com/webpack/webpack-cli/blob/master/CHANGELOG.md)
- [Commits](https://github.com/webpack/webpack-cli/compare/v3.3.12...webpack-cli@5.0.1)

---
updated-dependencies:
- dependency-name: json5
  dependency-type: indirect
- dependency-name: copy-webpack-plugin
  dependency-type: direct:development
- dependency-name: webpack
  dependency-type: direct:development
- dependency-name: webpack-cli
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-03 10:22:49 +01:00