Commit Graph

1703 Commits

Author SHA1 Message Date
2b72017e17 correctly compute the new id: we take the max of the AddedToken + get_vocab_size 2023-09-01 19:03:33 +00:00
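The id computation described in this commit can be sketched as follows; `next_added_token_id` is a hypothetical helper illustrating the stated rule (max of the existing AddedToken ids and `get_vocab_size`), not the library's actual function:

```python
from typing import List

def next_added_token_id(added_token_ids: List[int], vocab_size: int) -> int:
    # Hypothetical sketch of the rule in the commit message: the next id
    # is the larger of (max existing added-token id + 1) and the base
    # vocabulary size reported by get_vocab_size.
    if not added_token_ids:
        return vocab_size
    return max(max(added_token_ids) + 1, vocab_size)
```

This avoids id collisions whether added tokens sit above the base vocabulary or overlap it.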
db319492f7 clippy 2023-09-01 18:57:39 +00:00
2dca476810 fix some tests 2023-09-01 18:48:50 +00:00
6cca5716af fix one test? 2023-09-01 18:42:30 +00:00
345b4eba96 updates 2023-09-01 18:41:36 +00:00
8e522a38d9 Updating the docs with the new command. (#1333) 2023-08-29 13:15:26 +02:00
d2010d5165 Move to maturin, mimicking the move for safetensors. + Rewritten node bindings. (#1331)
* Move to maturin, mimicking the move for `safetensors`.

* Tmp.

* Fix sdist.

* Wat?

* Clippy 1.72

* Remove if.

* Conda sed.

* Fix doc check workflow.

* Moving to maturin AND removing http + openssl mess (smoothing the transition
to `huggingface_hub`)

* Fix dep

* Black.

* New node bindings.

* Fix docs + node cache ?

* Yarn.

* Working dir.

* Extension module.

* Put back interpreter.

* Remove cache.

* New attempt

* Multi python.

* Remove FromPretrained.

* Remove traces of `fromPretrained`.

* Drop 3.12 for windows?

* Typo.

* Put back the default feature for ignoring links during simple test.

* Fix ?

* x86_64 -> x64.

* Remove warning for windows bindings.

* Exclude aarch.

* Include/exclude.

* Put back workflows in correct states.
2023-08-28 16:24:14 +02:00
f2952020d5 Python 38 arm (#1330) 2023-08-23 16:29:16 +02:00
f08058ab2b Reduce number of different revisions by 1 (#1329) 2023-08-23 15:57:36 +02:00
6c350d88fe Re-using scripts from safetensors. (#1328) 2023-08-23 15:37:38 +02:00
d0bb35d5a6 Merge pull request #1316 from boyleconnor/add-expect-for-no-truncation
Add `expect()` for disabling truncation
2023-08-18 19:30:53 +02:00
540bf2eb01 pyo3: update to 0.19 (#1322)
* Bump pyo3 dependency versions

* Fix deprecation warnings from pyo3

---------

Co-authored-by: Mike Lui <mikelui@meta.com>
2023-08-16 18:40:32 +02:00
9a93c50c25 Fix stride condition. (#1321)
* Release all at once for simplicity.

* rc2
2023-08-14 15:27:55 +02:00
b35d33f981 Release all at once for simplicity. (#1320) 2023-08-14 13:49:45 +02:00
fb292d1eae 0.13.4.rc1 (#1319) 2023-08-14 12:06:43 +02:00
862046ac94 CD backports (#1318)
* CD backports

follow
huggingface/safetensors#317

* fix node bindings?

`cargo check` doesn't work on my local configuration from `tokenizers/bindings/node/native`.
I don't think it will be a problem, but I have difficulty telling.

* backport #315

* safetensors#317 back ports
2023-08-10 18:52:22 +02:00
748556a9ed Fix code style 2023-08-07 15:17:43 -07:00
d47d3e377c Derive clone for TrainerWrapper (#1317) 2023-08-07 15:15:10 +02:00
a0a8ebe03f Add expect() for disabling truncation 2023-08-06 13:25:50 -07:00
efea6c7246 Handle when precompiled charsmap is empty (#1308)
* Handle when precompiled charsmap is empty

* Black

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-31 14:35:24 +02:00
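The guard added in #1308 can be sketched like this; `precompiled_normalize` is an illustrative helper (the charsmap is modeled as a plain character-replacement dict for the sketch), not the library's actual normalizer:

```python
def precompiled_normalize(text: str, charsmap: dict) -> str:
    # Sketch of the fix above: an empty precompiled charsmap is treated
    # as a no-op rather than an error, so normalization simply returns
    # the input unchanged.
    if not charsmap:
        return text
    return "".join(charsmap.get(ch, ch) for ch in text)
```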
c2664ae13f Give error when initializing tokenizer with too high stride (#1306)
* Split `get_n_added_tokens` into separate method

* Modify `TokenizerImpl.with_truncation()` to raise an error if given bad parameters

* Return Python error if `tokenizer.with_truncation()` fails

* Add dummy variable assignment for `no_truncation()` case

* Unrelated fmt fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-28 09:16:44 +02:00
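The validation introduced in #1306 can be sketched as follows; this is a hypothetical Python rendering of the kind of check described (a stride at or above `max_length` leaves no room for new content in each truncated window), not the exact Rust condition:

```python
def with_truncation(max_length: int, stride: int) -> dict:
    # Hypothetical sketch of the validation in #1306: reject parameter
    # combinations where the overlap (stride) would consume the whole
    # window, instead of silently misbehaving later.
    if stride >= max_length:
        raise ValueError(
            f"stride ({stride}) must be strictly less than max_length ({max_length})"
        )
    return {"max_length": max_length, "stride": stride}
```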
bb38f390a6 Single warning for holes. (#1303)
* Single warning for holes.

* Dummy.
2023-07-25 12:57:23 +02:00
d6326b2b88 feat: Added CITATION.cff. (#1302) 2023-07-25 12:16:09 +02:00
ea4d3f634c Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node (#1299)
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-21 08:08:10 +02:00
291b2e23ae Fixing clippy warnings on 1.71. (#1296)
* Fixing clippy warnings on 1.71.

* Fix.

* Fmt.

* Python clippy.

* Should really set my env back again.

* Fix.
2023-07-16 15:58:38 +02:00
4811f769a1 import Tuple from typing (#1295) 2023-07-14 17:39:29 +02:00
150559b61e master -> main (#1292) 2023-07-12 11:51:22 +02:00
92bfb9c993 Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node (#1291)
Bumps [tough-cookie](https://github.com/salesforce/tough-cookie) from 4.0.0 to 4.1.3.
- [Release notes](https://github.com/salesforce/tough-cookie/releases)
- [Changelog](https://github.com/salesforce/tough-cookie/blob/master/CHANGELOG.md)
- [Commits](https://github.com/salesforce/tough-cookie/compare/v4.0.0...v4.1.3)

---
updated-dependencies:
- dependency-name: tough-cookie
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-10 09:44:31 +02:00
26659de473 revise type specification (#1289) 2023-07-06 16:36:48 +02:00
864135bef1 Add unigram bytefallback (#1217)
* current updates will go red

* cargo fmt

* npm install

* refactor train for unigram to allow bytefallback (breaking)

* fmt

* nits

* update

* add a proper test

* fix encode optimised fallback + add trainer arg

* fixes

* fixes

* fix tests

* add test

* fmt

* fix rust test

* update python bindings

* update

* pub is okay and needed

* more fix

* cleanup

* remove useless id

* MissingUnkId error

* nits

* fix offset

* add a test in python

* update src bindings

* remove bytefallback from trainer

* styling

* update pckg

* lint

* fmt

* setup with dev

* update code based on review

* remove unused function

* update python test to compare ids

* fix option bool issues

* final fix

* clippy

* fix npm install

* update

* update test

* more in depth testing

* Lint

* last attempt to fix node

* update node bindings

* fmt

* Update tokenizers/src/models/unigram/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update based on review

* simpler test

* lint

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-06-26 10:46:59 +02:00
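The byte-fallback idea added in #1217 follows the SentencePiece convention of emitting one `<0xXX>` token per UTF-8 byte when a piece is out of vocabulary, so no input maps to a bare unknown token. A minimal sketch (`byte_fallback` is an illustrative helper, not the library API):

```python
def byte_fallback(piece: str, vocab: set) -> list:
    # If the piece is in the vocabulary, keep it as-is; otherwise fall
    # back to one <0xXX> token per UTF-8 byte (SentencePiece convention).
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]
```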
8c9cfb0b68 Improve error for truncation with too high stride (#1275) 2023-06-12 10:38:42 +02:00
348ed70e58 [doc build] Use secrets (#1273) 2023-06-09 12:58:27 +02:00
5d70f15bfb Update README.md - Broken link (#1272)
* Update README.md - Broken link

fixed "python documentation" link

* Update README.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-08 10:20:11 +02:00
f85e8467e4 Update Cargo.toml (#1266)
`cargo update` yields the following:
Updating regex v1.8.1 -> v1.8.3
Updating regex-syntax v0.7.1 -> v0.7.2
2023-06-07 09:57:18 +02:00
cb8d4de599 fix documentation regarding regex (#1264)
* fix documentation regarding regex

Split() in pre_tokenizers.rs and normalizers take a regex that must be built with the tokenizer-specific regex module.
Clarify this in the documentation.

* Update __init__.pyi

fixed __init__.pyi

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 6e8bdfcddf67bcdd8e3b1a78685fd5ef8f6a153c.

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 897b0c0de471ad7cb6269b8456347c4e5cff2aaf.

* Revert "Update __init__.pyi"

This reverts commit fbe82310b7728ee7cdb6f8b38fbc2388f9d95771.

* add codeblocks the right way

* add codeblocks with stub.py

ran setup.py install to build, and then ran stub.py

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-07 09:41:28 +02:00
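The distinction documented in #1264 — a plain string pattern is matched literally, while a `tokenizers.Regex` is compiled as a regular expression — can be sketched with the standard `re` module (illustrative helpers, not the library API):

```python
import re

def split_literal(text: str, pattern: str) -> list:
    # A plain-string pattern is escaped and matched literally.
    return [p for p in re.split(re.escape(pattern), text) if p]

def split_regex(text: str, pattern: str) -> list:
    # A Regex-wrapped pattern is compiled as a real regular expression.
    return [p for p in re.split(pattern, text) if p]
```

With the pattern `r"\d"`, the literal variant finds no backslash-d substring and leaves the text whole, while the regex variant splits on every digit.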
c7102c4c0f Fixing broken link. (#1268) 2023-06-06 11:10:28 +02:00
cb819724ef Update trainer.rs (#1257)
Implement skip for empty sentences.
Refer to:
https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L373
2023-05-25 12:24:29 +02:00
fc76ad4f07 Parallelize unigram trainer (#976)
* Parallelize unigram trainer

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>

* Rm unused lifetime

---------

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
2023-05-22 15:36:03 +02:00
a03330607b Update all GH Actions with dependency on actions/checkout from v[1,2] to v3 to notably improve performance (retrieve only the commit being checked-out) (#1256) 2023-05-22 14:50:00 +02:00
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content.

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Document the behavior and how it should be set.
Set default == 0 so it doesn't act unless set.
Provide ways in the Python binding for the user to set max token length.

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.


* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

clarify with type usize; propagate max_length option

* change max_length into more descriptive name

in the documentation
https://huggingface.co/docs/tokenizers/api/trainers
unigramtrainer uses max_piece_length for a similar function.
since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move default definition of max_token_length to its proper location. Adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
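The skip condition described in #1228 can be sketched as follows; `allowed_merge` is a hypothetical helper mirroring the described behavior (option unset means no limit), not the trainer's actual code:

```python
from typing import Optional

def allowed_merge(left: str, right: str, max_token_length: Optional[int]) -> bool:
    # Sketch of the condition above: with max_token_length unset, every
    # merge is allowed; otherwise merges whose result would exceed the
    # limit are skipped during BPE training.
    if max_token_length is None:
        return True
    return len(left) + len(right) <= max_token_length
```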
daf3fcc976 Revert main hiccup. 2023-05-15 18:01:29 +02:00
b58227c7f1 Never gonna make you cry 2023-05-12 16:28:57 +02:00
02ad59edc1 Never gonna run around and desert you 2023-05-12 16:27:06 +02:00
8d07696c38 Never gonna let you down 2023-05-12 16:24:26 +02:00
4518b0f7f2 fix unigram.rs test_sample() (#1244)
87230bb59b/tokenizers/tests/unigram.rs (LL71C1-L71C53)

When running cargo test --release, the above line causes an error.

referring to 87230bb59b/tokenizers/src/models/unigram/lattice.rs (L138)

It seems that Lattice::from should only take 3 arguments.
If I had to guess, it should be Lattice::from("ABC", 0, 2);
This change makes cargo test --release pass without error.
2023-05-10 17:04:34 +02:00
87230bb59b use LTO for release and benchmark builds (#1157) 2023-05-09 16:15:57 +02:00
15085ef905 Fixing padding_left sequence_ids. (#1233) 2023-05-04 15:57:20 +02:00
ef5f50605d Printing warning to stderr. (#1222) 2023-04-19 14:55:24 +02:00
d19bc63c67 Merge pull request #1212 from huggingface/fix-node-release
Fix node release
2023-04-06 16:25:29 +02:00