tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-23 16:49:27 +00:00

Author	SHA1	Message	Date
Arthur Zucker	058e34b421	make special editable as well	2023-09-04 20:54:29 +00:00
Arthur Zucker	2291c89896	`python stub.py`	2023-09-04 19:49:36 +00:00
Arthur Zucker	c599db1421	nits	2023-09-04 19:11:19 +00:00
Arthur Zucker	d4008b0d7a	cliipy	2023-09-04 19:11:05 +00:00
Arthur Zucker	b117ac7f16	updates	2023-09-04 19:10:22 +00:00
Arthur Zucker	a53dff9bc5	make content writable in python	2023-09-04 18:18:21 +00:00
Arthur Zucker	d9829cdc6e	fix more tests	2023-09-04 17:22:27 +00:00
Arthur Zucker	39bd27e673	fix build	2023-09-01 21:22:07 +00:00
Arthur Zucker	9f0c703f03	update init and src for bingings python	2023-09-01 21:07:01 +00:00
Arthur Zucker	345b4eba96	updates	2023-09-01 18:41:36 +00:00
Nicolas Patry	8e522a38d9	Updating the docs with the new command. (#1333 )	2023-08-29 13:15:26 +02:00
Nicolas Patry	d2010d5165	Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. (#1331 ) * Move to maturing mimicking move for `safetensors`. * Tmp. * Fix sdist. * Wat? * Clippy 1.72 * Remove if. * Conda sed. * Fix doc check workflow. * Moving to maturin AND removing http + openssl mess (smoothing transition moving to `huggingface_hub`) * Fix dep * Black. * New node bindings. * Fix docs + node cache ? * Yarn. * Working dir. * Extension module. * Put back interpreter. * Remove cache. * New attempt * Multi python. * Remove FromPretrained. * Remove traces of `fromPretrained`. * Drop 3.12 for windows? * Typo. * Put back the default feature for ignoring links during simple test. * Fix ? * x86_64 -> x64. * Remove warning for windows bindings. * Excluse aarch. * Include/exclude. * Put back workflows in correct states.	2023-08-28 16:24:14 +02:00
Nicolas Patry	f08058ab2b	Reduce number of different revisions by 1 (#1329 )	2023-08-23 15:57:36 +02:00
Arthur	d0bb35d5a6	Merge pull request #1316 from boyleconnor/add-expect-for-no-truncation Add `expect()` for disabling truncation	2023-08-18 19:30:53 +02:00
Michael Lui	540bf2eb01	pyo3: update to 0.19 (#1322 ) * Bump pyo3 dependency versions * Fix deprecation warnings from pyo3 --------- Co-authored-by: Mike Lui <mikelui@meta.com>	2023-08-16 18:40:32 +02:00
Nicolas Patry	9a93c50c25	Fix stride condition. (#1321 ) * Release all at once for simplicity. * rc2	2023-08-14 15:27:55 +02:00
Nicolas Patry	fb292d1eae	0.13.4.rc1 (#1319 )	2023-08-14 12:06:43 +02:00
Connor Boyle	748556a9ed	Fix code style	2023-08-07 15:17:43 -07:00
Connor Boyle	a0a8ebe03f	Add `expect()` for disabling truncation	2023-08-06 13:25:50 -07:00
Kelly Marchisio	efea6c7246	Handle when precompiled charsmap is empty (#1308 ) * Handle when precompiled charsmap is empty * Black --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-07-31 14:35:24 +02:00
Connor Boyle	c2664ae13f	Give error when initializing tokenizer with too high stride (#1306 ) * Split `get_n_added_tokens` into separate method * Modify `TokenizerImpl.with_truncation()` to raise an error if given bad parameters * Return Python error if `tokenizer.with_truncation()` fails * Add dummy variable assignment for `no_truncation()` case * Unrelated fmt fix. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-07-28 09:16:44 +02:00
Nicolas Patry	291b2e23ae	Fixing clippy warnings on 1.71. (#1296 ) * Fixing clippy warnings on 1.71. * Fix. * Fmt. * Python clippy. * Should really set my env back again. * Fix.	2023-07-16 15:58:38 +02:00
Kelly Marchisio	4811f769a1	import Tuple from typing (#1295 )	2023-07-14 17:39:29 +02:00
Hiroshi Matsuda	26659de473	revise type specification (#1289 )	2023-07-06 16:36:48 +02:00
Arthur	864135bef1	Add unigram bytefallback (#1217 ) * current updates will go red * cargo fmt * npm install * refactor train for unigram to allow bytefallbakc (breaking) * fmt * nits * update * add a proper test * fix encode optimised fallback + add trainer arg * fixes * fixes * fix tests * add test * fmt * fix rust test * update python bindings * update * pub is okay and needed * more fix * cleanup * remove useles id * MissingUnkId error * nits * fix offset * add a test in python * update src bindings * remove bytefallback from trainer * styling * update pckg * lint * fmt * stup with dev * update code based on review * remove unused function * udpate python test to compare ids * fix option bool issues * final fix * clippy * fix npm isntall * update * update test * more in depth testing * Lint * last attempt to fix node * update node bindings * fmt * Update tokenizers/src/models/unigram/model.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * update based on review * simpler test * lint --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-06-26 10:46:59 +02:00
Chris Ha	cb8d4de599	fix documentation regarding regex (#1264 ) * fix documentation regarding regex Split() in pre_tokenizers.rs and normalizations take a regex that is required to be built with a tokenizer specific regex module. Clarify this in the documentation. * Update __init__.pyi fixed __init__.pyi * Update bindings/python/py_src/tokenizers/__init__.pyi Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update bindings/python/py_src/tokenizers/__init__.pyi Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Revert "Update bindings/python/py_src/tokenizers/__init__.pyi" This reverts commit 6e8bdfcddf67bcdd8e3b1a78685fd5ef8f6a153c. * Revert "Update bindings/python/py_src/tokenizers/__init__.pyi" This reverts commit 897b0c0de471ad7cb6269b8456347c4e5cff2aaf. * Revert "Update __init__.pyi" This reverts commit fbe82310b7728ee7cdb6f8b38fbc2388f9d95771. * add codeblocks the right way * add codeblocks with stub.py ran setup.py install to build, and then ran stub.py --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>	2023-06-07 09:41:28 +02:00
Funtowicz Morgan	b4fcc9ce6e	Makes `decode` and `decode_batch` work on borrowed content. (#1251 ) * Makes `decode` and `decode_batch` work on borrowed content. * Make `decode_batch` work with borrowed content. * Fix lint. * Attempt to map it into Node. * Second attempt. * Step by step. * One more step. * Fix lint. * Please ... * Removing collect. * Revert "Removing collect." This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-05-17 11:18:15 +02:00
Chris Ha	cefc41e8ec	implement a simple max_sentencepiece_length into BPE (#1228 ) * implement a simple max_sentencepiece_length into BPE Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way. If this code were to be actually integrated some works to be done Documentation describing the behavior and how it should be set. Set default==0 so it doesnt act unless set provide ways in the python binding for the user to set max token length I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus. * implement a simple max_sentencepiece_length into BPE Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way. If this code were to be actually integrated some works to be done Documentation describing the behavior and how it should be set. Set default==0 so it doesnt act unless set provide ways in the python binding for the user to set max token length I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus. * utilize Option<u16> for safer code. * Other version. * Update trainer.rs clarify with type usize propagate max_length option * change max_length into more descriptive name in the documentation https://huggingface.co/docs/tokenizers/api/trainers unigramtrainer uses max_piece_length for similar function. since BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive. 
* change variable name in trainer.rs change max_merge_length into max_token_length * Update trainer.rs add several max_token_length declaration that were missing. impl BpeTrainerBuilder struct BpeTrainer Add explanation for variable shadowing. * Update trainer.rs Move default definition of max_token_length to proper location. adjust downstream variable initializations accordingly. * add max_token_length test * Add bpe direct assert test * Update trainer.rs clarified test documentation * Creating the bindings. * Fix the default. * Re-adding missing package-lock which I accidentally removed. * .. * Fixing trainer test. * Fix. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-05-16 10:08:19 +02:00
Nicolas Patry	ef5f50605d	Printing warning to stderr. (#1222 )	2023-04-19 14:55:24 +02:00
Arthur	ce244bd094	remove rc1	2023-04-04 16:19:42 +02:00
Nicolas Patry	1cb44bd180	New version 0.13.3	2023-04-04 14:14:17 +02:00
Nicolas Patry	3aaf4946b3	Add `content` to Strip decoder to allow decoding mid tokens. (#1199 ) * Add `content` to Strip decoder to allow decoding mid tokens. * Stub. * Clippy.	2023-03-24 10:14:49 +01:00
Nicolas Patry	e4aea890d5	Adding 2 new decoders: (#1196 ) * Adding 2 new decoders: - Fuse will simply concatenate all tokens into 1 string - Strip will remove n char from left or right Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want for the `Metaspace` thing. - Note: Added a new dependency from better parsing of decoders. This is due to untagged enums which can match anything the `MustBe` ensure there's no issue between Fuse and ByteFallback. Since both are new the chances for backward incompatibility is low. * Fixing picking/unpickling (using default args.). * Stub. * Black. * Fixing node.	2023-03-24 00:50:54 +01:00
Nicolas Patry	d2c8190a0f	Creating `normalizers.Prepend` (To be used instead of `Metaspace`). (#1194 ) * Creating `normalizers.Prepend` (To be used instead of `Metaspace`). * Linting + stub. * Fixing pickling/unpickling by setting a default. * Black.	2023-03-24 00:33:31 +01:00
Nicolas Patry	250d46c676	Adding `Replace` to decoder (to undo the Replace Normalizer for (#1195 ) Metaspace split).	2023-03-23 23:43:47 +01:00
Quentin Lhoest	178e294a6a	Merge pull request #1192 from huggingface/faster-datasets-train-example Faster `datasets` train example	2023-03-23 17:19:05 +01:00
Nicolas Patry	73637a0004	Adding ByteFallback support for `tokenizers`. (#1183 ) * Adding ByteFallback support for `tokenizers`. Two items added: - A flag `byte_fallback` for the `BPE` model. This will be in charge of using `<0x61>` instead of unk on unknown tokens. - A ByteFallback decoder, which will be in charge of putting everything back into string whenever possible. Showing � when the byte decoding fails (behavior checked against LlamaTokenizer in `transformers`. * Update rustdoc. * Clippy + Add BPE(byte_fallback) into bindings. * Stupid file. * Test artifacts removed. * Update stub. * Fix. * Bad file. * CRITICAL FIX: wrapper order because of untagged.... * Remove prints. * Fixing <16 byte fallback.	2023-03-23 16:04:32 +01:00
Quentin Lhoest	e76f900bc0	Faster `datasets` train example Using .iter() is much faster than accessing using row ids	2023-03-23 11:24:30 +01:00
mert-kurttutan	5c18ec5ff5	pyo3 v0.18 migration (#1173 ) * pyo v0.18 migration * Fix formatting issues of black	2023-03-08 11:27:47 +01:00
SeongBeomLEE	9b155b5723	[FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. (#1136 ) * [fix] Use unk_token In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. * [fix] If unk_token is None, this case is also considered. * Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * [FIX] In CharBPETokenizer, Use unk_token. In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. * Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-12-27 11:13:52 +01:00
Roy Hvaara	4d520c9664	Ignore Cargo.lock for subfolders (#1131 )	2022-12-25 11:35:47 +01:00
Roy Hvaara	fbad581128	Bump derive_builder from 0.9 to 0.12 (#1129 )	2022-12-23 23:37:16 +01:00
SeongBeomLEE	9a25b2cb8e	[FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. (#1120 ) * [fix] Use unk_token In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. * [fix] If unk_token is None, this case is also considered. * Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-12-19 13:40:04 +01:00
Nicolas Patry	bbae829a72	Adding rust audit. (#1099 ) * Adding rust audit. * Update clap version + derive_builder (they clashed). * Ignoring specific CVE which can be ignored https://github.com/Azure/iot-identity-service/issues/481 * Updating python lock. * Revert `derive-builder` update. * Adding back help msg.	2022-11-09 12:59:36 +01:00
Nicolas Patry	b8a4aa6000	Fixing extra wheels memory usage. (#1098 )	2022-11-07 09:11:18 +01:00
Cameron	11bb2e00f2	Add python 3.11 to manylinux buildwheels (#1096 ) * Add python 3.11 to manylinux buildwheels * Fixing clippy. * Node clippy. * Python clippy. * Changelog + version number update. Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-11-07 08:45:04 +01:00
Nicolas Patry	96a9e5715c	New version. (#1082 ) * New version. The actual release will happen before PyO3 0.17.2 because the tests were ran before than. * Manylinux2014 necessary now with Rust 1.64.	2022-10-06 15:45:56 +02:00
David Hewitt	8129dd3309	pyo3: update to 0.17 (#1066 ) * python: update bindings to edition 2021 * python: update to pyo3 0.17 * Updating testing. Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-10-05 16:59:01 +02:00
Nicolas Patry	6113666624	Updating python formatting. (#1079 ) * Updating python formatting. * Forgot gh action. * Skipping isort to prevent circular imports. * Updating stub. * Removing `isort` (it contradicts `stub.py`). * Fixing weird stub black/isort disagreeement.	2022-10-05 15:29:33 +02:00
Nicolas Patry	5f6e978452	Fixing roberta type id (everything is zero). (#1072 ) * Fixing roberta type ids (everything is zero). * We need to fix type_ids for all sequence even when not changing anything else. * Fixing tests hopefully better.	2022-09-26 18:00:41 +02:00

724 Commits