797 Commits

Chris Ha
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Documentation describing the behavior and how it should be set.
Set the default to 0 so it doesn't act unless set.
Provide ways in the Python bindings for the user to set the max token length.

I was trying to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.

* implement a simple max_sentencepiece_length into BPE

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

Clarify with type usize; propagate the max_length option.

* Change max_length into a more descriptive name

In the documentation (https://huggingface.co/docs/tokenizers/api/trainers),
UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

Add several max_token_length declarations that were missing.
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
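
A minimal sketch of how the option described in the commit above might be used from the Python bindings, assuming it is exposed as a max_token_length parameter on BpeTrainer (the name is taken from the commit discussion, not verified against a released API):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Assumed parameter name: merges that would create tokens longer than
# 16 characters are skipped, mirroring SentencePiece's default behavior.
trainer = BpeTrainer(vocab_size=1000, max_token_length=16)

tokenizer.train_from_iterator(["hello world", "hello tokenizers"], trainer=trainer)
```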
Nicolas Patry
ef5f50605d Printing warning to stderr. (#1222) 2023-04-19 14:55:24 +02:00
Arthur
ce244bd094 remove rc1 2023-04-04 16:19:42 +02:00
Nicolas Patry
1cb44bd180 New version 0.13.3 2023-04-04 14:14:17 +02:00
Nicolas Patry
3aaf4946b3 Add content to Strip decoder to allow decoding mid tokens. (#1199)
* Add `content` to Strip decoder to allow decoding mid tokens.

* Stub.

* Clippy.
2023-03-24 10:14:49 +01:00
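
A small sketch of the change above, assuming the Python Strip decoder exposes the new content character alongside the existing left/right counts:

```python
from tokenizers import decoders

# Assumed signature: strip up to one leading occurrence of `content`
# from each token while decoding.
decoder = decoders.Strip(content="_", left=1, right=0)
```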
Nicolas Patry
e4aea890d5 Adding 2 new decoders: (#1196)
* Adding 2 new decoders:

- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n char from left or right

Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.

- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; `MustBe`
ensures there's no confusion between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.

* Fixing pickling/unpickling (using default args).

* Stub.

* Black.

* Fixing node.
2023-03-24 00:50:54 +01:00
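
A sketch of the Metaspace-style decoding chain the commit describes, assuming the Python bindings expose Replace, Fuse, Strip and Sequence decoders with these signatures:

```python
from tokenizers import decoders

# Undo a Metaspace-style pre-tokenization: map "▁" back to spaces,
# fuse all tokens into a single string, then strip one leading space.
decoder = decoders.Sequence(
    [
        decoders.Replace("▁", " "),
        decoders.Fuse(),
        decoders.Strip(content=" ", left=1, right=0),
    ]
)
```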
Nicolas Patry
d2c8190a0f Creating normalizers.Prepend (To be used instead of Metaspace). (#1194)
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).

* Linting + stub.

* Fixing pickling/unpickling by setting a default.

* Black.
2023-03-24 00:33:31 +01:00
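
A minimal sketch of the new normalizer, assuming it is exposed in the Python bindings as normalizers.Prepend:

```python
from tokenizers import normalizers

# Prepend "▁" and replace spaces, approximating what Metaspace did as a
# pre-tokenizer, but at the normalizer stage.
normalizer = normalizers.Sequence(
    [
        normalizers.Prepend("▁"),
        normalizers.Replace(" ", "▁"),
    ]
)
```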
Nicolas Patry
250d46c676 Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). (#1195)
2023-03-23 23:43:47 +01:00
Quentin Lhoest
178e294a6a Merge pull request #1192 from huggingface/faster-datasets-train-example
Faster `datasets` train example
2023-03-23 17:19:05 +01:00
Nicolas Patry
73637a0004 Adding ByteFallback support for tokenizers. (#1183)
* Adding ByteFallback support for `tokenizers`.

Two items added:

- A flag `byte_fallback` for the `BPE` model. This will be in charge
  of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
  back into strings whenever possible, showing � when the byte decoding
  fails (behavior checked against LlamaTokenizer in `transformers`).

* Update rustdoc.

* Clippy + Add BPE(byte_fallback) into bindings.

* Stupid file.

* Test artifacts removed.

* Update stub.

* Fix.

* Bad file.

* CRITICAL FIX: wrapper order because of untagged....

* Remove prints.

* Fixing <16 byte fallback.
2023-03-23 16:04:32 +01:00
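
A short sketch of the two additions, assuming the Python bindings expose a byte_fallback flag on the BPE model and a ByteFallback decoder:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE

# byte_fallback=True: unknown tokens become byte tokens such as <0x61>
# (those byte tokens must be present in the vocabulary) instead of unk.
tokenizer = Tokenizer(BPE(byte_fallback=True))

# The ByteFallback decoder maps <0xNN> tokens back to bytes, showing �
# when the resulting bytes are not valid UTF-8.
tokenizer.decoder = decoders.ByteFallback()
```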
Quentin Lhoest
e76f900bc0 Faster datasets train example
Using .iter() is much faster than accessing rows by id.
2023-03-23 11:24:30 +01:00
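
For the training example mentioned above, the faster pattern might look like the following, assuming a datasets Dataset with a "text" column (dataset name and column are illustrative, not taken from the commit):

```python
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    # Batched iteration avoids the per-row indexing overhead.
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.train_from_iterator(
    batch_iterator(), trainer=BpeTrainer(vocab_size=25000), length=len(dataset)
)
```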
mert-kurttutan
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
SeongBeomLEE
9b155b5723 [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. (#1136)
* [fix] Use unk_token

In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used.

* [fix] If unk_token is None, this case is also considered.

* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* [FIX] In CharBPETokenizer, Use unk_token.

In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used.

* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-12-27 11:13:52 +01:00
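
A small sketch of the situation the fix targets, assuming the implementation class keeps its current constructor arguments:

```python
from tokenizers.implementations import CharBPETokenizer

# No vocab/merges: the tokenizer will be trained from scratch, and the
# provided unk_token should still be honored (the point of the fix).
tokenizer = CharBPETokenizer(unk_token="<unk>")
tokenizer.train_from_iterator(["some training text"], vocab_size=100)
```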
Roy Hvaara
4d520c9664 Ignore Cargo.lock for subfolders (#1131) 2022-12-25 11:35:47 +01:00
Roy Hvaara
fbad581128 Bump derive_builder from 0.9 to 0.12 (#1129) 2022-12-23 23:37:16 +01:00
SeongBeomLEE
9a25b2cb8e [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. (#1120)
* [fix] Use unk_token

In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used.

* [fix] If unk_token is None, this case is also considered.

* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-12-19 13:40:04 +01:00
Nicolas Patry
bbae829a72 Adding rust audit. (#1099)
* Adding rust audit.

* Update clap version + derive_builder (they clashed).

* Ignoring specific CVE which can be ignored

https://github.com/Azure/iot-identity-service/issues/481

* Updating python lock.

* Revert `derive-builder` update.

* Adding back help msg.
2022-11-09 12:59:36 +01:00
Nicolas Patry
b8a4aa6000 Fixing extra wheels memory usage. (#1098) 2022-11-07 09:11:18 +01:00
Cameron
11bb2e00f2 Add python 3.11 to manylinux buildwheels (#1096)
* Add python 3.11 to manylinux buildwheels

* Fixing clippy.

* Node clippy.

* Python clippy.

* Changelog + version number update.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-11-07 08:45:04 +01:00
Nicolas Patry
96a9e5715c New version. (#1082)
* New version.

The actual release will happen *before* PyO3 0.17.2 because
the tests were run before then.

* Manylinux2014 necessary now with Rust 1.64.
2022-10-06 15:45:56 +02:00
David Hewitt
8129dd3309 pyo3: update to 0.17 (#1066)
* python: update bindings to edition 2021

* python: update to pyo3 0.17

* Updating testing.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-10-05 16:59:01 +02:00
Nicolas Patry
6113666624 Updating python formatting. (#1079)
* Updating python formatting.

* Forgot gh action.

* Skipping isort to prevent circular imports.

* Updating stub.

* Removing `isort` (it contradicts `stub.py`).

* Fixing weird stub black/isort disagreement.
2022-10-05 15:29:33 +02:00
Nicolas Patry
5f6e978452 Fixing roberta type id (everything is zero). (#1072)
* Fixing roberta type ids (everything is zero).

* We need to fix type_ids for all sequences even when not changing anything else.

* Fixing tests hopefully better.
2022-09-26 18:00:41 +02:00
Nicolas Patry
6e5569a540 Moving versions numbers to dev mode. (#1067) 2022-09-22 18:24:07 +02:00
Nicolas Patry
63082c4d11 Enabling static interpreter embedding for manylinux. (#1064)
* Removing dead file.

* Checking that we can distribute with static Python embedding for manylinux

* Many linux embed interpreter.

* Building wheels manylinux with static embedding

* Better script.

* typo.

* Using a dummy feature?

* default features ?

* Back into order.

* Fixing manylinux ??.

* Local dir.

* Missing star.

* Makedir ?

* Monkey coding this.

* extension module ?

* Building with default features `RustExtension`.

* bdist_wheel + rustextension any better ?

* update rust-py version.

* Forcing extension module.

* No default features.

* Remove py37 out of spite

* Revert "Remove py37 out of spite"

This reverts commit 6ab7facd792b59c2e30be82fe42816d24c32cf0d.

* Really extraneous feature.

* Fix build wheels.

* Putting things back in place.
2022-09-21 12:18:46 +02:00
Nicolas Patry
655f4057b7 Removing python3.6 from manylinux it's not supported anymore. (#1063) 2022-09-19 12:22:02 +02:00
Nicolas Patry
7bfab48979 Preparing rc1 release. (#1056)
* Preparing rc1 release.

* Fixing test_alignment_methods

* Fixing the overflowing sequence_id issue (LayoutLMv2 tests caught this).

* Adding overly complex overflowing test.
2022-09-12 16:07:06 +02:00
Nicolas Patry
06025e4ca1 Adding Sequence for PostProcessor. (#1052)
* Adding `Sequence` for `PostProcessor`.

* Fixing node? Writing in the dark here, don't have Python2.7

* `undefined` is not accepted.

* Other test.
2022-08-25 14:50:06 +02:00
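
A brief sketch of what the new chaining could look like, assuming the post-processor is exposed as processors.Sequence in the Python bindings:

```python
from tokenizers import processors

# Chain post-processors; each one consumes the output of the previous.
post_processor = processors.Sequence(
    [
        processors.ByteLevel(trim_offsets=True),
        processors.RobertaProcessing(sep=("</s>", 2), cls=("<s>", 0)),
    ]
)
```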
Nicolas Patry
460bdded80 Modify Processor trait to support chaining. (#1054)
0 modifications yet, everything will consume the vector.
Every test should be green without any modifications.
2022-08-24 19:49:23 +02:00
Nicolas Patry
b1c9bc68b5 Updating code according to clippy. (#1048)
- Adding `Eq` where possible
- Denied the ref deref warnings as they were spamming and the solution was not really better.
2022-08-24 19:45:15 +02:00
Nicolas Patry
adf90dcd72 Adding unstable_wasm feature + example to run tokenizers on wasm. (#1009)
* Adding `unstable_wasm` feature + example to run `tokenizers` on wasm.

Co-Authored-By: josephrocca <1167575+josephrocca@users.noreply.github.com>
Co-Authored-By: Matthias Brunel <matthias.brunel@mithrilsecurity.io>

* Adding some serialization tests.

* Updating with comments.

Co-authored-by: josephrocca <1167575+josephrocca@users.noreply.github.com>
Co-authored-by: Matthias Brunel <matthias.brunel@mithrilsecurity.io>
2022-06-10 14:58:02 +02:00
Nicolas Patry
943b5421aa Changing Decoder trait to be more composable. (#938) (#1008)
* Changing `Decoder` trait to be more composable. (#938)

* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.

* Adding `Sequence` Decoder.
2022-06-02 14:43:42 +02:00
h-vetinari
519cc13be0 Upgrade pyo3 to 0.16 (#956)
* Upgrade pyo3 to 0.15

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Upgrade pyo3 to 0.16

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Install Python before running cargo clippy

* Fix clippy warnings

* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`

* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`

* Fix some test cases

Why do they change?

* Refactor and add SAFETY comments to `PyArrayUnicode`

Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`

Co-authored-by: messense <messense@icloud.com>
2022-05-05 15:48:40 +02:00
Mishig Davaadorj
e6cd73a291 .dev0 suffix in python version (#987) 2022-04-22 09:36:18 +02:00
Mishig Davaadorj
95b5d066d5 Update doc build gh workflow to install rust 2022-04-21 09:20:20 +02:00
Mishig Davaadorj
c2aa87a256 Add setup.py extras["dev"] 2022-04-19 15:14:44 +02:00
Nicolas Patry
66c9af26f6 Fixing the documentation for ByteLevel in Python (#982)
* Fixing the documentation for `ByteLevel` in Python

* Python stub.py (after rebuilding ofc).
2022-04-14 16:29:50 +02:00
Nicolas Patry
8a9bb28f46 Preparing for 0.12.1 (#978)
* Preparing for 0.12.1

* Updated the changelog.
2022-04-12 17:57:33 +02:00
Nicolas Patry
ec43947786 Revert "Changing Decoder trait to be more composable. (#938)" (#971)
This reverts commit cdabef14c4.
2022-04-04 09:43:28 +02:00
Nicolas Patry
0eb7455fe5 Preparing 0.12 release. (#967)
* Preparing `0.12` release.

* Fix click version: https://github.com/psf/black/issues/2964
2022-03-31 11:06:33 +02:00
Nicolas Patry
a5f644616b Fix the error test for Python 3.10 (error message is different). (#962) 2022-03-23 10:35:58 +01:00
Nicolas Patry
cd730594e9 Fixing issue with ConvBert not being able to save because of holes in the vocab. (#954)
2022-03-21 19:28:49 +01:00
Kaito Sugimoto
1bb9884f45 Fixing the vocab size of the trained Unigram model (#952)
* Fixing the vocab size of the trained Unigram model

* add test for the vocab size of the trained Unigram model

* Revert "add test for the vocab size of the trained Unigram model"

This reverts commit fb8955c831b357d1037548ceaa8789734d544646.

* Fixing the vocab size of the trained Unigram model

* Format code

* Move the vocab-size calculation out of the loop
2022-03-18 18:13:17 +01:00
Nicolas Patry
daa4dd2288 Making the regex in ByteLevel optional. (#939)
* Making the regex in ByteLevel optional.

* Changed the stub.

* Better stub.

* Typo fix.

* Remove bad comments.
2022-03-18 09:03:20 +01:00
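
A minimal sketch of the optional regex, assuming the flag is exposed as use_regex on the Python ByteLevel pre-tokenizer:

```python
from tokenizers import pre_tokenizers

# Skip the GPT-2 splitting regex and only apply the byte-level mapping.
pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
```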
Nicolas Patry
cdabef14c4 Changing Decoder trait to be more composable. (#938)
* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.
2022-03-17 10:32:09 +01:00
Nicolas Patry
4b6055d4fb Adding pickling support for trainers (#949)
* TMP.

* Adding support for pickling Python trainers.

* Remove unwarranted files + missed naming updates.

* Stubbing.

* Making sure serialized format is written in python tests.
2022-03-14 12:18:11 +01:00
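
A quick sketch of what the pickling support enables, assuming a standard pickle round trip of a trainer:

```python
import pickle

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])

# Round-trip through pickle; the restored trainer keeps its configuration.
restored = pickle.loads(pickle.dumps(trainer))
```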
dctelus
71ae5421eb Python - add initial_alphabet to spm unigram trainer (#942)
* Python - add initial_alphabet to spm unigram trainer

* Python - use optional instead of mutable defaults in spm unigram trainer
2022-03-09 09:54:03 +01:00
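
A small sketch of the new argument, assuming it is accepted by the SentencePieceUnigramTokenizer training helpers:

```python
from tokenizers.implementations import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
# Seed training with an explicit initial alphabet.
tokenizer.train_from_iterator(
    ["some training text"],
    vocab_size=100,
    initial_alphabet=["a", "b", "c"],
)
```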
dctelus
98249dfb0f Python - add doctype to length in implementations spm unigram (#943) 2022-03-08 11:59:07 +01:00
dctelus
4a8f5db067 Python - Add length to train_from_iterator in implementations (#937) 2022-03-04 14:11:58 +01:00
Luc Georges
845da6d8e8 Feat/m1 manual build (#936)
* feat(bindings): move target compilation flags to correct config file

* feat(bindings): m1 build 'script'

* feat(ci): for loop in bdist_wheel script for m1

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-03-02 14:44:13 +01:00