* implement a simple max_sentencepiece_length into BPE
Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is already implemented in the unigram trainer, but in a different way.
If this code were to be actually integrated, some work would still need to be done:
Documentation describing the behavior and how it should be set.
Set the default to 0 so it doesn't act unless explicitly set.
Provide a way in the Python bindings for the user to set the max token length.
I tried to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it is very difficult, and regexes can be really slow when operating on the whole training corpus.
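A minimal sketch of how this could look from Python once exposed, assuming a `max_token_length` keyword on `BpeTrainer`; the corpus file name is illustrative.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# 16 mirrors SentencePiece's default max_sentencepiece_length; leaving the
# option unset would keep the previous, unbounded behavior.
trainer = trainers.BpeTrainer(vocab_size=5000, max_token_length=16)
tokenizer.train(["corpus.txt"], trainer=trainer)
```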
* utilize Option<u16> for safer code.
* Other version.
* Update trainer.rs
Clarify the type as usize and propagate the max_length option.
* change max_length into a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
* change variable name in trainer.rs
change max_merge_length into max_token_length
* Update trainer.rs
add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer
Add an explanation for the variable shadowing.
* Update trainer.rs
Move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
clarified test documentation
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.
- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no issue between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.
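A hedged usage sketch of that combination through the Python bindings; the exact `Strip` argument names are assumed here, and the ▁ marker stands in for the `Metaspace` prefix.

```python
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.Replace("▁", " "),          # turn the metaspace marker back into spaces
    decoders.Fuse(),                      # concatenate all tokens into one string
    decoders.Strip(content=" ", left=1),  # drop the single leading space (argument names assumed)
])
print(decoder.decode(["▁Hello", "▁there"]))  # expected: "Hello there"
```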
* Fixing pickling/unpickling (using default args).
* Stub.
* Black.
* Fixing node.
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into a string whenever possible, showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`).
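A short sketch of how the two pieces fit together in the Python bindings; a real setup would also need the `<0x..>` byte tokens present in the vocabulary.

```python
from tokenizers import Tokenizer, models, decoders

# byte_fallback=True makes the model emit <0x..> byte tokens (when they exist
# in the vocab) instead of the unk token for unknown characters.
tokenizer = Tokenizer(models.BPE(byte_fallback=True))

# The ByteFallback decoder turns those byte tokens back into text, showing �
# whenever the recovered bytes are not valid UTF-8.
tokenizer.decoder = decoders.ByteFallback()
```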
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
* Fixing the vocab size of the trained Unigram model
* add test for the vocab size of the trained Unigram model
* Revert "add test for the vocab size of the trained Unigram model"
This reverts commit fb8955c831b357d1037548ceaa8789734d544646.
* Fixing the vocab size of the trained Unigram model
* format codes
* move the vocab-size calculation out of the loop
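For reference, a hedged sketch of the behavior being fixed: after training, the Unigram model's vocabulary is expected to match the requested `vocab_size` (file name and size are illustrative, and a sufficiently large corpus is assumed).

```python
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer(
    vocab_size=1000,               # the size the trained model should end up with
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
tokenizer.train(["corpus.txt"], trainer=trainer)
print(tokenizer.get_vocab_size())  # expected to match vocab_size after the fix
```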
* TMP.
* Adding support for pickling Python trainers.
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
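A small sketch of what the pickling support enables for the Python trainer objects; the specific trainer and arguments are just examples.

```python
import pickle
from tokenizers import trainers

trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
restored = pickle.loads(pickle.dumps(trainer))  # round-trips through pickle
print(type(restored))
```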
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
* add a test that verifies an exception is raised for a missing unknown token
* style
* add test tokens
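A hedged sketch of specifying the unknown token; placing the keyword on `train()` (rather than the constructor) is an assumption, and the training file is illustrative.

```python
from tokenizers.implementations import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    ["corpus.txt"],            # illustrative training file
    vocab_size=8000,
    special_tokens=["<unk>"],
    unk_token="<unk>",         # the unknown token to register on the model
)
```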
* Rust - add a CTCDecoder as a separate mod
* Adding bindings to Node + Python.
* Clippy update.
* Stub.
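A brief usage sketch of the CTC decoder through the Python bindings; the pad and word-delimiter tokens shown are the usual conventions, not guaranteed defaults.

```python
from tokenizers import decoders

decoder = decoders.CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)
# CTC decoding collapses repeated tokens, drops the pad token, and turns the
# word delimiter into a space.
print(decoder.decode(["h", "h", "<pad>", "e", "l", "l", "<pad>", "l", "o"]))  # expected: "hello"
```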
* Fixing roberta.json URLs.
* Moving test files to hf.co.
* Update cargo check and clippy to 1.52.
* Inner ':' is actually used for domains in sphinx.
Making `domain` work correctly was just too much work, so I went the easy
way and used global roles for the custom Rust extension.
* Update struct naming and docs
* Update changelog
Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
* start playing around
* make a first version
* refactor
* apply make format
* add python bindings
* add some python binding tests
* correct pre-tokenizers
* update auto-generated bindings
* lint python bindings
* add code node
* add split to docs
* refactor python binding a bit
* cargo fmt
* clippy and fmt in node
* quick updates and fixes
* Oops
* Update node typings
* Update changelog
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
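A hedged usage sketch of the `Split` pre-tokenizer this series adds, via the Python bindings; the pattern and behavior values are illustrative.

```python
from tokenizers import pre_tokenizers

# Split on a single space and drop the matched separator from the output.
pre_tok = pre_tokenizers.Split(pattern=" ", behavior="removed", invert=False)
print(pre_tok.pre_tokenize_str("Hello there friend"))
# expected: [("Hello", (0, 5)), ("there", (6, 11)), ("friend", (12, 18))]
```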