* initial commit
* support None (see the sketch after this commit list)
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node, be happy, I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
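The "support None" and "propagate" commits above let pipeline components be unset from the bindings. A minimal sketch of the resulting behavior, assuming this PR is the one that allows assigning None from Python:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

# Components can now be cleared instead of only replaced.
tokenizer.pre_tokenizer = None
tokenizer.normalizer = None
tokenizer.decoder = None
tokenizer.post_processor = None
```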
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, since it needs to be too specific.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
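A hedged sketch of what the serde_pyo3-backed printing enables; the exact layout of the output is up to the serializer, so only the calls are shown:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
print(str(tokenizer))   # __str__: serde-derived view of the full pipeline
print(repr(tokenizer))  # __repr__: built from the same serialization
```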
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* remove enforcement of non special when adding tokens (sketch after this PR's commits)
* mut no longer needed
* add a small test
* nit
* style
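A small sketch of what lifting that enforcement allows, assuming `add_tokens` previously forced `special=False` on whatever was passed in:

```python
from tokenizers import AddedToken, Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
# The special flag on the AddedToken is now respected by add_tokens.
tokenizer.add_tokens([AddedToken("<custom>", special=True)])
```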
* audit
* ignore cargo audit's own vulnerability
* update
* revert
* remove CVE
* version = "0.15.3-dev-0”
Improve the performance of Metaspace, but also just fix it (see the timing sketch after this PR's commits).
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
* well what do we have
* nit
* be backward compatible (BC) with non-legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
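A hypothetical re-creation of the timing loop shown in the commit body above (`../scripts/gemma-dummy.py` is not part of this snapshot, so the script, model name, and input text are assumptions):

```python
import time

from tokenizers import Tokenizer

# Assumption: any tokenizer whose pre_tokenizer is Metaspace will do.
tokenizer = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

timings = []
for repeats in (10, 100, 1_000, 10_000, 100_000):
    text = "inform . Hey . " * repeats
    start = time.perf_counter()
    tokenizer.encode(text)
    timings.append(time.perf_counter() - start)
print(timings)  # compare before/after the Metaspace fix
```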
* add doc in the code
* add option to skip special tokens (sketch after this PR's commits)
* nits
* add api dummy for now
* Fmt.
* Fix fmt.
* Fix the stub.
* add a test
* add a test in python
* style it
* nits
* add getter and setters
* stub
* update python test
* fmt
* last nit
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
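A minimal sketch of the option at work, assuming it is the usual skip-special-tokens switch on the decoding path:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(["<s>", "</s>"])
tokenizer.add_tokens(["hello"])

ids = [tokenizer.token_to_id(t) for t in ("<s>", "hello", "</s>")]
print(tokenizer.decode(ids))                             # "hello": specials skipped by default
print(tokenizer.decode(ids, skip_special_tokens=False))  # "<s> hello </s>"
```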
* nits
* allow for legacy behaviour without making any breaking changes (sketch after this PR's commits)
* add a todo
* set to legacy by default
* skip legacy serialization
* push correct update
* lint
* add deserialization test
* add a python test as well
* updates
* fix serialization tests
* nits
* python styling of the tests
* better tests
* fix offsets
* fix imports
* fmt
* update metaspace
* remove TODO
* use enum
* fix some tests
* nits
* use enum
* update tests
* styling
* remove `impl From` for PrependScheme
* use simple getters and setters
* lint
* update tests
* add test new == new_with_prepend_scheme
* revert a change
* use setters and getters
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* nits
* use copy rather than ref
* nits format
* more nits
* allow option string
* enforce camel-cased First/Never/Always variants
* nits
* refactor
* update test as well
* fmt
* nits
* properly error out
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* suggestion changes
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
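A short sketch of the resulting Python surface, assuming the camel-cased First/Never/Always variants map to these lowercase strings in the binding:

```python
from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace(replacement="▁", prepend_scheme="first")
print(pre.pre_tokenize_str("Hey my friend"))

# prepend_scheme is exposed through plain getters and setters:
pre.prepend_scheme = "never"
```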
* implement a simple max_sentencepiece_length into BPE
Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.
If this code were to be actually integrated, some work remains to be done:
Documentation describing the behavior and how it should be set.
Set default==0 so it doesn't act unless set.
Provide ways in the python binding for the user to set max token length (see the sketch after this PR's commits).
I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.
* utilize Option<u16> for safer code.
* Other version.
* Update trainer.rs
clarify with type usize; propagate max_length option
* change max_length into a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers),
UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
* change variable name in trainer.rs
change max_merge_length into max_token_length
* Update trainer.rs
add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer
Add explanation for variable shadowing.
* Update trainer.rs
Move default definition of max_token_length to the proper location. Adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
clarified test documentation
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
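A hedged sketch of the knob this PR adds: with max_token_length set on the BPE trainer, merges that would create a token longer than the limit are skipped during training.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]"],
    max_token_length=16,  # roughly SPM's default max piece length
)
tokenizer.train_from_iterator(["some training sentences go here"], trainer=trainer)
```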
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.
- Note: added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no issue between Fuse and ByteFallback.
Since both are new, the chance of backward incompatibility is low.
* Fixing pickling/unpickling (using default args).
* Stub.
* Black.
* Fixing node.
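A sketch of the decoder chain described above, using "_" as the replacement character: Replace turns it back into spaces, Fuse joins the tokens into one string, and Strip drops a single leading character.

```python
from tokenizers.decoders import Fuse, Replace, Sequence, Strip

decoder = Sequence([Replace("_", " "), Fuse(), Strip(" ", 1, 0)])
print(decoder.decode(["_Hey", "_my", "_friend"]))  # "Hey my friend"
```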
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into string whenever possible, showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`).
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
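A hedged sketch of the two halves of the feature: `BPE(byte_fallback=True)` makes the model emit `<0xAB>`-style byte tokens instead of unk, and the new decoder folds them back into text (or U+FFFD when the bytes are not valid UTF-8):

```python
from tokenizers.decoders import ByteFallback

decoder = ByteFallback()
print(decoder.decode(["<0x61>", "<0x62>", "c"]))  # "abc"
```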
* Fixing the vocab size of the trained Unigram model
* add test for the vocab size of the trained Unigram model
* Revert "add test for the vocab size of the trained Unigram model"
This reverts commit fb8955c831b357d1037548ceaa8789734d544646.
* Fixing the vocab size of the trained Unigram model
* format code
* get the position of vocab-size calculation out of loop
* TMP.
* Adding support for pickling Python trainers.
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
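A minimal sketch of the round-trip this enables, presumably going through the same serialized format the Python tests now check:

```python
import pickle

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
restored = pickle.loads(pickle.dumps(trainer))
assert isinstance(restored, BpeTrainer)
```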
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
* add test that verifies an exception is raised for the missing unknown token
* style
* add test tokens
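A hedged sketch, assuming the new unk_token keyword lands on the train methods of the implementation class:

```python
from tokenizers.implementations import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(
    ["a few training sentences", "to give the trainer something to fit"],
    vocab_size=30,
    special_tokens=["<unk>"],
    unk_token="<unk>",  # assumption: without this, encoding unseen characters raises
)
```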
* Rust - add a CTCDecoder as a separate mod
* Adding bindings to Node + Python.
* Clippy update.
* Stub.
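A sketch of the new decoder from the Python side, assuming the wav2vec2-style defaults shown here:

```python
from tokenizers.decoders import CTC

decoder = CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)
tokens = ["<pad>", "h", "h", "e", "l", "l", "<pad>", "l", "o", "|", "h", "i"]
print(decoder.decode(tokens))  # "hello hi": duplicates collapsed, pads dropped
```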
* Fixing roberta.json URLs.
* Moving test files to hf.co.
* Update cargo check and clippy to 1.52.
* Inner ':' is actually used for domains in sphinx.
Making `domain` work correctly was just too much work, so I went the easy
way and used global roles for the custom rust extension.
* Update struct naming and docs
* Update changelog
Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>