* Testing ABI3 wheels to reduce the number of wheels.
* No need for py-clone anymore.
* Upgrade Python versions.
* Remove those flakes.
* Promoting new CI + Fixing secret.
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, as it needs to be too specific to live as a standalone crate.
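To make the effect concrete, here is a minimal sketch of what this enables from the Python side (only assuming that `repr()` now returns something readable; the exact printed format is illustrative):

```python
from tokenizers.normalizers import Lowercase

# With serde-backed __str__/__repr__, bound components print a readable
# description instead of an opaque "<... object at 0x...>" default.
print(repr(Lowercase()))
```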
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
* Implement a simple max_sentencepiece_length in BPE
Add a way for the BPE trainer to behave like the Unigram trainer, where tokens longer than a certain length (default 16 in SentencePiece) are skipped. This is implemented in the Unigram trainer, but in a different way.
If this code were to be actually integrated, some work remains to be done:
- Documentation describing the behavior and how it should be set.
- Set the default to 0 so it doesn't act unless set.
- Provide ways in the Python bindings for the user to set the max token length (a usage sketch follows below).
I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.
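As referenced above, a minimal sketch of the eventual Python-side usage, assuming the binding exposes the cap as `max_token_length` (the name later commits settle on) and that leaving it unset disables the behavior:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Cap merged tokens at 16 characters, mirroring SentencePiece's
# max_sentencepiece_length default; unset means no cap.
trainer = BpeTrainer(vocab_size=100, max_token_length=16)

tokenizer = Tokenizer(BPE())
tokenizer.train_from_iterator(["aaaaaaaaaaaaaaaaaaaaaaaa"] * 100, trainer=trainer)
assert all(len(token) <= 16 for token in tokenizer.get_vocab())
```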
* Utilize `Option<u16>` for safer code.
* Other version.
* Update trainer.rs
Clarify with type usize; propagate the max_length option.
* Change max_length to a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function. Since the underlying concept in BPE is merges, using max_merge_length as the variable name could prove more descriptive.
* Change the variable name in trainer.rs
Rename max_merge_length to max_token_length.
* Update trainer.rs
Add several max_token_length declarations that were missing in `impl BpeTrainerBuilder` and `struct BpeTrainer`. Add an explanation for the variable shadowing.
* Update trainer.rs
Move the default definition of max_token_length to the proper location and adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
Clarify the test documentation.
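A condensed sketch of what such a direct assert can look like (illustrative only; the real tests live in the Rust and Python test suites and may differ):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

def test_max_token_length():
    # "abcabcabc" could otherwise merge into a single 9-character token;
    # with max_token_length=2, no trained token may exceed 2 characters.
    trainer = BpeTrainer(vocab_size=40, max_token_length=2)
    tokenizer = Tokenizer(BPE())
    tokenizer.train_from_iterator(["abcabcabc"] * 50, trainer=trainer)
    assert max(len(token) for token in tokenizer.get_vocab()) <= 2
```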
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Upgrade pyo3 to 0.15
Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
* Upgrade pyo3 to 0.16
Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
* Install Python before running cargo clippy
* Fix clippy warnings
* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`
* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`
* Fix some test cases
Why do they change?
* Refactor and add SAFETY comments to `PyArrayUnicode`
Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`
Co-authored-by: messense <messense@icloud.com>
* TMP.
* Adding support for pickling Python trainers.
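A minimal sketch of what this enables (the trainer arguments are illustrative):

```python
import pickle
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
# Trainers now implement the pickle protocol, so this round-trips.
restored = pickle.loads(pickle.dumps(trainer))
```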
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
This lets us keep everything that was set on the model, except the vocabulary, when trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
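A hedged sketch of the behavior being pinned down by those tests (the JSON field layout is assumed from the serialized format):

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["hello world"] * 10, trainer=trainer)

# The configured unk_token must survive training in the serialized model.
assert json.loads(tokenizer.to_str())["model"]["unk_token"] == "[UNK]"
```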
* First pass on automatically stubbing our Python files.
* And now modifying all Rust docs to be visible in .pyi files.
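Illustrative shape of a generated stub entry (hypothetical excerpt; the real signature and docstring are generated from the Rust sources):

```python
# tokenizers/__init__.pyi (illustrative excerpt, not the actual generated file)
class Tokenizer:
    def encode(self, sequence, pair=None, is_pretokenized=False, add_special_tokens=True):
        """Encode the given sequence and pair of sequences.

        Docstring propagated from the Rust source so it is visible in .pyi stubs.
        """
        ...
```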
* Better assert fail message.
* Fixing github workflow.
* Removing types not exported anymore.
* Fixing `Tokenizer` signature.
* Disabling auto __init__.py.
* Re-enabling some types.
* Don't overwrite non-automated __init__.py
* Automated most __init__.py
* Restubbing after rebase.
* Fixing env for tests.
* Install black in the env.
* Use PY35 target in stub.py
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
* Implement changes necessary from generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.