tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-23 00:35:35 +00:00

Author	SHA1	Message	Date
Funtowicz Morgan	b4fcc9ce6e	Makes `decode` and `decode_batch` work on borrowed content. (#1251) * Makes `decode` and `decode_batch` work on borrowed content. * Make `decode_batch` work with borrowed content. * Fix lint. * Attempt to map it into Node. * Second attempt. * Step by step. * One more step. * Fix lint. * Please ... * Removing collect. * Revert "Removing collect." This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-05-17 11:18:15 +02:00
mert-kurttutan	5c18ec5ff5	pyo3 v0.18 migration (#1173) * pyo v0.18 migration * Fix formatting issues of black	2023-03-08 11:27:47 +01:00
David Hewitt	8129dd3309	pyo3: update to 0.17 (#1066) * python: update bindings to edition 2021 * python: update to pyo3 0.17 * Updating testing. Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-10-05 16:59:01 +02:00
h-vetinari	519cc13be0	Upgrade pyo3 to 0.16 (#956) * Upgrade pyo3 to 0.15 Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com> * Upgrade pyo3 to 0.16 Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com> * Install Python before running cargo clippy * Fix clippy warnings * Use `PyArray_Check` instead of downcasting to `PyArray1<u8>` * Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features` * Fix some test cases Why do they change? * Refactor and add SAFETY comments to `PyArrayUnicode` Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData` Co-authored-by: messense <messense@icloud.com>	2022-05-05 15:48:40 +02:00
Thomas Wang	88d718207a	tokenizer.save has the wrong arguments compared to documentation (#901) * tokenizer.save has the wrong arguments compared to documentation * Fixing doc of `save` function. Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-02-15 17:55:55 +01:00
Nicolas Patry	152880ab3e	Adding truncation_side within `TruncationParams`. (#860) * Add truncation to enable_truncation * Fix typo * Adding truncation_side within `TruncationParams`. * Node serialization of this direction param. * Update the test. * Fixing warnings/lint. * Adding stuff (can't local debug :( ) * Slow loop... ;( * Stub.py. Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>	2021-12-28 12:37:06 +01:00
Anthony MOI	b8b584d4e5	Python - Pretty json saving defaults to true (#793) * Python - Pretty json saving defaults to true * Update changelog	2021-09-02 08:43:54 -04:00
Anthony Moi	6f9e867330	Better export for FromPretrainedParameters	2021-08-31 09:00:05 -04:00
Anthony Moi	e44fdee4a1	Python - Add bindings to Tokenizer.from_pretrained	2021-08-31 09:00:05 -04:00
Anthony MOI	56a9196030	Fix clippy warnings	2021-03-16 12:32:06 -04:00
Anthony MOI	817c5ad317	Fix clippy warnings for rust 1.49	2021-01-06 15:03:33 -05:00
Anthony MOI	5938a12b3f	Python - Improve training with iterators	2021-01-06 11:38:43 -05:00
Anthony MOI	3a8627ce4d	Improve docs and fix tests around training	2020-11-28 12:29:35 -05:00
Anthony MOI	999067454d	Make sure we first try to extract a string	2020-11-28 12:29:35 -05:00
Anthony MOI	c36ac0bfdf	Improve progress tracking while training	2020-11-28 12:29:35 -05:00
Anthony MOI	75deaecdd0	Also accept iterators of batches in train_from_iterator	2020-11-28 12:29:35 -05:00
Anthony MOI	e0a70f1fb2	Add ability to train from Iterator	2020-11-28 12:29:35 -05:00
Anthony MOI	a351d1c604	Python - Trainers can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	c22cfc31f9	Python - PyNormalizer & PyPreTokenizer use a RwLock	2020-11-27 17:35:34 -05:00
Anthony MOI	7f3cfebf45	Python - PyModel uses a RwLock to allow modifications	2020-11-27 17:35:34 -05:00
Anthony MOI	5059be1a8d	Test BPE keeping its options after training	2020-11-20 13:30:44 -05:00
Anthony MOI	284a1dbee7	PyModel uses a RwLock to allow modifications	2020-11-20 13:30:44 -05:00
Anthony MOI	54c7210b2f	Train Model in place This let us keep everything that was set on the model except from the vocabulary when trained. For example, this let us keep the configured `unk_token` of BPE when its trained.	2020-11-20 13:30:44 -05:00
Anthony MOI	224862fe0c	Python - Make the trainer optional on Tokenizer.train	2020-11-20 13:30:44 -05:00
Nicolas Patry	352c92ad33	Automatically stubbing the `pyi` files while keeping inspecting ability (#509) * First pass on automatic stubbing our python files. * And now modifying all rust docs to be visible in Pyi files. * Better assert fail message. * Fixing github workflow. * Removing types not exported anymore. * Fixing `Tokenizer` signature. * Disabling auto __init__.py. * Re-enabling some types. * Don't overwrite non automated __init__.py * Automated most __init__.py * Restubbing after rebase. * Fixing env for tests. * Install blakc in the env. * Use PY35 target in stub.py Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-17 15:13:00 -05:00
Anthony MOI	a86d49634c	Doc - API Reference for most Tokenizer methods/attributes	2020-11-02 17:07:27 -05:00
Anthony MOI	8c0370657e	Doc - Update API Reference on more Tokenizer methods	2020-11-02 17:07:27 -05:00
Anthony MOI	ddabe130cd	Doc - Updated API Reference for AddedToken	2020-11-02 17:07:27 -05:00
Anthony MOI	79f02bb7f0	Doc - Updated API Reference for encode/encode_batch	2020-11-02 17:07:27 -05:00
Anthony MOI	3ee54766e3	Doc - Backbone for API Reference	2020-11-02 17:07:27 -05:00
Nicolas Patry	180371d929	Fixing hanging error while acquiring GIL from custom pretokenizer during training. (#470) * Fixing hanging error while acquiring GIL from custom pretokenizer during training. Fixes #469 * cleanup Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-10-20 14:23:39 -04:00
Anthony MOI	8d04b22278	Python - Add support for custom Normalizer	2020-09-23 15:50:01 -04:00
Anthony MOI	940f8bd8fa	Update PyO3 (#426)	2020-09-22 12:00:20 -04:00
Nicolas Patry	52082b5476	New clippy comments?	2020-09-02 16:32:50 +02:00
Anthony MOI	bd8dac202c	Add failing test for from_file	2020-09-01 09:53:50 -04:00
Anthony MOI	3d1322f108	Python - Improve and Test EncodeInput extraction	2020-08-21 18:39:49 -04:00
Anthony MOI	14adf18e5b	Python - Extract single pre-tokenized inputs from np.array	2020-08-21 18:39:49 -04:00
Anthony MOI	d919d68889	Python - InputSequence with references when possible	2020-08-21 18:39:49 -04:00
Anthony MOI	504d8c85d8	Remove Tokenizer::normalize This is actually a legacy function that doesn't really make sense now, and is getting really difficult to keep. So we remove it.	2020-08-19 12:42:12 -04:00
Anthony MOI	f92c9955e7	Python - Update bindings	2020-08-19 12:42:12 -04:00
Sebastian Pütz	10a39ba6b4	Add in-place train.	2020-08-04 15:59:33 -04:00
Sebastian Puetz	16f75d9efc	Ensure serialization works in all expected ways.	2020-08-04 15:59:33 -04:00
Sebastian Puetz	aaf8e932b1	Remove Send + Sync requirements from Model.	2020-08-04 15:59:33 -04:00
Sebastian Puetz	42b810488f	Hide generics	2020-08-04 15:59:33 -04:00
Sebastian Pütz	d62adf7195	Remove Container, changes to PyDecoder, cloneable Tokenizer. * derive Clone on Tokenizer and AddedVocabulary. * Replace Container with Arc wrapper for Decoders. * Prefix Rust Decoder types with Py. * Rename PyDecoder to CustomDecoder. * Change panic in serializing custom decoder to exception. * Re-enable training with cloneable Tokenizer. * Remove unsound Container, use Arc wrappers instead.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	11e86a16c5	Remove Container from PostProcessors, replace with Arc. * prefix the Python types in Rust with Py. * remove unsound Container wrappers, replace with Arc.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	b411443128	Remove Container from PreTokenizers, replace with Arc. * prefix the Python types in Rust with Py, rename PyPretokenizer to CustomPretokenizer * remove unsound Container wrappers, replace with Arc * change panic on trying to (de-)serialize custom pretokenizer to exception	2020-08-04 15:59:33 -04:00
Sebastian Pütz	08b8c48127	Remove Container from Normalizers, replace with Arc. * prefix the Python types in Rust with Py * remove unsound Container wrappers, replace with Arc	2020-08-04 15:59:33 -04:00
Sebastian Pütz	83a52c8080	Replace Model and Trainer Containers. * Implement changes necessary from generic Model in Tokenizer. * Temporarily disable training in Python since Clone can't be derived for Model until all components have been replaced. * Prefix Python types in Rust with Py.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	27e326ab2b	Fix deadlocks with custom python components.	2020-08-03 16:17:17 -04:00


102 Commits