Commit Graph

102 Commits

Author SHA1 Message Date
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content.

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
8129dd3309 pyo3: update to 0.17 (#1066)
* python: update bindings to edition 2021

* python: update to pyo3 0.17

* Updating testing.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-10-05 16:59:01 +02:00
519cc13be0 Upgrade pyo3 to 0.16 (#956)
* Upgrade pyo3 to 0.15

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Upgrade pyo3 to 0.16

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Install Python before running cargo clippy

* Fix clippy warnings

* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`

* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`

* Fix some test cases

Why do they change?

* Refactor and add SAFETY comments to `PyArrayUnicode`

Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`

Co-authored-by: messense <messense@icloud.com>
2022-05-05 15:48:40 +02:00
88d718207a tokenizer.save has the wrong arguments compared to documentation (#901)
* tokenizer.save has the wrong arguments compared to documentation

* Fixing doc of `save` function.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-15 17:55:55 +01:00
152880ab3e Adding truncation_side within TruncationParams. (#860)
* Add truncation to enable_truncation

* Fix typo

* Adding truncation_side within `TruncationParams`.

* Node serialization of this direction param.

* Update the test.

* Fixing warnings/lint.

* Adding stuff (can't local debug :( )

* Slow loop... ;(

* Stub.py.

Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>
2021-12-28 12:37:06 +01:00
b8b584d4e5 Python - Pretty json saving defaults to true (#793)
* Python - Pretty json saving defaults to true

* Update changelog
2021-09-02 08:43:54 -04:00
6f9e867330 Better export for FromPretrainedParameters 2021-08-31 09:00:05 -04:00
e44fdee4a1 Python - Add bindings to Tokenizer.from_pretrained 2021-08-31 09:00:05 -04:00
56a9196030 Fix clippy warnings 2021-03-16 12:32:06 -04:00
817c5ad317 Fix clippy warnings for rust 1.49 2021-01-06 15:03:33 -05:00
5938a12b3f Python - Improve training with iterators 2021-01-06 11:38:43 -05:00
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
999067454d Make sure we first try to extract a string 2020-11-28 12:29:35 -05:00
c36ac0bfdf Improve progress tracking while training 2020-11-28 12:29:35 -05:00
75deaecdd0 Also accept iterators of batches in train_from_iterator 2020-11-28 12:29:35 -05:00
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
7f3cfebf45 Python - PyModel uses a RwLock to allow modifications 2020-11-27 17:35:34 -05:00
5059be1a8d Test BPE keeping its options after training 2020-11-20 13:30:44 -05:00
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except the vocabulary, when it is trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
a86d49634c Doc - API Reference for most Tokenizer methods/attributes 2020-11-02 17:07:27 -05:00
8c0370657e Doc - Update API Reference on more Tokenizer methods 2020-11-02 17:07:27 -05:00
ddabe130cd Doc - Updated API Reference for AddedToken 2020-11-02 17:07:27 -05:00
79f02bb7f0 Doc - Updated API Reference for encode/encode_batch 2020-11-02 17:07:27 -05:00
3ee54766e3 Doc - Backbone for API Reference 2020-11-02 17:07:27 -05:00
180371d929 Fixing hanging error while acquiring GIL from custom pretokenizer during training. (#470)
* Fixing hanging error while acquiring GIL from custom pretokenizer during training.

Fixes #469

* cleanup

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-10-20 14:23:39 -04:00
8d04b22278 Python - Add support for custom Normalizer 2020-09-23 15:50:01 -04:00
940f8bd8fa Update PyO3 (#426) 2020-09-22 12:00:20 -04:00
52082b5476 New clippy comments? 2020-09-02 16:32:50 +02:00
bd8dac202c Add failing test for from_file 2020-09-01 09:53:50 -04:00
3d1322f108 Python - Improve and Test EncodeInput extraction 2020-08-21 18:39:49 -04:00
14adf18e5b Python - Extract single pre-tokenized inputs from np.array 2020-08-21 18:39:49 -04:00
d919d68889 Python - InputSequence with references when possible 2020-08-21 18:39:49 -04:00
504d8c85d8 Remove Tokenizer::normalize
This is actually a legacy function that doesn't really make sense now, and is getting really difficult to keep. So we remove it.
2020-08-19 12:42:12 -04:00
f92c9955e7 Python - Update bindings 2020-08-19 12:42:12 -04:00
10a39ba6b4 Add in-place train. 2020-08-04 15:59:33 -04:00
16f75d9efc Ensure serialization works in all expected ways. 2020-08-04 15:59:33 -04:00
aaf8e932b1 Remove Send + Sync requirements from Model. 2020-08-04 15:59:33 -04:00
42b810488f Hide generics 2020-08-04 15:59:33 -04:00
d62adf7195 Remove Container, changes to PyDecoder, cloneable Tokenizer.
* derive Clone on Tokenizer and AddedVocabulary.
* Replace Container with Arc wrapper for Decoders.
* Prefix Rust Decoder types with Py.
* Rename PyDecoder to CustomDecoder.
* Change panic in serializing custom decoder to exception.
* Re-enable training with cloneable Tokenizer.
* Remove unsound Container, use Arc wrappers instead.
2020-08-04 15:59:33 -04:00
11e86a16c5 Remove Container from PostProcessors, replace with Arc.
* prefix the Python types in Rust with Py.
* remove unsound Container wrappers, replace with Arc.
2020-08-04 15:59:33 -04:00
b411443128 Remove Container from PreTokenizers, replace with Arc.
* prefix the Python types in Rust with Py, rename PyPretokenizer
  to CustomPretokenizer
* remove unsound Container wrappers, replace with Arc
* change panic on trying to (de-)serialize custom pretokenizer to
  exception
2020-08-04 15:59:33 -04:00
08b8c48127 Remove Container from Normalizers, replace with Arc.
* prefix the Python types in Rust with Py
* remove unsound Container wrappers, replace with Arc
2020-08-04 15:59:33 -04:00
83a52c8080 Replace Model and Trainer Containers.
* Implement changes necessary from generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be
  derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
2020-08-04 15:59:33 -04:00
27e326ab2b Fix deadlocks with custom python components. 2020-08-03 16:17:17 -04:00