120 Commits

Author SHA1 Message Date
888dd4bc65 pyo3: update to 0.20 (#1386)
Co-authored-by: Mike Lui <mikelui@meta.com>
2024-01-11 17:03:13 +01:00
efec086f35 get_added_tokens_decoder returns BTreeMap 2023-09-06 12:24:30 +00:00
f435af8b71 linting 2023-09-05 16:43:06 +00:00
c3fa75fa0e nits 2023-09-05 15:40:13 +00:00
08af8ea9c3 make tests happy 2023-09-05 15:37:09 +00:00
f1da83f358 add support for get_added_tokens_decoder 2023-09-05 14:49:29 +00:00
058e34b421 make special editable as well 2023-09-04 20:54:29 +00:00
c599db1421 nits 2023-09-04 19:11:19 +00:00
b117ac7f16 updates 2023-09-04 19:10:22 +00:00
a53dff9bc5 make content writable in python 2023-09-04 18:18:21 +00:00
39bd27e673 fix build 2023-09-01 21:22:07 +00:00
9f0c703f03 update init and src for bindings python 2023-09-01 21:07:01 +00:00
d2010d5165 Move to maturin mimicking move for safetensors. + Rewritten node bindings. (#1331)
* Move to maturin mimicking move for `safetensors`.

* Tmp.

* Fix sdist.

* Wat?

* Clippy 1.72

* Remove if.

* Conda sed.

* Fix doc check workflow.

* Moving to maturin AND removing http + openssl mess (smoothing transition
moving to `huggingface_hub`)

* Fix dep

* Black.

* New node bindings.

* Fix docs + node cache ?

* Yarn.

* Working dir.

* Extension module.

* Put back interpreter.

* Remove cache.

* New attempt

* Multi python.

* Remove FromPretrained.

* Remove traces of `fromPretrained`.

* Drop 3.12 for windows?

* Typo.

* Put back the default feature for ignoring links during simple test.

* Fix ?

* x86_64 -> x64.

* Remove warning for windows bindings.

* Exclude aarch.

* Include/exclude.

* Put back workflows in correct states.
2023-08-28 16:24:14 +02:00
d0bb35d5a6 Merge pull request #1316 from boyleconnor/add-expect-for-no-truncation
Add `expect()` for disabling truncation
2023-08-18 19:30:53 +02:00
540bf2eb01 pyo3: update to 0.19 (#1322)
* Bump pyo3 dependency versions

* Fix deprecation warnings from pyo3

---------

Co-authored-by: Mike Lui <mikelui@meta.com>
2023-08-16 18:40:32 +02:00
748556a9ed Fix code style 2023-08-07 15:17:43 -07:00
a0a8ebe03f Add expect() for disabling truncation 2023-08-06 13:25:50 -07:00
c2664ae13f Give error when initializing tokenizer with too high stride (#1306)
* Split `get_n_added_tokens` into separate method

* Modify `TokenizerImpl.with_truncation()` to raise an error if given bad parameters

* Return Python error if `tokenizer.with_truncation()` fails

* Add dummy variable assignment for `no_truncation()` case

* Unrelated fmt fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-28 09:16:44 +02:00
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content.

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
8129dd3309 pyo3: update to 0.17 (#1066)
* python: update bindings to edition 2021

* python: update to pyo3 0.17

* Updating testing.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-10-05 16:59:01 +02:00
519cc13be0 Upgrade pyo3 to 0.16 (#956)
* Upgrade pyo3 to 0.15

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Upgrade pyo3 to 0.16

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Install Python before running cargo clippy

* Fix clippy warnings

* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`

* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`

* Fix some test cases

Why do they change?

* Refactor and add SAFETY comments to `PyArrayUnicode`

Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`

Co-authored-by: messense <messense@icloud.com>
2022-05-05 15:48:40 +02:00
88d718207a tokenizer.save has the wrong arguments compared to documentation (#901)
* tokenizer.save has the wrong arguments compared to documentation

* Fixing doc of `save` function.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-15 17:55:55 +01:00
152880ab3e Adding truncation_side within TruncationParams. (#860)
* Add truncation to enable_truncation

* Fix typo

* Adding truncation_side within `TruncationParams`.

* Node serialization of this direction param.

* Update the test.

* Fixing warnings/lint.

* Adding stuff (can't local debug :( )

* Slow loop... ;(

* Stub.py.

Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>
2021-12-28 12:37:06 +01:00
b8b584d4e5 Python - Pretty json saving defaults to true (#793)
* Python - Pretty json saving defaults to true

* Update changelog
2021-09-02 08:43:54 -04:00
6f9e867330 Better export for FromPretrainedParameters 2021-08-31 09:00:05 -04:00
e44fdee4a1 Python - Add bindings to Tokenizer.from_pretrained 2021-08-31 09:00:05 -04:00
56a9196030 Fix clippy warnings 2021-03-16 12:32:06 -04:00
817c5ad317 Fix clippy warnings for rust 1.49 2021-01-06 15:03:33 -05:00
5938a12b3f Python - Improve training with iterators 2021-01-06 11:38:43 -05:00
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
999067454d Make sure we first try to extract a string 2020-11-28 12:29:35 -05:00
c36ac0bfdf Improve progress tracking while training 2020-11-28 12:29:35 -05:00
75deaecdd0 Also accept iterators of batches in train_from_iterator 2020-11-28 12:29:35 -05:00
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
7f3cfebf45 Python - PyModel uses a RwLock to allow modifications 2020-11-27 17:35:34 -05:00
5059be1a8d Test BPE keeping its options after training 2020-11-20 13:30:44 -05:00
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except the vocabulary, when it is trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
a86d49634c Doc - API Reference for most Tokenizer methods/attributes 2020-11-02 17:07:27 -05:00
8c0370657e Doc - Update API Reference on more Tokenizer methods 2020-11-02 17:07:27 -05:00
ddabe130cd Doc - Updated API Reference for AddedToken 2020-11-02 17:07:27 -05:00
79f02bb7f0 Doc - Updated API Reference for encode/encode_batch 2020-11-02 17:07:27 -05:00
3ee54766e3 Doc - Backbone for API Reference 2020-11-02 17:07:27 -05:00
180371d929 Fixing hanging error while acquiring GIL from custom pretokenizer during training. (#470)
* Fixing hanging error while acquiring GIL from custom pretokenizer
during training.

Fixes #469

* cleanup

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-10-20 14:23:39 -04:00
8d04b22278 Python - Add support for custom Normalizer 2020-09-23 15:50:01 -04:00