46 Commits

Author SHA1 Message Date
f4c9fd7f40 Testing ABI3 wheels to reduce number of wheels (#1674)
* Testing ABI3 wheels to reduce number of wheels

* No need for py-clone anymore.

* Upgrade python versions.

* Remove those flakes.

* Promoting new CI + Fixing secret.
2024-11-15 06:02:22 +01:00
6ade8c2d21 PyO3 0.22 (#1665)
* PyO3 0.22

* Fix python stubs

* Remove name arg from PyModel::save Python signature

---------

Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com>
2024-11-01 10:17:23 +01:00
ab9c7ded8b Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588)
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.

* Putting it within tokenizers, it needs to be too specific.

* Clippy is our friend.

* Ruff.

* Update the tests.

* Pretty sure this is wrong (#1589)

* Adding support for ellipsis.

* Fmt.

* Ruff.

* Fixing tokenizer.

---------

Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
2024-08-07 12:08:29 +02:00
a010f6b75c Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc Using serde (serde_pyo3) to get __str__ and __repr__ easily. 2024-08-02 18:41:54 +02:00
d5a8cc7a49 PyO3 0.21. (#1494)
* PyO3 0.21.

* Upgraded everything.

* Rustfmt.
2024-04-16 13:49:52 +02:00
4a8105c366 Convert word counts to u64 (#1433)
* Convert word counts to u64

* More spots needed to compile
2024-02-06 03:39:12 +01:00
08af8ea9c3 make tests happy 2023-09-05 15:37:09 +00:00
d9829cdc6e fix more tests 2023-09-04 17:22:27 +00:00
39bd27e673 fix build 2023-09-01 21:22:07 +00:00
540bf2eb01 pyo3: update to 0.19 (#1322)
* Bump pyo3 dependency versions

* Fix deprecation warnings from pyo3

---------

Co-authored-by: Mike Lui <mikelui@meta.com>
2023-08-16 18:40:32 +02:00
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Documentation describing the behavior and how it should be set.
Set default==0 so it doesn't act unless set.
Provide ways in the Python binding for the user to set the max token length.

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

Clarify with type usize; propagate max_length option.

* change max_length into more descriptive name

In the documentation
https://huggingface.co/docs/tokenizers/api/trainers
UnigramTrainer uses max_piece_length for a similar function.
Since the underlying concept in BPE is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

Add several max_token_length declarations that were missing.
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move default definition of max_token_length to proper location. adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
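
The `max_token_length` behavior described in cefc41e8ec (skip tokens longer than a configured length during BPE training) can be illustrated with a minimal pure-Python sketch. This is a toy trainer written for illustration, not the library's Rust implementation: the function name `train_bpe`, the use of `None` to disable the limit, and measuring length in characters are all assumptions.

```python
from collections import Counter


def train_bpe(word_counts, num_merges, max_token_length=None):
    """Toy BPE trainer (hypothetical helper, not the tokenizers API).

    Learns up to `num_merges` merges from a {word: count} mapping. A pair is
    never merged if the resulting token would be longer than
    `max_token_length` characters; None disables the limit.
    """
    # Each word starts as a tuple of single-character symbols.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    while len(merges) < num_merges:
        pair_counts = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                # Skip candidate pairs whose merged token would be too long.
                if max_token_length is not None and len(a) + len(b) > max_token_length:
                    continue
                pair_counts[(a, b)] += count
        if not pair_counts:
            break  # nothing left that may be merged
        # Most frequent pair; ties are broken lexicographically.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        new_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + count
        words = new_words
    return merges
```

With a limit of 2, training on `{"aaaa": 5}` stops after the single merge `("a", "a")`, because merging `"aa"` with `"aa"` would produce a four-character token.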
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
8129dd3309 pyo3: update to 0.17 (#1066)
* python: update bindings to edition 2021

* python: update to pyo3 0.17

* Updating testing.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-10-05 16:59:01 +02:00
519cc13be0 Upgrade pyo3 to 0.16 (#956)
* Upgrade pyo3 to 0.15

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Upgrade pyo3 to 0.16

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Install Python before running cargo clippy

* Fix clippy warnings

* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`

* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`

* Fix some test cases

Why do they change?

* Refactor and add SAFETY comments to `PyArrayUnicode`

Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`

Co-authored-by: messense <messense@icloud.com>
2022-05-05 15:48:40 +02:00
4b6055d4fb Adding pickling support for trainers (#949)
* TMP.

* Adding support for pickling Python trainers.

* Remove not warranted files + missed naming updates.

* Stubbing.

* Making sure serialized format is written in python tests.
2022-03-14 12:18:11 +01:00
6972e49f1d Fix the clippy warnings. (#869) 2022-01-04 14:32:07 +01:00
6616e699f7 Expand documentation of UnigramTrainer (#770)
* Expand documentation of UnigramTrainer

* Put doc at the source

* Add signature

* make style

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-12 10:12:26 -04:00
d83772d62c Fixing tokenizers with 1.53 (updated some dependencies + clippy) (#764) 2021-07-21 09:58:38 +02:00
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
58e1d8de67 Python - Improve documentation for trainers 2020-11-23 11:52:51 -05:00
13e07da2c8 Node - Add WordLevelTrainer 2020-11-20 13:30:44 -05:00
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except the vocabulary, when it is trained. For example, the configured `unk_token` of BPE is kept when it is trained.
2020-11-20 13:30:44 -05:00
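
The in-place training idea from 54c7210b2f can be sketched as follows. The classes `ToyModel` and `ToyTrainer` are hypothetical stand-ins invented for this illustration and do not mirror the library's types; the point is only that training replaces the vocabulary on an existing model and leaves its other settings alone.

```python
class ToyModel:
    """Hypothetical model holding a setting (unk_token) and a vocabulary."""

    def __init__(self, unk_token="[UNK]"):
        self.unk_token = unk_token
        self.vocab = {}


class ToyTrainer:
    """Hypothetical trainer that fills an existing model in place."""

    def train(self, model, words):
        # Build a fresh vocabulary, reserving id 0 for the model's own unk_token.
        vocab = {model.unk_token: 0}
        for word in sorted(set(words)):
            vocab.setdefault(word, len(vocab))
        # Only the vocabulary is replaced; unk_token and any other
        # configuration on the model are left untouched.
        model.vocab = vocab


model = ToyModel(unk_token="<unk>")
ToyTrainer().train(model, ["b", "a", "b"])
```

Because the trainer mutates the model it is given instead of returning a new one, the caller's configured `unk_token` survives training.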
c230183cf6 A Model can return its associated Trainer 2020-11-20 13:30:44 -05:00
059d43b265 Add WordLevel trainer 2020-11-20 13:30:44 -05:00
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
1a6f4b5204 Allow initial_alphabet on UnigramTrainer 2020-10-26 10:57:29 -04:00
940f8bd8fa Update PyO3 (#426) 2020-09-22 12:00:20 -04:00
52082b5476 New clippy comments? 2020-09-02 16:32:50 +02:00
c0798acacf Address @n1t0 comments. 2020-09-02 16:32:50 +02:00
d624645cf3 Attempting to add UnigramTrainer to python bindings. 2020-09-02 16:32:50 +02:00
ac8af63f70 Trainers don't need Arc. 2020-08-04 15:59:33 -04:00
83a52c8080 Replace Model and Trainer Containers.
* Implement changes necessary from generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be
  derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
2020-08-04 15:59:33 -04:00
c02d4e2202 Python - Improve AddedToken interface 2020-06-19 17:53:46 -04:00
2dc48e56ac Python - Update pyo3 version
* Use __new__ instead of static method as model constructors
2020-04-06 21:20:16 +02:00
c65d53892d Python - Add bindings for new AddedToken options 2020-03-24 20:58:45 -04:00
08ce105195 Python - Hotfix WordPieceTrainer constructor 2020-02-11 08:13:57 -05:00
4971e9608d Implement __new__ on Trainers
__new__ allows Trainers to be initialized in the normal python
fashion.
2020-02-10 10:43:29 +01:00
ef21c9a7b0 Hotfix for new Builder
cc @epwalsh
2020-01-08 16:19:51 -05:00
c51e340492 Python - Add WordPieceTrainer 2020-01-03 19:37:29 -05:00
e64b54b29e Python - Update BpeTrainer interface 2020-01-03 19:37:29 -05:00
0589deb6e2 Python - Expose BpeTrainer options 2020-01-02 18:09:04 -05:00
c0ed873c4d simplify initialization of BpeTrainer 2019-12-23 20:13:48 -05:00
eaafb22511 Add bindings for Trainer in Python 2019-12-03 15:54:15 -05:00