* Testing ABI3 wheels to reduce the number of wheels.
* No need for py-clone anymore.
* Upgrade Python versions.
* Remove those flakes.
* Promoting new CI + Fixing secret.
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, as it needs to be too specific to live as a standalone crate.
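To make the effect concrete, here is a minimal sketch of what this enables from the Python side (only assuming that `repr()` now returns something readable; the exact printed format is illustrative):

```python
from tokenizers.normalizers import Lowercase

# With serde-backed __str__/__repr__, bound components print a readable
# description instead of an opaque "<... object at 0x...>" default.
print(repr(Lowercase()))
```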
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
* Implement a simple max_sentencepiece_length in BPE
Add a way for the BPE trainer to behave like the Unigram trainer, where tokens longer than a certain length (default 16 in SentencePiece) are skipped. This is implemented in the Unigram trainer, but in a different way.
If this code were to be actually integrated, some work remains to be done:
- Documentation describing the behavior and how it should be set.
- Set the default to 0 so it doesn't act unless set.
- Provide ways in the Python bindings for the user to set the max token length (a usage sketch follows below).
I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.
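As referenced above, a minimal sketch of the eventual Python-side usage, assuming the binding exposes the cap as `max_token_length` (the name later commits settle on) and that leaving it unset disables the behavior:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Cap merged tokens at 16 characters, mirroring SentencePiece's
# max_sentencepiece_length default; unset means no cap.
trainer = BpeTrainer(vocab_size=100, max_token_length=16)

tokenizer = Tokenizer(BPE())
tokenizer.train_from_iterator(["aaaaaaaaaaaaaaaaaaaaaaaa"] * 100, trainer=trainer)
assert all(len(token) <= 16 for token in tokenizer.get_vocab())
```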
* Utilize `Option<u16>` for safer code.
* Other version.
* Update trainer.rs
Clarify with type usize; propagate the max_length option.
* Change max_length to a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function. Since the underlying concept in BPE is merges, using max_merge_length as the variable name could prove more descriptive.
* Change the variable name in trainer.rs
Rename max_merge_length to max_token_length.
* Update trainer.rs
Add several max_token_length declarations that were missing in `impl BpeTrainerBuilder` and `struct BpeTrainer`. Add an explanation for the variable shadowing.
* Update trainer.rs
Move the default definition of max_token_length to the proper location and adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
Clarify the test documentation.
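A condensed sketch of what such a direct assert can look like (illustrative only; the real tests live in the Rust and Python test suites and may differ):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

def test_max_token_length():
    # "abcabcabc" could otherwise merge into a single 9-character token;
    # with max_token_length=2, no trained token may exceed 2 characters.
    trainer = BpeTrainer(vocab_size=40, max_token_length=2)
    tokenizer = Tokenizer(BPE())
    tokenizer.train_from_iterator(["abcabcabc"] * 50, trainer=trainer)
    assert max(len(token) for token in tokenizer.get_vocab()) <= 2
```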
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Upgrade pyo3 to 0.15
Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
* Upgrade pyo3 to 0.16
Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
* Install Python before running cargo clippy
* Fix clippy warnings
* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`
* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`
* Fix some test cases
Why do they change?
* Refactor and add SAFETY comments to `PyArrayUnicode`
Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`
Co-authored-by: messense <messense@icloud.com>
* TMP.
* Adding support for pickling Python trainers.
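A minimal sketch of what this enables (the trainer arguments are illustrative):

```python
import pickle
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
# Trainers now implement the pickle protocol, so this round-trips.
restored = pickle.loads(pickle.dumps(trainer))
```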
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
This lets us keep everything that was set on the model, except the vocabulary, when trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
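A hedged sketch of the behavior being pinned down by those tests (the JSON field layout is assumed from the serialized format):

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["hello world"] * 10, trainer=trainer)

# The configured unk_token must survive training in the serialized model.
assert json.loads(tokenizer.to_str())["model"]["unk_token"] == "[UNK]"
```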
* First pass on automatically stubbing our Python files.
* And now modifying all Rust docs to be visible in .pyi files.
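Illustrative shape of a generated stub entry (hypothetical excerpt; the real signature and docstring are generated from the Rust sources):

```python
# tokenizers/__init__.pyi (illustrative excerpt, not the actual generated file)
class Tokenizer:
    def encode(self, sequence, pair=None, is_pretokenized=False, add_special_tokens=True):
        """Encode the given sequence and pair of sequences.

        Docstring propagated from the Rust source so it is visible in .pyi stubs.
        """
        ...
```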
* Better assert fail message.
* Fixing github workflow.
* Removing types not exported anymore.
* Fixing `Tokenizer` signature.
* Disabling auto __init__.py.
* Re-enabling some types.
* Don't overwrite non-automated __init__.py
* Automated most __init__.py
* Restubbing after rebase.
* Fixing env for tests.
* Install black in the env.
* Use PY35 target in stub.py
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
* Implement changes necessary from generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.