* CD backports
Follow huggingface/safetensors#317.
* Fix node bindings?
`cargo check` doesn't work in my local configuration from `tokenizers/bindings/node/native`. I don't think this will be a problem, but it's hard to tell.
* Backport #315
* Backport safetensors#317
* Split `get_n_added_tokens` into separate method
* Modify `TokenizerImpl::with_truncation()` to return an error if given bad parameters (see the sketch after this group)
* Return a Python error if `tokenizer.with_truncation()` fails
* Add dummy variable assignment for `no_truncation()` case
* Unrelated fmt fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
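A minimal sketch of the changed behavior, assuming the post-change Rust signature where `with_truncation` returns a `Result` instead of a plain `&mut Self`; the file path is a placeholder:

```rust
use tokenizers::{Tokenizer, TruncationParams};

fn main() -> tokenizers::Result<()> {
    // "tokenizer.json" is a placeholder path for illustration.
    let mut tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // `with_truncation` now returns a `Result`, so bad parameters surface
    // as an error (mapped to a Python exception in the bindings) instead
    // of being silently accepted.
    tokenizer.with_truncation(Some(TruncationParams {
        max_length: 128,
        ..Default::default()
    }))?;

    // Passing `None` is the `no_truncation()` case and disables truncation.
    tokenizer.with_truncation(None)?;
    Ok(())
}
```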
* Fix documentation regarding regex
`Split()` in pre_tokenizers.rs and the normalizers take a regex that must be built with the tokenizer-specific regex module.
Clarify this in the documentation (see the sketch after this group).
* Update __init__.pyi
Fixed `__init__.pyi`.
* Update bindings/python/py_src/tokenizers/__init__.pyi
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update bindings/python/py_src/tokenizers/__init__.pyi
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"
This reverts commit 6e8bdfcddf67bcdd8e3b1a78685fd5ef8f6a153c.
* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"
This reverts commit 897b0c0de471ad7cb6269b8456347c4e5cff2aaf.
* Revert "Update __init__.pyi"
This reverts commit fbe82310b7728ee7cdb6f8b38fbc2388f9d95771.
* Add code blocks the right way
* Add code blocks with stub.py
Ran `setup.py install` to build, then ran `stub.py`.
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
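A hedged sketch of what the clarified docs describe, shown here on the Rust side: `Split` distinguishes a literal string pattern from a regex, and the regex must come through the library's own regex support (`SplitPattern::Regex`; in the Python bindings, a `tokenizers.Regex`), not a general-purpose regex object:

```rust
use tokenizers::pre_tokenizers::split::{Split, SplitPattern};
use tokenizers::SplitDelimiterBehavior;

fn main() -> tokenizers::Result<()> {
    // A plain string pattern is matched literally...
    let _literal = Split::new(" ", SplitDelimiterBehavior::Removed, false)?;

    // ...while a regex must be built with the tokenizer-specific regex
    // module, wrapped in `SplitPattern::Regex`.
    let _regex = Split::new(
        SplitPattern::Regex(r"\s+".into()),
        SplitDelimiterBehavior::Isolated,
        false,
    )?;
    Ok(())
}
```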
* Make `decode` and `decode_batch` work on borrowed content (see the sketch after this group).
* Make `decode_batch` work with borrowed content.
* Fix lint.
* Attempt to map it into Node.
* Second attempt.
* Step by step.
* One more step.
* Fix lint.
* Please ...
* Removing collect.
* Revert "Removing collect."
This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
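A minimal sketch of the resulting call sites, assuming the post-change signatures `decode(&self, ids: &[u32], ..)` and `decode_batch(&self, sentences: &[&[u32]], ..)`; the file path and ids are placeholders:

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // "tokenizer.json" is a placeholder path for illustration.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    let ids: Vec<u32> = vec![0, 1, 2];

    // `decode` now borrows the ids instead of consuming a `Vec<u32>`,
    // so `ids` remains usable afterwards.
    let text = tokenizer.decode(&ids, true)?;

    // `decode_batch` likewise takes borrowed slices.
    let texts = tokenizer.decode_batch(&[&ids[..]], true)?;

    println!("{text} / {texts:?}");
    Ok(())
}
```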
* Implement a simple max_sentencepiece_length in BPE
Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SentencePiece) are skipped. The unigram trainer already implements this, but differently.
If this code were to be actually integrated, some work remains:
Document the behavior and how it should be set.
Default to 0 so it does nothing unless set.
Provide a way in the Python bindings for the user to set the max token length.
I tried to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus. (See the usage sketch at the end of this group.)
* Use `Option<u16>` for safer code.
* Other version.
* Update trainer.rs
Clarify with type `usize`; propagate the max_length option.
* Change max_length to a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function. Since the underlying concept in BPE is merges, max_merge_length could prove a more descriptive variable name.
* Change variable name in trainer.rs
Change max_merge_length to max_token_length.
* Update trainer.rs
Add several max_token_length declarations that were missing in `impl BpeTrainerBuilder` and `struct BpeTrainer`. Add an explanation for the variable shadowing.
* Update trainer.rs
Move the default definition of max_token_length to the proper location and adjust downstream variable initializations accordingly.
* Add max_token_length test
* Add BPE direct assert test
* Update trainer.rs
Clarified the test documentation.
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
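A hedged usage sketch of the option this series adds, assuming the `max_token_length` builder method on the BPE trainer, with `None` meaning no limit:

```rust
use tokenizers::models::bpe::BpeTrainer;

fn main() {
    // Skip any merge that would create a token longer than 16 characters,
    // mirroring SentencePiece's default maximum piece length.
    let trainer = BpeTrainer::builder()
        .vocab_size(30_000)
        .max_token_length(Some(16))
        .build();
    let _ = trainer;
}
```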