* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
* version = "0.15.3-dev-0”
Improve performance of Metaspace, but also just fix it.
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
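For reference, a minimal sketch of timing the `Metaspace` pre-tokenizer alone, assuming the post-refactor Python binding that exposes `prepend_scheme` and `split` (the original `../scripts/gemma-dummy.py` is not part of this log):

```python
# Minimal timing sketch for Metaspace pre-tokenization; the `prepend_scheme`
# and `split` keyword arguments are assumed from the refactor described above.
import time
from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace(replacement="▁", prepend_scheme="first", split=True)
text = "inform. Hey.       ." * 10_000  # synthetic input, heavy on spaces

start = time.perf_counter()
pre.pre_tokenize_str(text)  # returns a list of (piece, (start, end)) pairs
print(time.perf_counter() - start)
```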
* well what do we have
* nit
* be BC with non-legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Move to maturin, mimicking the move for `safetensors`.
* Tmp.
* Fix sdist.
* Wat?
* Clippy 1.72
* Remove if.
* Conda sed.
* Fix doc check workflow.
* Moving to maturin AND removing the http + openssl mess (smoothing the transition to `huggingface_hub`)
* Fix dep
* Black.
* New node bindings.
* Fix docs + node cache ?
* Yarn.
* Working dir.
* Extension module.
* Put back interpreter.
* Remove cache.
* New attempt
* Multi python.
* Remove FromPretrained.
* Remove traces of `fromPretrained`.
* Drop 3.12 for windows?
* Typo.
* Put back the default feature for ignoring links during simple test.
* Fix ?
* x86_64 -> x64.
* Remove warning for windows bindings.
* Exclude aarch.
* Include/exclude.
* Put back workflows in correct states.
* CD backports
Following huggingface/safetensors#317
* fix node bindings?
`cargo check` doesn't work in my local configuration from `tokenizers/bindings/node/native`.
I don't think it will be a problem, but I have difficulty telling.
* backport #315
* safetensors#317 backports
* Makes `decode` and `decode_batch` work on borrowed content.
* Make `decode_batch` work with borrowed content.
* Fix lint.
* Attempt to map it into Node.
* Second attempt.
* Step by step.
* One more step.
* Fix lint.
* Please ...
* Removing collect.
* Revert "Removing collect."
This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
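From the Python side this change is invisible; for context, a hedged sketch of the two calls in question (the tokenizer file path is hypothetical):

```python
# Hedged usage sketch of `decode` / `decode_batch`; the borrowed-content
# change lives in the Rust core and leaves this Python-facing API unchanged.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical local file
print(tok.decode([1, 2, 3], skip_special_tokens=True))
print(tok.decode_batch([[1, 2, 3], [4, 5]], skip_special_tokens=True))
```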
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* implement a simple max_sentencepiece_length into BPE
Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.
If this code were to be actually integrated, some work remains to be done:
- Documentation describing the behavior and how it should be set.
- Keep the default == 0 so it doesn't act unless set.
- Provide ways in the Python bindings for the user to set the max token length.
I was trying to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it is very difficult, and regexes can be really slow when operating on the whole training corpus. (A sketch of the resulting option, via the Python bindings, follows this series of commits.)
* utilize Option<u16> for safer code.
* Other version.
* Update trainer.rs
clarify with type usize; propagate the max_length option
* change max_length into a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
* change variable name in trainer.rs
change max_merge_length into max_token_length
* Update trainer.rs
Add several max_token_length declarations that were missing in impl BpeTrainerBuilder and struct BpeTrainer.
Add an explanation for variable shadowing.
* Update trainer.rs
Move the default definition of max_token_length to the proper location. Adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
clarified test documentation
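As referenced above, a minimal sketch of the resulting option through the Python bindings, assuming `max_token_length` is exposed as a `BpeTrainer` keyword argument (the corpus path is hypothetical):

```python
# Hedged sketch: BPE training with the new max_token_length option, which
# skips merges that would produce tokens longer than the given length.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    special_tokens=["[UNK]"],
    max_token_length=16,  # mirrors SPM's default maximum piece length
)
tokenizer.train(["data/corpus.txt"], trainer)  # hypothetical corpus path
```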
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want for the `Metaspace` thing.
- Note: added a new dependency for better parsing of decoders. This is due to untagged enums, which can match anything; the `MustBe` ensures there's no issue between Fuse and ByteFallback. Since both are new, the chances of backward incompatibility are low.
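A hedged sketch of that sequence via the Python bindings; the `Strip` keyword names (`content`/`left`/`right`) are assumptions, and `▁` is used as the actual Metaspace replacement character:

```python
# Hedged sketch of Sequence(Replace(...), Fuse(), Strip(...)) from Python;
# the Strip keyword names (content/left/right) are assumptions.
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.Replace("▁", " "),                    # Metaspace char back to spaces
    decoders.Fuse(),                               # concatenate all tokens into 1 string
    decoders.Strip(content=" ", left=1, right=0),  # drop the single leading space
])
print(decoder.decode(["▁Hey", "▁there"]))  # expected: "Hey there"
```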
* Fixing pickling/unpickling (using default args).
* Stub.
* Black.
* Fixing node.
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything back into a string whenever possible, showing � when the byte decoding fails (behavior checked against LlamaTokenizer in `transformers`).
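A hedged illustration of the two pieces together, with a toy vocab (the byte-token entries and ids are made up for the example):

```python
# Hedged sketch: byte_fallback on the BPE model plus the ByteFallback decoder;
# the toy vocab and ids below are made up for the example.
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE

vocab = {"<unk>": 0, "<0x61>": 1, "<0x62>": 2}  # byte tokens for 'a' and 'b'
tokenizer = Tokenizer(BPE(vocab=vocab, merges=[], unk_token="<unk>", byte_fallback=True))
tokenizer.decoder = decoders.ByteFallback()

enc = tokenizer.encode("ab")      # falls back to <0x61>, <0x62> instead of <unk>
print(tokenizer.decode(enc.ids))  # bytes reassembled into "ab"; invalid UTF-8 shows �
```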
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
* Adding rust audit.
* Update clap version + derive_builder (they clashed).
* Ignoring a specific CVE which can safely be ignored
https://github.com/Azure/iot-identity-service/issues/481
* Updating python lock.
* Revert `derive-builder` update.
* Adding back help msg.
* New version.
The actual release will happen *before* PyO3 0.17.2 because the tests were run before then.
* Manylinux2014 necessary now with Rust 1.64.
* Update README.md
Add reference to normalizer blog post
* Update lib.rs
* Fixing PR + clippy on node.
* Update readme to match docstring.
* Other clippy warning.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>