* refactor: lift cloning to caller
* refactor: do not elide lifetimes as in Rust 2018
* fix: unsound use of env::set_var, which was breaking with the stdlib change making it unsafe
It is generally not safe to set environment variables from a library. The correct
way to handle a config value that needs to be overridden is to hold a copy
internal to the library and only ever read from the environment.
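As a rough illustration of the pattern described above (the names below are hypothetical, not the actual tokenizers internals), the override lives in a struct owned by the library and the environment is only ever read, never written:

```rust
use std::env;

/// Hypothetical config holder; a sketch, not the real tokenizers API.
pub struct ParallelismConfig {
    override_value: Option<bool>,
}

impl ParallelismConfig {
    /// Record the override locally instead of calling env::set_var,
    /// which is unsound while other threads may be reading the environment.
    pub fn set(&mut self, value: bool) {
        self.override_value = Some(value);
    }

    /// Prefer the internal copy; fall back to a read-only env lookup.
    pub fn get(&self) -> bool {
        self.override_value.unwrap_or_else(|| {
            env::var("TOKENIZERS_PARALLELISM")
                .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
                .unwrap_or(false)
        })
    }
}
```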
* [Breaking Change] 16% perf improvement by removing offsets.
Offsets are always recalculated in Python land.
By skipping that calculation, we gain 16% of the runtime.
This is not the full extent of the gain, because offsets are
still calculated in bytes.
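A minimal sketch of the idea, using a made-up type rather than the real `Encoding` struct: offsets become optional, so a caller that never reads them does not pay for their computation.

```rust
/// Hypothetical encoding result; illustrative only.
pub struct FastEncoding {
    pub ids: Vec<u32>,
    /// Offsets are filled in only when explicitly requested.
    pub offsets: Option<Vec<(usize, usize)>>,
}

impl FastEncoding {
    /// Hot path: token ids only, no offset computation at all.
    pub fn without_offsets(ids: Vec<u32>) -> Self {
        FastEncoding { ids, offsets: None }
    }
}
```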
* Required features.
* Remove clippy error.
* Make it non-breaking and still show the perf improvement.
* Even faster without offsets.
* Update doc.
* Fmt.
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fmt.
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
* Merges cannot handle tokens containing spaces.
This fixes that while keeping backward support.
We don't want to merge such tokens blindly.
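For context, a sketch (not the actual parsing code) of why a whitespace-delimited merge entry becomes ambiguous once a token itself contains a space, and why keeping merges as explicit pairs avoids the problem:

```rust
/// Legacy-style parsing: split the merge line on the first space.
fn parse_legacy_merge(line: &str) -> Option<(String, String)> {
    let (left, right) = line.split_once(' ')?;
    Some((left.to_string(), right.to_string()))
}

fn main() {
    // The pairs ("a b", "c") and ("a", "b c") both serialize to "a b c",
    // so the legacy format cannot round-trip a token containing a space.
    assert_eq!(
        parse_legacy_merge("a b c"),
        Some(("a".to_string(), "b c".to_string()))
    );
    // Storing the merge as an explicit pair, e.g. ["a b", "c"], is unambiguous.
}
```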
* Update the tests.
* Fixing clippy.
* Add a test with spaces in the token/merge.
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"
This reverts commit 25aee8b88c.
* don't remove audit
* deprecate id_to_token
* use simple id to token
* don't break id_to_token since we are deprecating anyway?
* [BREAKING CHANGE] Ignore added_tokens (both special and not) in the
decoder
Passing those tokens through the decoder causes issues: `ByteLevel` messes up
`AddedTokens` whose text falls in the utf-8 range used by the byte-level mapping.
This commit tests the extent of the damage of ignoring the decoder for
those tokens.
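A toy illustration of the failure mode (a single entry of the GPT-2-style byte-to-unicode alphabet, not the real `ByteLevel` mapping): running an added token's text back through the byte-level decoder rewrites any character that happens to belong to that mapping.

```rust
/// Toy, single-entry decode mapping: in the GPT-2-style alphabet,
/// 'Ġ' (U+0120) stands for the space byte.
fn toy_bytelevel_decode(c: char) -> char {
    if c == 'Ġ' { ' ' } else { c }
}

fn main() {
    // Hypothetical added token whose text contains a mapped character.
    let added_token = "ĠSpecial";
    let decoded: String = added_token.chars().map(toy_bytelevel_decode).collect();
    // The added token no longer round-trips through the decoder.
    assert_eq!(decoded, " Special");
}
```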
* Format.
* Installing cargo audit.
* Minor fix.
* Fixing "bug" in node/python.
* Autoformat.
* Clippy.
* Only prefix space when there's no decoder.
* version = "0.15.3-dev-0"
Improve performance of Metaspace, but also just fix it.
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
* well what do we have
* nit
* be BC with non-legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>