* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, it needs to be too specific.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"
This reverts commit 25aee8b88c.
* don't remove audit
* deprecate id_to_token
* use simple id to token
* don't break id_to_token since we are deprecating anyways?
* [BREAKING CHANGE] Ignore added_tokens (both special and not) in the
decoder
Causes issues with `ByteLevel` messing up some `AddedTokens` with some
utf-8 range used in the bytelevel mapping.
This commit tests the extent of the damage of ignoring the decoder for
those tokens.
* Format.
* Installing cargo audit.
* Minor fix.
* Fixing "bug" in node/python.
* Autoformat.
* Clippy.
* Only prefix space when there's no decoder.
* remove enforcement of non special when adding tokens
* mut no longer needed
* add a small test
* nit
* style
* audit
* ignore cargo audit's own vulnerability
* update
* revert
* remove CVE
* version = "0.15.3-dev-0"
Improve the performance of Metaspace, but also just fix it.
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
* well what do we have
* nit
* be BC with non legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>