* version = "0.15.3-dev-0”
Improve the performance of Metaspace, but also just fix it.
```
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]

(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
```
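The dummy script itself is not part of this log; a minimal sketch of the kind of timing loop that would print a list of seconds like the ones above (the checkpoint name and input sizes are assumptions, not the script's actual contents):

```python
import time

from tokenizers import Tokenizer

# Hypothetical sketch: ../scripts/gemma-dummy.py is not shown here, so the
# checkpoint and the input sizes below are assumptions for illustration only.
tok = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

timings = []
for n in (10, 100, 1_000, 10_000, 100_000):
    text = "Hey " * n
    start = time.perf_counter()
    tok.encode(text)
    timings.append(time.perf_counter() - start)
print(timings)  # list of seconds growing with input size, as in the runs above
```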
* well what do we have
* nit
* be BC with non-legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids (see the sketch after this list)
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
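To make "splitting is a must for word_ids" concrete, here is a hedged sketch of the pre-tokenizer behaviour (the `split` switch and its setter are assumed from the changes above; outputs are illustrative):

```python
from tokenizers.pre_tokenizers import Metaspace

# Sketch only: parameter and attribute names are assumed from this release.
pre = Metaspace(replacement="▁", prepend_scheme="always")

pre.split = True
print(pre.pre_tokenize_str("Hey my friend"))
# e.g. [('▁Hey', (0, 3)), ('▁my', (3, 6)), ('▁friend', (6, 13))]
# one pre-token per word, so word_ids can still be mapped back to words

pre.split = False
print(pre.pre_tokenize_str("Hey my friend"))
# e.g. [('▁Hey▁my▁friend', (0, 13))]
# a single pre-token: word boundaries are lost and word_ids become meaningless
```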
* add doc in the code
* add option to skip special tokens (see the example after this list)
* nits
* add api dummy for now
* Fmt.
* Fix fmt.
* Fix the stub.
* add a test
* add a test in python
* style it
* nits
* add getter and setters
* stub
* update python test
* fmt
* last nit
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
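The log above does not spell out exactly where the new skip-special-tokens option surfaces; as a hedged point of reference, the same idea on the Python decoding side looks like this (the checkpoint name is an assumption):

```python
from tokenizers import Tokenizer

# Any tokenizer that registers special tokens behaves the same way here.
tok = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

ids = tok.encode("Hey.").ids
print(tok.decode(ids, skip_special_tokens=False))  # keeps any special-token markers such as "<s>"
print(tok.decode(ids, skip_special_tokens=True))   # drops them from the decoded text
```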
* nits
* allow for legacy behaviour without making any breaking changes
* add a todo
* set to legacy by default
* skip legacy serialization
* push correct update
* lint
* add deserialization test
* add a python test as well
* updates
* fix serialization tests
* nits
* python styling of the tests
* better tests
* fix offsets
* fix imports
* fmt
* update metaspace
* remove TODO
* use enum
* fix some tests
* nits
* use enum
* update tests
* styling
* remove impl From for PrependScheme
* use simple getters and setters
* lint
* update tests
* add test new == new_with_prepend_scheme
* revert a change
* use setters and getters
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* nits
* use copy rather than ref
* nits format
* more nits
* allow option string
* enforce camel-cased First, Never, Always (see the sketch after this list)
* nits
* refactor
* update test as well
* fmt
* nits
* properly error out
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* suggestion changes
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
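A hedged sketch of the resulting Python surface for the prepend scheme, assuming the lowercase strings "first" / "never" / "always" map onto the camel-cased Rust enum variants:

```python
from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace()               # defaults assumed: replacement="▁", prepend_scheme="always"
print(pre.prepend_scheme)       # -> "always"

pre.prepend_scheme = "first"    # simple setter; an unknown string should error out cleanly

print(Metaspace(prepend_scheme="never").pre_tokenize_str("Hey you"))
# e.g. [('Hey', (0, 3)), ('▁you', (3, 7))] -- nothing prepended to the leading word
```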
* Fixing the progressbar.
* Upgrade deps.
* Update cargo audit
* Ssh this action.
* Fixing esaxx by using the slower Rust version.
* Trying the new esaxx version.
* Publish.
* Get cache again.