* [Breaking Change] 16% perf improvement by removing offsets.
Offsets are always calculated in Python land.
By skipping that calculation, we win 16% of the runtime.
This is not the full extent of the gain, because offsets are
still calculated in bytes.
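A minimal sketch of the overhead in question (the helper and the offsets below are illustrative, not the library's code): converting byte offsets back to character offsets is per-token Python work, which is the kind of cost avoided by leaving offsets in bytes.

```python
text = "héllo wörld"

def byte_to_char(text, byte_idx):
    # count how many characters fit in the first byte_idx UTF-8 bytes
    return len(text.encode("utf-8")[:byte_idx].decode("utf-8", errors="ignore"))

# offsets in bytes for "héllo" and "wörld" ("é" and "ö" are 2 bytes each)
byte_offsets = [(0, 6), (7, 13)]

# the per-token conversion that costs Python runtime
char_offsets = [(byte_to_char(text, s), byte_to_char(text, e))
                for s, e in byte_offsets]
```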
* Required features.
* Remove clippy error.
* Make it non breaking and still show perf improvement.
* Even faster without offsets.
* Update doc.
* Fmt.
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fmt.
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
* Merges cannot handle tokens containing spaces.
This fixes that while keeping backward compatibility.
We don't want to merge this blindly.
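A sketch of the failure mode (the line format here is illustrative): merge rules are conventionally stored as a space-separated pair per line, so a token that itself contains a space makes a blind split ambiguous; representing merges as explicit pairs keeps them unambiguous.

```python
# A merges line is conventionally "left right", split on whitespace.
line = "hello world !"  # ("hello world", "!") or ("hello", "world !")?
blind = line.split(" ")  # ['hello', 'world', '!'] -- three parts, ambiguous

# Storing merges as explicit pairs keeps space-containing tokens intact.
merges = [("hello world", "!")]
left, right = merges[0]
merged = left + right
```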
* Update the tests.
* Fixing clippy.
* Add a test with spaces in the token/merge.
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
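The idea of deriving `__str__`/`__repr__` from serialization (as serde_pyo3 does for the Rust bindings) can be sketched in plain Python; the class below is hypothetical, standing in for a pyo3-wrapped struct.

```python
import json

class Normalizer:
    # hypothetical object standing in for a pyo3-wrapped struct
    def __init__(self, lowercase=True, strip_accents=None):
        self.lowercase = lowercase
        self.strip_accents = strip_accents

    def __repr__(self):
        # derive the repr from the serialized state, so it stays in
        # sync with the fields instead of being hand-written
        return f"{type(self).__name__}({json.dumps(self.__dict__)})"

print(repr(Normalizer()))
```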
* Putting it within tokenizers, as it needs to be too specific to live elsewhere.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"
This reverts commit 25aee8b88c.
* don't remove audit
* deprecate id_to_token
* use simple id to token
* don't break id_to_token since we are deprecating anyway?
* [BREAKING CHANGE] Ignore added_tokens (both special and not) in the
decoder
`ByteLevel` was causing issues by messing up some `AddedTokens` whose
characters fall in the utf-8 range used in the byte-level mapping.
This commit tests the extent of the damage of ignoring the decoder for
those tokens.
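A sketch of the failure mode being tested, using a reimplementation of the well-known GPT-2 byte-to-unicode table for illustration (not the library's decoder): an added token whose characters also appear in the byte-level alphabet gets re-interpreted as raw bytes and corrupted when routed through the byte-level decoder.

```python
def bytes_to_unicode():
    # GPT-2 style mapping from bytes to printable unicode characters
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))
          + list(range(0xAE, 0xFF + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_decoder = {v: k for k, v in bytes_to_unicode().items()}

# an added token; 'é' (0xE9) also lives in the byte-level alphabet,
# so the decoder re-reads it as a lone raw byte -- invalid UTF-8
token = "é"
decoded = bytes([byte_decoder[c] for c in token]).decode(
    "utf-8", errors="replace")
```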
* Format.
* Installing cargo audit.
* Minor fix.
* Fixing "bug" in node/python.
* Autoformat.
* Clippy.
* Only prefix space when there's no decoder.
* remove enforcement of non special when adding tokens
* mut no longer needed
* add a small test
* nit
* style
* audit
* ignore cargo audit's own vulnerability
* update
* revert
* remove CVE