* initial commit
* support None (see the sketch after this commit list)
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node, be happy, I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
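The "support None" and "propagate" commits above let pipeline components be unset from the bindings. A minimal sketch of the resulting behavior, assuming this PR is the one that allows assigning None from Python:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

# Components can now be cleared instead of only replaced.
tokenizer.pre_tokenizer = None
tokenizer.normalizer = None
tokenizer.decoder = None
tokenizer.post_processor = None
```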
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, since it needs to be too specific.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
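A hedged sketch of what the serde_pyo3-backed printing enables; the exact layout of the output is up to the serializer, so only the calls are shown:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
print(str(tokenizer))   # __str__: serde-derived view of the full pipeline
print(repr(tokenizer))  # __repr__: built from the same serialization
```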
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* remove enforcement of non special when adding tokens (sketch after this PR's commits)
* mut no longer needed
* add a small test
* nit
* style
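A small sketch of what lifting that enforcement allows, assuming `add_tokens` previously forced `special=False` on whatever was passed in:

```python
from tokenizers import AddedToken, Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
# The special flag on the AddedToken is now respected by add_tokens.
tokenizer.add_tokens([AddedToken("<custom>", special=True)])
```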
* audit
* ignore cargo audit's own vulnerability
* update
* revert
* remove CVE
* version = "0.15.3-dev-0”
Improve the performance of Metaspace, but also just fix it (see the timing sketch after this PR's commits).
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
* well what do we have
* nit
* be backward compatible (BC) with non-legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
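A hypothetical re-creation of the timing loop shown in the commit body above (`../scripts/gemma-dummy.py` is not part of this snapshot, so the script, model name, and input text are assumptions):

```python
import time

from tokenizers import Tokenizer

# Assumption: any tokenizer whose pre_tokenizer is Metaspace will do.
tokenizer = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

timings = []
for repeats in (10, 100, 1_000, 10_000, 100_000):
    text = "inform . Hey . " * repeats
    start = time.perf_counter()
    tokenizer.encode(text)
    timings.append(time.perf_counter() - start)
print(timings)  # compare before/after the Metaspace fix
```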
* add doc in the code
* add option to skip special tokens (sketch after this PR's commits)
* nits
* add api dummy for now
* Fmt.
* Fix fmt.
* Fix the stub.
* add a test
* add a test in python
* style it
* nits
* add getter and setters
* stub
* update python test
* fmt
* last nit
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
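A minimal sketch of the option at work, assuming it is the usual skip-special-tokens switch on the decoding path:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(["<s>", "</s>"])
tokenizer.add_tokens(["hello"])

ids = [tokenizer.token_to_id(t) for t in ("<s>", "hello", "</s>")]
print(tokenizer.decode(ids))                             # "hello": specials skipped by default
print(tokenizer.decode(ids, skip_special_tokens=False))  # "<s> hello </s>"
```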
* nits
* allow for legacy behaviour without making any breaking changes (sketch after this PR's commits)
* add a todo
* set to legacy by default
* skip legacy serialization
* push correct update
* lint
* add deserialization test
* add a python test as well
* updates
* fix serialization tests
* nits
* python styling of the tests
* better tests
* fix offsets
* fix imports
* fmt
* update metaspace
* remove TODO
* use enum
* fix some tests
* nits
* use enum
* update tests
* styling
* remove `impl From` for PrependScheme
* use simple getters and setters
* lint
* update tests
* add test new == new_with_prepend_scheme
* revert a change
* use setters and getters
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* nits
* use copy rather than ref
* nits format
* more nits
* allow option string
* enforce camel-cased First/Never/Always variants
* nits
* refactor
* update test as well
* fmt
* nits
* properly error out
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* suggestion changes
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
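A short sketch of the resulting Python surface, assuming the camel-cased First/Never/Always variants map to these lowercase strings in the binding:

```python
from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace(replacement="▁", prepend_scheme="first")
print(pre.pre_tokenize_str("Hey my friend"))

# prepend_scheme is exposed through plain getters and setters:
pre.prepend_scheme = "never"
```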
* implement a simple max_sentencepiece_length into BPE
Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.
If this code were to be actually integrated, some work remains to be done:
Documentation describing the behavior and how it should be set.
Set default==0 so it doesn't act unless set.
Provide ways in the python binding for the user to set max token length (see the sketch after this PR's commits).
I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.
* utilize Option<u16> for safer code.
* Other version.
* Update trainer.rs
clarify with type usize; propagate max_length option
* change max_length into a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers),
UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
* change variable name in trainer.rs
change max_merge_length into max_token_length
* Update trainer.rs
add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer
Add explanation for variable shadowing.
* Update trainer.rs
Move default definition of max_token_length to the proper location. Adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
clarified test documentation
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
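A hedged sketch of the knob this PR adds: with max_token_length set on the BPE trainer, merges that would create a token longer than the limit are skipped during training.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]"],
    max_token_length=16,  # roughly SPM's default max piece length
)
tokenizer.train_from_iterator(["some training sentences go here"], trainer=trainer)
```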
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.
- Note: added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no issue between Fuse and ByteFallback.
Since both are new, the chance of backward incompatibility is low.
* Fixing pickling/unpickling (using default args).
* Stub.
* Black.
* Fixing node.
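A sketch of the decoder chain described above, using "_" as the replacement character: Replace turns it back into spaces, Fuse joins the tokens into one string, and Strip drops a single leading character.

```python
from tokenizers.decoders import Fuse, Replace, Sequence, Strip

decoder = Sequence([Replace("_", " "), Fuse(), Strip(" ", 1, 0)])
print(decoder.decode(["_Hey", "_my", "_friend"]))  # "Hey my friend"
```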
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into string whenever possible, showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`).
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
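A hedged sketch of the two halves of the feature: `BPE(byte_fallback=True)` makes the model emit `<0xAB>`-style byte tokens instead of unk, and the new decoder folds them back into text (or U+FFFD when the bytes are not valid UTF-8):

```python
from tokenizers.decoders import ByteFallback

decoder = ByteFallback()
print(decoder.decode(["<0x61>", "<0x62>", "c"]))  # "abc"
```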
* Fixing the vocab size of the trained Unigram model
* add test for the vocab size of the trained Unigram model
* Revert "add test for the vocab size of the trained Unigram model"
This reverts commit fb8955c831b357d1037548ceaa8789734d544646.
* Fixing the vocab size of the trained Unigram model
* format code
* get the position of vocab-size calculation out of loop
* TMP.
* Adding support for pickling Python trainers.
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
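A minimal sketch of the round-trip this enables, presumably going through the same serialized format the Python tests now check:

```python
import pickle

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
restored = pickle.loads(pickle.dumps(trainer))
assert isinstance(restored, BpeTrainer)
```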
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
* add test that verifies an exception is raised for the missing unknown token
* style
* add test tokens
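A hedged sketch, assuming the new unk_token keyword lands on the train methods of the implementation class:

```python
from tokenizers.implementations import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(
    ["a few training sentences", "to give the trainer something to fit"],
    vocab_size=30,
    special_tokens=["<unk>"],
    unk_token="<unk>",  # assumption: without this, encoding unseen characters raises
)
```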
* Rust - add a CTCDecoder as a separate mod
* Adding bindings to Node + Python.
* Clippy update.
* Stub.
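A sketch of the new decoder from the Python side, assuming the wav2vec2-style defaults shown here:

```python
from tokenizers.decoders import CTC

decoder = CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)
tokens = ["<pad>", "h", "h", "e", "l", "l", "<pad>", "l", "o", "|", "h", "i"]
print(decoder.decode(tokens))  # "hello hi": duplicates collapsed, pads dropped
```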
* Fixing roberta.json URLs.
* Moving test files to hf.co.
* Update cargo check and clippy to 1.52.
* Inner ':' is actually used for domains in sphinx.
Making `domain` work correctly was just too much work, so I went the easy
way and used global roles for the custom rust extension.
* Update struct naming and docs
* Update changelog
Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>