* implement a simple max_sentencepiece_length into BPE
Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is already implemented in the unigram trainer, but in a different way.
If this code were to be actually integrated, some work would still need to be done:
Documentation describing the behavior and how it should be set.
Set the default to 0 so it doesn't act unless explicitly set.
Provide a way in the Python bindings for the user to set the max token length.
I tried to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it is very difficult, and regexes can be really slow when operating on the whole training corpus.
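A minimal sketch of how this could look from Python once exposed, assuming a `max_token_length` keyword on `BpeTrainer`; the corpus file name is illustrative.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# 16 mirrors SentencePiece's default max_sentencepiece_length; leaving the
# option unset would keep the previous, unbounded behavior.
trainer = trainers.BpeTrainer(vocab_size=5000, max_token_length=16)
tokenizer.train(["corpus.txt"], trainer=trainer)
```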
* utilize Option<u16> for safer code.
* Other version.
* Update trainer.rs
Clarify the type as usize and propagate the max_length option.
* change max_length into a more descriptive name
In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.
* change variable name in trainer.rs
change max_merge_length into max_token_length
* Update trainer.rs
add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer
Add an explanation for the variable shadowing.
* Update trainer.rs
Move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.
* add max_token_length test
* Add bpe direct assert test
* Update trainer.rs
clarified test documentation
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.
- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no issue between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.
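A hedged usage sketch of that combination through the Python bindings; the exact `Strip` argument names are assumed here, and the ▁ marker stands in for the `Metaspace` prefix.

```python
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.Replace("▁", " "),          # turn the metaspace marker back into spaces
    decoders.Fuse(),                      # concatenate all tokens into one string
    decoders.Strip(content=" ", left=1),  # drop the single leading space (argument names assumed)
])
print(decoder.decode(["▁Hello", "▁there"]))  # expected: "Hello there"
```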
* Fixing pickling/unpickling (using default args).
* Stub.
* Black.
* Fixing node.
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into a string whenever possible, showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`).
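A short sketch of how the two pieces fit together in the Python bindings; a real setup would also need the `<0x..>` byte tokens present in the vocabulary.

```python
from tokenizers import Tokenizer, models, decoders

# byte_fallback=True makes the model emit <0x..> byte tokens (when they exist
# in the vocab) instead of the unk token for unknown characters.
tokenizer = Tokenizer(models.BPE(byte_fallback=True))

# The ByteFallback decoder turns those byte tokens back into text, showing �
# whenever the recovered bytes are not valid UTF-8.
tokenizer.decoder = decoders.ByteFallback()
```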
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
* Fixing the vocab size of the trained Unigram model
* add test for the vocab size of the trained Unigram model
* Revert "add test for the vocab size of the trained Unigram model"
This reverts commit fb8955c831b357d1037548ceaa8789734d544646.
* Fixing the vocab size of the trained Unigram model
* format codes
* move the vocab-size calculation out of the loop
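For reference, a hedged sketch of the behavior being fixed: after training, the Unigram model's vocabulary is expected to match the requested `vocab_size` (file name and size are illustrative, and a sufficiently large corpus is assumed).

```python
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer(
    vocab_size=1000,               # the size the trained model should end up with
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
tokenizer.train(["corpus.txt"], trainer=trainer)
print(tokenizer.get_vocab_size())  # expected to match vocab_size after the fix
```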
* TMP.
* Adding support for pickling Python trainers.
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
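A small sketch of what the pickling support enables for the Python trainer objects; the specific trainer and arguments are just examples.

```python
import pickle
from tokenizers import trainers

trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
restored = pickle.loads(pickle.dumps(trainer))  # round-trips through pickle
print(type(restored))
```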
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
* add a test that verifies an exception is raised for a missing unknown token
* style
* add test tokens
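A hedged sketch of specifying the unknown token; placing the keyword on `train()` (rather than the constructor) is an assumption, and the training file is illustrative.

```python
from tokenizers.implementations import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    ["corpus.txt"],            # illustrative training file
    vocab_size=8000,
    special_tokens=["<unk>"],
    unk_token="<unk>",         # the unknown token to register on the model
)
```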
* Rust - add a CTCDecoder as a separate mod
* Adding bindings to Node + Python.
* Clippy update.
* Stub.
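A brief usage sketch of the CTC decoder through the Python bindings; the pad and word-delimiter tokens shown are the usual conventions, not guaranteed defaults.

```python
from tokenizers import decoders

decoder = decoders.CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)
# CTC decoding collapses repeated tokens, drops the pad token, and turns the
# word delimiter into a space.
print(decoder.decode(["h", "h", "<pad>", "e", "l", "l", "<pad>", "l", "o"]))  # expected: "hello"
```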
* Fixing roberta.json URLs.
* Moving test files to hf.co.
* Update cargo check and clippy to 1.52.
* Inner ':' is actually used for domains in sphinx.
Making `domain` work correctly was just too much work, so I went the easy
way and used global roles for the custom Rust extension.
* Update struct naming and docs
* Update changelog
Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
* start playing around
* make a first version
* refactor
* apply make format
* add python bindings
* add some python binding tests
* correct pre-tokenizers
* update auto-generated bindings
* lint python bindings
* add code node
* add split to docs
* refactor python binding a bit
* cargo fmt
* clippy and fmt in node
* quick updates and fixes
* Oops
* Update node typings
* Update changelog
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
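A hedged usage sketch of the `Split` pre-tokenizer this series adds, via the Python bindings; the pattern and behavior values are illustrative.

```python
from tokenizers import pre_tokenizers

# Split on a single space and drop the matched separator from the output.
pre_tok = pre_tokenizers.Split(pattern=" ", behavior="removed", invert=False)
print(pre_tok.pre_tokenize_str("Hello there friend"))
# expected: [("Hello", (0, 5)), ("there", (6, 11)), ("friend", (12, 18))]
```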