* Fixing the vocab size of the trained Unigram model
* add test for the vocab size of the trained Unigram model
* Revert "add test for the vocab size of the trained Unigram model"
This reverts commit fb8955c831b357d1037548ceaa8789734d544646.
* Fixing the vocab size of the trained Unigram model
* format codes
* move the vocab-size calculation out of the loop
* TMP.
* Adding support for pickling Python trainers.
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
* tokenizer.save has the wrong arguments compared to documentation
* Fixing doc of `save` function.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
* Starting from master again.
Upgrade libssl everywhere on quay
Extra is Ubuntu-based (running the quay in a container).
making only extra run + attempt to fix ssl update.
Extra with newer openssl versions.
`-y`.
Use checkout@v2 + remove `-` from environment name.
Debugging back the conda release.
Attempt to use `base` env.
3.7 requires `activate-environment: true`.
macOS and Windows don't run on manylinux.
Remove yum on Windows/macOS.
Miniconda doesn't like manylinux2014 anymore?
Attempting different approach for manylinux + conda.
Use wget.
Extra bracket.
Executing $filename
Activate the env.
Activate the env on every step that requires it.
Openssl-devel.
Activating env for extracting version?
Retest all workflows.
Manylinux2010 requires checkout@v1
Run on tag for extra and conda again.
openssl-devel.
* Putting back into deploy state.
* Adding links in CHANGELOG.
* Remove clippy from changelog.
* feat(tokenizers): add truncate test case
* !feat(tokenizer): truncate right
* refacto(tokenizers): clippy
* feat(bindings): update bindings for truncate()
* fix(tokenizers): remove unsafe code
* refacto(tokenizers): truncate direction
* truncate direction enum
* compute parts ranges beforehand
* 2n space because encoding is dropped at the end of procedure
* update bindings
* add pip install in python bindings' make test
* fix(node): clippy asks to use unwrap_or_else
* fix(node): lint
* refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)>
* refacto(bindings): add match syntax
* refacto(tokenizers): use mem::replace instead of mem::swap
* refacto(tokenizers): assign value the normal way
* Switch git dependencies in Cargo.toml back to regular versions
The rayon-cond issue turned out to be a rustc bug that has been fixed for a while
(see cuviper/rayon-cond#2), so we can revert the git dependency.
numpy has released the commit in question as part of 0.12.
* Also update Cargo.lock files
Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
* Doc - Fix typo (And instance of -> An instance of)
* Add missing text_signature for WordLevel.from_file
Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
* add test that verifies that an exception is raised for the missing unknown token
* style
* add test tokens