Nicolas Patry
23a22da18c
Update the builder to use an earlier Windows version (2022 is not understood). ( #969 )
...
* Update the builder to use an earlier Windows version (2022 is not
understood).
* No node for windows.
* Ready to deploy.
2022-03-31 15:00:11 +02:00
Nicolas Patry
0eb7455fe5
Preparing 0.12 release. ( #967 )
...
* Preparing `0.12` release.
* Fix click version: https://github.com/psf/black/issues/2964
2022-03-31 11:06:33 +02:00
dependabot[bot]
28cd3dce2a
Bump minimist from 1.2.5 to 1.2.6 in /bindings/node ( #966 )
...
Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases)
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)
---
updated-dependencies:
- dependency-name: minimist
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-28 09:52:43 +02:00
Nicolas Patry
20ec60aeba
Adding a link to the Ruby port of tokenizers ( #961 )
2022-03-24 17:09:30 +01:00
Nicolas Patry
28fe0e40e7
Preventing yelling on empty OrderedVocab (triggered by pickle.dumps). ( #963 )
2022-03-24 17:09:18 +01:00
Nicolas Patry
a5f644616b
Fix the error test for Python 3.10 (error message is different). ( #962 )
2022-03-23 10:35:58 +01:00
MarcusGrass
5a79b71b1d
Feature gate CLI and clap dependency ( #960 )
2022-03-22 10:55:53 +01:00
Nicolas Patry
cd730594e9
Fixing issue with ConvBert not being able to save because of holes in ( #954 )
...
the vocab.
2022-03-21 19:28:49 +01:00
Kaito Sugimoto
1bb9884f45
Fixing the vocab size of the trained Unigram model ( #952 )
...
* Fixing the vocab size of the trained Unigram model
* add test for the vocab size of the trained Unigram model
* Revert "add test for the vocab size of the trained Unigram model"
This reverts commit fb8955c831b357d1037548ceaa8789734d544646.
* Fixing the vocab size of the trained Unigram model
* format codes
* move the vocab-size calculation out of the loop
2022-03-18 18:13:17 +01:00
Nicolas Patry
daa4dd2288
Making the regex in ByteLevel optional. ( #939 )
...
* Making the regex in ByteLevel optional.
* Changed the stub.
* Better stub.
* Typo fix.
* Remove bad comments.
2022-03-18 09:03:20 +01:00
Nicolas Patry
cdabef14c4
Changing Decoder trait to be more composable. ( #938 )
...
* Changing `Decoder` trait to be more composable.
Fix #872
* Fixing Python side.
* Fixing test.
* Updating cleanup signature, removing turbofish.
2022-03-17 10:32:09 +01:00
Mishig Davaadorj
1f1f86dd32
Use thiserror crate for Errors ( #951 )
...
* Use `thiserror` crate for Errors
* cargo fmt
* `#[source]` redundant when `#[from]` is present
2022-03-17 09:38:21 +01:00
Nicolas Patry
4b6055d4fb
Adding pickling support for trainers ( #949 )
...
* TMP.
* Adding support for pickling Python trainers.
* Remove unwarranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
2022-03-14 12:18:11 +01:00
dctelus
71ae5421eb
Python - add initial_alphabet to spm unigram trainer ( #942 )
...
* Python - add initial_alphabet to spm unigram trainer
* Python - use optional instead of mutable defaults in spm unigram trainer
2022-03-09 09:54:03 +01:00
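The second bullet of the commit above refers to a common Python pitfall: a mutable default argument is created once, when the function is defined, and is then shared by every call. A minimal sketch of the difference follows. The function names `train_mutable` and `train_optional` are hypothetical illustrations and are not part of the tokenizers API.

```python
from typing import List, Optional


def train_mutable(initial_alphabet: List[str] = []) -> List[str]:
    # Pitfall: the default list is created once and reused by every call.
    initial_alphabet.append("a")
    return initial_alphabet


def train_optional(initial_alphabet: Optional[List[str]] = None) -> List[str]:
    # Safer: None acts as a sentinel and a fresh list is built on each call.
    if initial_alphabet is None:
        initial_alphabet = []
    initial_alphabet.append("a")
    return initial_alphabet


# Two calls without arguments share (and keep growing) the same list.
first = train_mutable()
second = train_mutable()
print(first is second, len(second))

# With the Optional pattern each call gets its own list.
third = train_optional()
fourth = train_optional()
print(third is fourth, len(fourth))
```

Using `None` as the default keeps state from leaking between calls, which is presumably why the trainer signature was changed.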
dctelus
98249dfb0f
Python - add doctype to length in implementations spm unigram ( #943 )
2022-03-08 11:59:07 +01:00
dctelus
4a8f5db067
Python - Add length to train_from_iterator in implementations ( #937 )
2022-03-04 14:11:58 +01:00
Luc Georges
845da6d8e8
Feat/m1 manual build ( #936 )
...
* feat(bindings): move target compilation flags to correct config file
* feat(bindings): m1 build 'script'
* feat(ci): for loop in bdist_wheel script for m1
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-03-02 14:44:13 +01:00
Nicolas Patry
03e10b606d
Publishing a crate does not require a specific OS.
2022-02-28 11:52:52 +01:00
Nicolas Patry
a291d7d4a0
Small fix for publishing rust
2022-02-28 11:50:35 +01:00
Nicolas Patry
1d29a5179c
Adding rust release CI.
2022-02-28 11:30:46 +01:00
Nicolas Patry
a4a68de98a
Work around publishing issues:
...
- Upgrade package-lock.json (cannot find VS code attempt)
- Use published `macro_rules_attribute` so `cargo publish` works.
2022-02-28 11:16:46 +01:00
Nicolas Patry
472558cc6f
Force version of proc_macros.
2022-02-28 10:47:45 +01:00
Nicolas Patry
ffaee13994
Preparing for 0.11.6 release.
2022-02-28 10:20:49 +01:00
adamnemecek
9e14648b7f
Use Self where possible, some minor refactoring ( #923 )
...
* Use Self where possible, some minor refactoring
* fixed test
* fixed n_sequences
* reverted non-Self changes
2022-02-28 10:06:24 +01:00
Nicolas Patry
f24092ac62
Trying to fix loading order of added_tokens. ( #924 )
...
* Fixing deserialization order of added_tokens.
* Actually adding a test made things more obvious.
It was a mess to handle `special` outside the notion of `AddedToken`.
This would merit an actual rework, as including `special` within the
token should make everything simpler.
For now we just make our lives easy.
* Cleanup.
* Fixing comment.
* Making the test stronger.
2022-02-25 16:38:12 +01:00
Nicolas Patry
b4c3844ffb
Remove comments. ( #925 )
...
They are indeed a bit misleading and coupled with the REGEXP anyway.
2022-02-25 16:38:01 +01:00
Nicolas Patry
d98abc50ba
Fixing single_word AddedToken. ( #919 )
...
* Fixing off by one error in `single_word` AddedToken.
* Real fix for all unicode ranges.
Both `single_word` and `lstrip`, `rstrip` were affected.
* Adding warning when unexpected code path is taken.
2022-02-25 10:32:46 +01:00
PiercarloSlavazza
48a921d399
in serialization.rs, the supplementary tokens are now added "in batch… ( #916 )
...
* In serialization.rs, the supplementary tokens are now added "in batch" to the tokenizer vocabulary, so that the tokenizer trie is built just once (Fix #914)
Building the trie is a very expensive operation. Previously it was carried out for each and every token, so for large vocabularies the overall loading time was unacceptable.
* reformatted code of serialization.rs with rustfmt
* Propose code cleanups.
Co-authored-by: Piercarlo Slavazza <p.slavazza@elibra.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-25 10:32:23 +01:00
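The reasoning in the commit above (rebuild the trie once per batch rather than once per token) can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyVocab`, `add_token`, `add_tokens`, and the `rebuilds` counter are hypothetical names, and the toy trie is a plain nested dict, not the Rust structure used in serialization.rs.

```python
from typing import Dict, Iterable, List


class ToyVocab:
    """Hypothetical stand-in for a tokenizer vocabulary backed by a trie."""

    def __init__(self) -> None:
        self.tokens: List[str] = []
        self.trie: Dict = {}
        self.rebuilds = 0  # how many times the trie was rebuilt

    def _rebuild_trie(self) -> None:
        # Rebuild the whole trie from scratch; cost grows with vocabulary size.
        self.rebuilds += 1
        root: Dict = {}
        for token in self.tokens:
            node = root
            for ch in token:
                node = node.setdefault(ch, {})
            node[""] = token  # empty key marks the end of a token
        self.trie = root

    def add_token(self, token: str) -> None:
        # Adding tokens one at a time triggers one rebuild per token.
        self.add_tokens([token])

    def add_tokens(self, tokens: Iterable[str]) -> None:
        # Adding a batch triggers a single rebuild for the whole batch.
        self.tokens.extend(tokens)
        self._rebuild_trie()


new_tokens = [f"<extra_{i}>" for i in range(1000)]

one_by_one = ToyVocab()
for tok in new_tokens:
    one_by_one.add_token(tok)

batched = ToyVocab()
batched.add_tokens(new_tokens)

print(one_by_one.rebuilds, batched.rebuilds)
```

Both objects end up with the same trie. Only the number of rebuilds differs, and that count is what dominates loading time when the vocabulary is large.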
Nicolas Patry
2fecdc10dd
Update the CHANGELOG.
2022-02-16 13:07:31 +01:00
Nicolas Patry
5679323bbc
Minor version bump.
2022-02-16 12:51:11 +01:00
Thomas Wang
88d718207a
tokenizer.save has the wrong arguments compared to documentation ( #901 )
...
* tokenizer.save has the wrong arguments compared to documentation
* Fixing doc of `save` function.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-15 17:55:55 +01:00
JC Louis
448054f3c7
fix python3.10 build ( #895 )
2022-01-28 17:51:51 +01:00
Mishig Davaadorj
54efb0a2a2
Merge pull request #890 from huggingface/impl_serde_type
...
Implement `impl_serde_type` macro
2022-01-26 10:55:49 +01:00
Mishig Davaadorj
ac32784517
Revert change for readability
2022-01-26 10:29:04 +01:00
Mishig Davaadorj
0200ce4249
Fix typo
2022-01-25 18:39:10 +01:00
Mishig Davaadorj
fa2ee839bc
Update tokenizers/src/utils/mod.rs
...
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-01-25 18:38:23 +01:00
Mishig Davaadorj
3784a04fd4
Make impl_serde_type support unit structs also
2022-01-25 17:57:22 +01:00
Mishig Davaadorj
1adcb63478
Add proc_macros readme
2022-01-24 05:43:03 +01:00
Mishig Davaadorj
9a9c70563a
Implement impl_serde_type macro
2022-01-24 05:15:30 +01:00
Nicolas Patry
a8e07d734f
Changelog.
2022-01-17 22:31:54 +01:00
Nicolas Patry
9b85424520
Version bump.
2022-01-17 22:30:25 +01:00
Nicolas Patry
1a84958cc8
Fixing bad deserialization following inclusion of a default for Punctuation. ( #884 )
...
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
2022-01-17 22:28:25 +01:00
Nicolas Patry
c2fd765087
Update Cargo.lock for Python.
2022-01-17 10:32:46 +01:00
Nicolas Patry
2c9d039ed0
Update doc versions, + downgrade 3.10 for conda.
2022-01-17 10:22:10 +01:00
Nicolas Patry
a4cf53f6a7
Update CHANGELOG.
2022-01-17 09:56:56 +01:00
Nicolas Patry
ab9a2f3100
Update versions.
2022-01-17 09:40:01 +01:00
Nicolas Patry
4a750f1a57
Fixing Punctuation deserialize without argument. ( #882 )
2022-01-17 09:27:22 +01:00
dependabot[bot]
b18b572ed2
Bump shelljs from 0.8.4 to 0.8.5 in /bindings/node ( #881 )
...
Bumps [shelljs](https://github.com/shelljs/shelljs) from 0.8.4 to 0.8.5.
- [Release notes](https://github.com/shelljs/shelljs/releases)
- [Changelog](https://github.com/shelljs/shelljs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/shelljs/shelljs/compare/v0.8.4...v0.8.5)
---
updated-dependencies:
- dependency-name: shelljs
dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-17 09:26:09 +01:00
JC Louis
cabbecb96c
add python3.10 release ( #877 )
...
* add missing python3.9 classifier
* add python3.10 release
* run tests on 3.10
* Revert "run tests on 3.10"
This reverts commit ceed64249e54b6ec622b06c59bf47da7c6dfc1b0.
2022-01-12 09:42:13 +01:00
Nicolas Patry
076319d542
Aho-Corasick version for many added tokens. ( #871 )
...
* Aho-Corasick version.
* Remove test file.
* Compile on `stable`.
2022-01-06 16:04:51 +01:00