1870 Commits

Author SHA1 Message Date
dad9c6c0d2 Revert dev changes 2022-04-21 16:08:22 +02:00
95b5d066d5 Update doc build gh workflow to install rust 2022-04-21 09:20:20 +02:00
c2aa87a256 Add setup.py extras["dev"] 2022-04-19 15:14:44 +02:00
5c97125d22 Fix hashlink ids 2022-04-18 12:13:40 +02:00
f6ba840e3e Add @property docs 2022-04-18 11:58:52 +02:00
fd005a7c4e Add doc-builder gh workflows 2022-04-18 09:50:31 +02:00
6eda286ab1 Init new docs 2022-04-18 09:37:14 +02:00
66c9af26f6 Fixing the documentation for ByteLevel in Python (#982)
* Fixing the documentation for `ByteLevel` in Python

* Python stub.py (after rebuilding ofc).
2022-04-14 16:29:50 +02:00
8a9bb28f46 Preparing for 0.12.1 (#978)
* Preparing for 0.12.1

* Updated the changelog.
2022-04-12 17:57:33 +02:00
4a9da798e2 Adding a new document that is the checklist to make (#975)
* Adding a new document that is the checklist to make

a new `tokenizers` release.
This will help make sure nothing is forgotten.

* Update RELEASE.md

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* Update RELEASE.md

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Update RELEASE.md

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Update RELEASE.md

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Adding running full test suite instructions.

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2022-04-12 14:18:09 +02:00
ec43947786 Revert "Changing Decoder trait to be more composable. (#938)" (#971)
This reverts commit cdabef14c4.
2022-04-04 09:43:28 +02:00
23a22da18c Update the builder to use an earlier Windows version, as (2022) is not understood. (#969)
* Update the builder to use an earlier Windows version, as (2022) is not
understood.

* No node for windows.

* Ready to deploy.
2022-03-31 15:00:11 +02:00
0eb7455fe5 Preparing 0.12 release. (#967)
* Preparing `0.12` release.

* Fix click version: https://github.com/psf/black/issues/2964
2022-03-31 11:06:33 +02:00
28cd3dce2a Bump minimist from 1.2.5 to 1.2.6 in /bindings/node (#966)
Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases)
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)

---
updated-dependencies:
- dependency-name: minimist
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-28 09:52:43 +02:00
20ec60aeba Adding a link to the Ruby port of tokenizers (#961) 2022-03-24 17:09:30 +01:00
28fe0e40e7 Preventing yelling on empty OrderedVocab (triggered by pickle.dumps). (#963) 2022-03-24 17:09:18 +01:00
a5f644616b Fix the error test for Python 3.10 (error message is different). (#962) 2022-03-23 10:35:58 +01:00
5a79b71b1d Feature gate CLI and clap dependency (#960) 2022-03-22 10:55:53 +01:00
cd730594e9 Fixing issue with ConvBert not being able to save because of holes in (#954)
the vocab.
2022-03-21 19:28:49 +01:00
1bb9884f45 Fixing the vocab size of the trained Unigram model (#952)
* Fixing the vocab size of the trained Unigram model

* add test for the vocab size of the trained Unigram model

* Revert "add test for the vocab size of the trained Unigram model"

This reverts commit fb8955c831b357d1037548ceaa8789734d544646.

* Fixing the vocab size of the trained Unigram model

* format codes

* get the position of vocab-size calculation out of loop
2022-03-18 18:13:17 +01:00
daa4dd2288 Making the regex in ByteLevel optional. (#939)
* Making the regex in ByteLevel optional.

* Changed the stub.

* Better stub.

* Typo fix.

* Remove bad comments.
2022-03-18 09:03:20 +01:00
cdabef14c4 Changing Decoder trait to be more composable. (#938)
* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.
2022-03-17 10:32:09 +01:00
1f1f86dd32 Use thiserror crate for Errors (#951)
* Use `thiserror` crate for Errors

* cargo fmt

* `#[source]` redundant when `#[from]` is present
2022-03-17 09:38:21 +01:00
4b6055d4fb Adding pickling support for trainers (#949)
* TMP.

* Adding support for pickling Python trainers.

* Remove unwarranted files + missed naming updates.

* Stubbing.

* Making sure serialized format is written in python tests.
2022-03-14 12:18:11 +01:00
71ae5421eb Python - add initial_alphabet to spm unigram trainer (#942)
* Python - add initial_alphabet to spm unigram trainer

* Python - use optional instead of mutable defaults in spm unigram trainer
2022-03-09 09:54:03 +01:00
98249dfb0f Python - add doctype to length in implementations spm unigram (#943) 2022-03-08 11:59:07 +01:00
4a8f5db067 Python - Add length to train_from_iterator in implementations (#937) 2022-03-04 14:11:58 +01:00
845da6d8e8 Feat/m1 manual build (#936)
* feat(bindings): move target compilation flags to correct config file

* feat(bindings): m1 build 'script'

* feat(ci): for loop in bdist_wheel script for m1

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

2022-03-02 14:44:13 +01:00
03e10b606d Publishing a crate does not require a specific OS. 2022-02-28 11:52:52 +01:00
a291d7d4a0 Small fix for publishing rust 2022-02-28 11:50:35 +01:00
1d29a5179c Adding rust release CI. 2022-02-28 11:30:46 +01:00
a4a68de98a Workarounds publishing issues:
- Upgrade package-lock.json (cannot find VS code attempt)
- Use published `macro_rules_attribute` so `cargo publish` works.
2022-02-28 11:16:46 +01:00
472558cc6f Force version of proc_macros. 2022-02-28 10:47:45 +01:00
ffaee13994 Preparing for 0.11.6 release. 2022-02-28 10:20:49 +01:00
9e14648b7f Use Self where possible, some minor refactoring (#923)
* Use Self where possible, some minor refactoring

* fixed test

* fixed n_sequences

* reverted non-Self changes
2022-02-28 10:06:24 +01:00
f24092ac62 Trying to fix loading order of added_tokens. (#924)
* Fixing deserialization order of added_tokens.

* Actually adding a test made things more obvious.

It was a mess to handle `special` outside the notion of `AddedToken`.
This would merit an actual rework, as including `special` within the
token should make everything simpler.
For now we just make our lives easy.

* Cleanup.

* Fixing comment.

* Making the test stronger.
2022-02-25 16:38:12 +01:00
b4c3844ffb Remove comments. (#925)
They are indeed a bit misleading and coupled with the REGEXP anyway.
2022-02-25 16:38:01 +01:00
d98abc50ba Fixing single_word AddedToken. (#919)
* Fixing off by one error in `single_word` AddedToken.

* Real fix for all unicode ranges.

Both `single_word` and `lstrip`, `rstrip` were affected.

* Adding warning when unexpected code path is taken.
2022-02-25 10:32:46 +01:00
48a921d399 in serialization.rs, the supplementary tokens are now added "in batch… (#916)
* in serialization.rs, the supplementary tokens are now added "in batch" to the tokenizer vocabulary, so that the tokenizer trie is built just once (Fix #914)

Building the trie is a very expensive operation: previously it was carried out for each and every token, so that, for large vocabularies, the overall loading time of the vocabulary was unacceptable.

* reformatted code of serialization.rs with rustfmt

* Propose code cleanups.

Co-authored-by: Piercarlo Slavazza <p.slavazza@elibra.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-25 10:32:23 +01:00
2fecdc10dd Update the CHANGELOG. 2022-02-16 13:07:31 +01:00
5679323bbc Minor version bump. 2022-02-16 12:51:11 +01:00
88d718207a tokenizer.save has the wrong arguments compared to documentation (#901)
* tokenizer.save has the wrong arguments compared to documentation

* Fixing doc of `save` function.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-15 17:55:55 +01:00
448054f3c7 fix python3.10 build (#895) 2022-01-28 17:51:51 +01:00
54efb0a2a2 Merge pull request #890 from huggingface/impl_serde_type
Implement `impl_serde_type` macro
2022-01-26 10:55:49 +01:00
ac32784517 Revert change for readability 2022-01-26 10:29:04 +01:00
0200ce4249 Fix typo 2022-01-25 18:39:10 +01:00
fa2ee839bc Update tokenizers/src/utils/mod.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-01-25 18:38:23 +01:00
3784a04fd4 Make impl_serde_type support unit structs also 2022-01-25 17:57:22 +01:00
1adcb63478 Add proc_macros readme 2022-01-24 05:43:03 +01:00
9a9c70563a Implement impl_serde_type macro 2022-01-24 05:15:30 +01:00