Commit Graph

1549 Commits

Author SHA1 Message Date
cdabef14c4 Changing Decoder trait to be more composable. (#938)
* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.
2022-03-17 10:32:09 +01:00
1f1f86dd32 Use thiserror crate for Errors (#951)
* Use `thiserror` crate for Errors

* cargo fmt

* `#[source]` redundant when `#[from]` is present
2022-03-17 09:38:21 +01:00
4b6055d4fb Adding pickling support for trainers (#949)
* TMP.

* Adding support for pickling Python trainers.

* Remove unwarranted files + missed naming updates.

* Stubbing.

* Making sure serialized format is written in python tests.
2022-03-14 12:18:11 +01:00
71ae5421eb Python - add initial_alphabet to spm unigram trainer (#942)
* Python - add initial_alphabet to spm unigram trainer

* Python - use optional instead of mutable defaults in spm unigram trainer
2022-03-09 09:54:03 +01:00
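
The second bullet of 71ae5421eb refers to a well-known Python pitfall: a mutable default argument is created once, at function definition time, and is then shared by every call that omits it. The sketch below illustrates the pattern only. The function names and the `<unk>` entry are hypothetical and are not taken from the actual trainer code.

```python
from typing import List, Optional


def make_alphabet_shared(initial_alphabet: List[str] = []) -> List[str]:
    # Pitfall: the default list is created once, when the function is
    # defined, and is shared by every call that omits the argument.
    initial_alphabet.append("<unk>")
    return initial_alphabet


def make_alphabet(initial_alphabet: Optional[List[str]] = None) -> List[str]:
    # Fix: default to None and build a fresh list on each call.
    if initial_alphabet is None:
        initial_alphabet = []
    alphabet = list(initial_alphabet)  # copy, so the caller's list is untouched
    alphabet.append("<unk>")
    return alphabet
```

Calling `make_alphabet_shared()` repeatedly keeps growing the same list, while `make_alphabet()` returns an independent list each time.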
98249dfb0f Python - add doctype to length in implementations spm unigram (#943) 2022-03-08 11:59:07 +01:00
4a8f5db067 Python - Add length to train_from_iterator in implementations (#937) 2022-03-04 14:11:58 +01:00
845da6d8e8 Feat/m1 manual build (#936)
* feat(bindings): move target compilation flags to correct config file

* feat(bindings): m1 build 'script'

* feat(ci): for loop in bdist_wheel script for m1

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-03-02 14:44:13 +01:00
03e10b606d Publishing a crate does not require a specific OS. 2022-02-28 11:52:52 +01:00
a291d7d4a0 Small fix for publishing rust 2022-02-28 11:50:35 +01:00
1d29a5179c Adding rust release CI. 2022-02-28 11:30:46 +01:00
a4a68de98a Workarounds for publishing issues:
- Upgrade package-lock.json (cannot find VS code attempt)
- Use published `macro_rules_attribute` so `cargo publish` works.
2022-02-28 11:16:46 +01:00
472558cc6f Force version of proc_macros. 2022-02-28 10:47:45 +01:00
ffaee13994 Preparing for 0.11.6 release. 2022-02-28 10:20:49 +01:00
9e14648b7f Use Self where possible, some minor refactoring (#923)
* Use Self where possible, some minor refactoring

* fixed test

* fixed n_sequences

* reverted non-Self changes
2022-02-28 10:06:24 +01:00
f24092ac62 Trying to fix loading order of added_tokens. (#924)
* Fixing deserialization order of added_tokens.

* Actually adding a test made things more obvious.

It was a mess to handle `special` outside the notion of `AddedToken`.
This would merit an actual rework, as including `special` within the
token should make everything simpler.
For now we just make our lives easy.

* Cleanup.

* Fixing comment.

* Making the test stronger.
2022-02-25 16:38:12 +01:00
b4c3844ffb Remove comments. (#925)
They are indeed a bit misleading and coupled with the REGEXP anyway.
2022-02-25 16:38:01 +01:00
d98abc50ba Fixing single_word AddedToken. (#919)
* Fixing off by one error in `single_word` AddedToken.

* Real fix for all unicode ranges.

Both `single_word` and `lstrip`, `rstrip` were affected.

* Adding warning when unexpected code path is taken.
2022-02-25 10:32:46 +01:00
48a921d399 in serialization.rs, the supplementary tokens are now added "in batch… (#916)
* in serialization.rs, the supplementary tokens are now added "in batch" to the tokenizer vocabulary, so that the tokenizer trie is built just once (Fix #914)

Building the trie is a very expensive operation. Previously it was carried out for each and every token, so for large vocabularies the overall loading time was unacceptable.

* reformatted code of serialization.rs with rustfmt

* Propose code cleanups.

Co-authored-by: Piercarlo Slavazza <p.slavazza@elibra.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-25 10:32:23 +01:00
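
The cost difference described in 48a921d399 can be illustrated with a small sketch. It is written in Python for brevity, whereas the actual change is in the Rust `serialization.rs`. The `TokenIndex` class, its methods, and the `rebuilds` counter are hypothetical stand-ins and do not mirror the library's API. The sketch assumes only what the commit message states: every insertion call triggers a full trie rebuild.

```python
from typing import Dict, Iterable, List

END = "<end>"  # multi-character key, so it cannot collide with a single char


class TokenIndex:
    """Hypothetical stand-in for a tokenizer's added-token store."""

    def __init__(self) -> None:
        self.tokens: List[str] = []
        self.trie: Dict = {}
        self.rebuilds = 0  # counts how often the expensive step runs

    def _rebuild_trie(self) -> None:
        # Expensive step: rebuild the whole trie from every known token.
        self.rebuilds += 1
        root: Dict = {}
        for token in self.tokens:
            node = root
            for ch in token:
                node = node.setdefault(ch, {})
            node[END] = token
        self.trie = root

    def add_tokens(self, new_tokens: Iterable[str]) -> None:
        # One rebuild per call, however many tokens are passed in.
        self.tokens.extend(new_tokens)
        self._rebuild_trie()


def load_one_by_one(tokens: List[str]) -> TokenIndex:
    index = TokenIndex()
    for token in tokens:
        index.add_tokens([token])  # N calls, so N rebuilds
    return index


def load_in_batch(tokens: List[str]) -> TokenIndex:
    index = TokenIndex()
    index.add_tokens(tokens)  # 1 call, so 1 rebuild
    return index
```

Both loaders end with the same trie. The one-by-one variant rebuilds it once per token, which is quadratic work overall, while the batch variant rebuilds it once.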
2fecdc10dd Update the CHANGELOG. 2022-02-16 13:07:31 +01:00
5679323bbc Minor version bump. 2022-02-16 12:51:11 +01:00
88d718207a tokenizer.save has the wrong arguments compared to documentation (#901)
* tokenizer.save has the wrong arguments compared to documentation

* Fixing doc of `save` function.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-02-15 17:55:55 +01:00
448054f3c7 fix python3.10 build (#895) 2022-01-28 17:51:51 +01:00
54efb0a2a2 Merge pull request #890 from huggingface/impl_serde_type
Implement `impl_serde_type` macro
2022-01-26 10:55:49 +01:00
ac32784517 Revert change for readability 2022-01-26 10:29:04 +01:00
0200ce4249 Fix typo 2022-01-25 18:39:10 +01:00
fa2ee839bc Update tokenizers/src/utils/mod.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-01-25 18:38:23 +01:00
3784a04fd4 Make impl_serde_type support unit structs also 2022-01-25 17:57:22 +01:00
1adcb63478 Add proc_macros readme 2022-01-24 05:43:03 +01:00
9a9c70563a Implement impl_serde_type macro 2022-01-24 05:15:30 +01:00
a8e07d734f Changelog. 2022-01-17 22:31:54 +01:00
9b85424520 Version bump. 2022-01-17 22:30:25 +01:00
1a84958cc8 Fixing bad deserialization following inclusion of a default for Punctuation. (#884)
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.

* don't remove the type now...

* Adding slow test to run on all the tokenizers of the hub.

* `PartialEq` everywhere.

* Forcing `type` to exist on the `pre_tokenizers`.
2022-01-17 22:28:25 +01:00
c2fd765087 Update Cargo.lock for Python. 2022-01-17 10:32:46 +01:00
2c9d039ed0 Update doc versions, + downgrade 3.10 for conda. 2022-01-17 10:22:10 +01:00
a4cf53f6a7 Update CHANGELOG. 2022-01-17 09:56:56 +01:00
ab9a2f3100 Update versions. 2022-01-17 09:40:01 +01:00
4a750f1a57 Fixing Punctuation deserialize without argument. (#882) 2022-01-17 09:27:22 +01:00
b18b572ed2 Bump shelljs from 0.8.4 to 0.8.5 in /bindings/node (#881)
Bumps [shelljs](https://github.com/shelljs/shelljs) from 0.8.4 to 0.8.5.
- [Release notes](https://github.com/shelljs/shelljs/releases)
- [Changelog](https://github.com/shelljs/shelljs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/shelljs/shelljs/compare/v0.8.4...v0.8.5)

---
updated-dependencies:
- dependency-name: shelljs
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-17 09:26:09 +01:00
cabbecb96c add python3.10 release (#877)
* add missing python3.9 classifier

* add python3.10 release

* run tests on 3.10

* Revert "run tests on 3.10"

This reverts commit ceed64249e54b6ec622b06c59bf47da7c6dfc1b0.
2022-01-12 09:42:13 +01:00
076319d542 Aho corasick version for many added tokens. (#871)
* Aho corasick version.

* Remove test file.

* Compile on `stable`.
2022-01-06 16:04:51 +01:00
fb837b4adb Fix wordlevel encode <unk> (#870)
* Fix wordlevel encode `<unk>`

* Better unit test name

* Refactor
2022-01-06 16:04:16 +01:00
8e0d66a254 New python version. 2022-01-04 14:58:02 +01:00
6972e49f1d Fix the clippy warnings. (#869) 2022-01-04 14:32:07 +01:00
1054e243e2 Fix invalid continuing subword prefix. (#864)
* Creating failing test for invalid continuing subword prefix.

* Test in rust + the associated fix.

* Clippy.

* Black.
2022-01-04 14:25:35 +01:00
4122a33f09 Fixing missing direction in TruncationParams. (#868) 2022-01-04 14:21:46 +01:00
7069988ffe Update to 0.11.1 2021-12-28 13:59:31 +01:00
152880ab3e Adding truncation_side within TruncationParams. (#860)
* Add truncation to enable_truncation

* Fix typo

* Adding truncation_side within `TruncationParams`.

* Node serialization of this direction param.

* Update the test.

* Fixing warnings/lint.

* Adding stuff (can't debug locally :( )

* Slow loop... ;(

* Stub.py.

Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>
2021-12-28 12:37:06 +01:00
c4c9de23a5 Feature: Handle invalid truncate direction (#858)
* refactor: TruncateDirection -> TruncationDirection

* feat(node): invalid direction will throw

* feat(python): invalid direction will throw

* Update bindings/node/lib/bindings/raw-encoding.test.ts

* Update bindings/python/tests/bindings/test_encoding.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2021-12-27 14:31:57 +01:00
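
Taken together, 152880ab3e and c4c9de23a5 above describe a truncation direction parameter that rejects unknown values. A minimal sketch of that behavior follows, under stated assumptions. The `parse_direction` and `truncate` helpers are hypothetical, the `"left"`/`"right"` strings and the `"right"` default are assumed, and `ValueError` stands in for whatever error the real bindings raise.

```python
from enum import Enum
from typing import List


class TruncationDirection(Enum):
    LEFT = "left"
    RIGHT = "right"


def parse_direction(value: str) -> TruncationDirection:
    # Reject unknown values instead of silently falling back to a default.
    try:
        return TruncationDirection(value)
    except ValueError:
        raise ValueError(
            f"Invalid truncation direction {value!r}, expected 'left' or 'right'"
        ) from None


def truncate(ids: List[int], max_length: int, direction: str = "right") -> List[int]:
    side = parse_direction(direction)
    if len(ids) <= max_length:
        return list(ids)
    if side is TruncationDirection.RIGHT:
        return ids[:max_length]  # drop tokens from the end
    return ids[len(ids) - max_length:]  # drop tokens from the start
```

The left branch slices from `len(ids) - max_length` rather than `-max_length`, so that `max_length == 0` yields an empty list instead of the full sequence.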
38a85b2112 Last touches for conda hopefully
- Missing env activation for manylinux + upload
2021-12-24 08:05:09 +01:00
943f4ef469 Preparing for 0.11.0 Re-release. (#856)
* Starting from master again.

Upgrade libssl everywhere on quay

Extra is ubuntu based (running the quay in a container).

making only extra run + attempt to fix ssl update.

Extra with newer openssl versions.

`-y`.

Use checkout@v2 + remove `-` from environment name.

Debugging back the conda release..

Attempt to use `base` env.

3.7 requires `activate-environment: true`.

macOS and Windows don't run on manylinux.

Remove yum on Windows/macOS.

Miniconda doesn't like manylinux2014 anymore ?

Attempting different approach for manylinux + conda.

Use wget.

Extra bracket.

Executing $filename

Activate the env.

Activate the env on every step that requires it.

Openssl-devel.

Activating env for extracting version ?

Retest all workflows.

Manylinux2010 requires checkout@v1

Run on tag for extra and conda again.

openssl-devel.

* Putting back into deploy state.

* Adding links in CHANGELOG.

* Remove clippy from changelog.
2021-12-23 16:43:48 +01:00