06025e4ca1
Adding `Sequence` for `PostProcessor`. ( #1052 )
...
* Adding `Sequence` for `PostProcessor`.
* Fixing node? Writing in the dark here, don't have Python2.7
* `undefined` is not accepted.
* Other test.
2022-08-25 14:50:06 +02:00
460bdded80
Modify `Processor` trait to support chaining. ( #1054 )
...
0 modifications yet, everything will consume the vector.
Every test should be green without any modifications.
2022-08-24 19:49:23 +02:00
b1c9bc68b5
Updating code according to clippy. ( #1048 )
...
- Adding `Eq` where possible
- Denied the ref deref warnings as it was spamming and solution not
really better.
2022-08-24 19:45:15 +02:00
eb2213842b
Update README.md ( #1019 )
...
* Update README.md
Add reference to normalizer blog post
* Update lib.rs
* Fixing PR + clippy on node.
* Update readme to match docstring.
* Other clippy warning.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2022-07-19 09:54:29 +02:00
adf90dcd72
Adding `unstable_wasm` feature + example to run `tokenizers` on wasm. ( #1009 )
...
* Adding `unstable_wasm` feature + example to run `tokenizers` on wasm.
Co-Authored-By: josephrocca <1167575+josephrocca@users.noreply.github.com >
Co-Authored-By: Matthias Brunel <matthias.brunel@mithrilsecurity.io >
* Adding some serialization tests.
* Updating with comments.
Co-authored-by: josephrocca <1167575+josephrocca@users.noreply.github.com >
Co-authored-by: Matthias Brunel <matthias.brunel@mithrilsecurity.io >
2022-06-10 14:58:02 +02:00
943b5421aa
Changing `Decoder` trait to be more composable. ( #938 ) ( #1008 )
...
* Changing `Decoder` trait to be more composable. (#938 )
* Changing `Decoder` trait to be more composable.
Fix #872
* Fixing Python side.
* Fixing test.
* Updating cleanup signature, removing turbofish.
* Adding `Sequence` Decoder.
2022-06-02 14:43:42 +02:00
8a9bb28f46
Preparing for 0.12.1 ( #978 )
...
* Preparing for 0.12.1
* Updated the changelog.
2022-04-12 17:57:33 +02:00
ec43947786
Revert "Changing `Decoder` trait to be more composable. ( #938 )" ( #971 )
...
This reverts commit cdabef14c4.
2022-04-04 09:43:28 +02:00
0eb7455fe5
Preparing `0.12` release. ( #967 )
...
* Preparing `0.12` release.
* Fix click version: https://github.com/psf/black/issues/2964
2022-03-31 11:06:33 +02:00
28cd3dce2a
Bump minimist from 1.2.5 to 1.2.6 in /bindings/node ( #966 )
...
Bumps [minimist](https://github.com/substack/minimist ) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases )
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6 )
---
updated-dependencies:
- dependency-name: minimist
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-28 09:52:43 +02:00
cd730594e9
Fixing issue with ConvBert not being able to save because of holes in ( #954 )
...
the vocab.
2022-03-21 19:28:49 +01:00
daa4dd2288
Making the regex in ByteLevel optional. ( #939 )
...
* Making the regex in ByteLevel optional.
* Changed the stub.
* Better stub.
* Typo fix.
* Remove bad comments.
2022-03-18 09:03:20 +01:00
cdabef14c4
Changing `Decoder` trait to be more composable. ( #938 )
...
* Changing `Decoder` trait to be more composable.
Fix #872
* Fixing Python side.
* Fixing test.
* Updating cleanup signature, removing turbofish.
2022-03-17 10:32:09 +01:00
4b6055d4fb
Adding pickling support for trainers ( #949 )
...
* TMP.
* Adding support for pickling Python trainers.
* Remove not warranted files + missed naming updates.
* Stubbing.
* Making sure serialized format is written in python tests.
2022-03-14 12:18:11 +01:00
a4a68de98a
Workarounds publishing issues:
...
- Upgrade package-lock.json (cannot find VS code attempt)
- Use published `macro_rules_attribute` so `cargo publish` works.
2022-02-28 11:16:46 +01:00
ffaee13994
Preparing for 0.11.6 release.
2022-02-28 10:20:49 +01:00
88d718207a
tokenizer.save has the wrong arguments compared to documentation ( #901 )
...
* tokenizer.save has the wrong arguments compared to documentation
* Fixing doc of `save` function.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2022-02-15 17:55:55 +01:00
9b85424520
Version bump.
2022-01-17 22:30:25 +01:00
1a84958cc8
Fixing bad deserialization following inclusion of a default for `Punctuation`. ( #884 )
...
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.
* don't remove the type now...
* Adding slow test to run on all the tokenizers of the hub.
* `PartialEq` everywhere.
* Forcing `type` to exist on the `pre_tokenizers`.
2022-01-17 22:28:25 +01:00
ab9a2f3100
Update versions.
2022-01-17 09:40:01 +01:00
b18b572ed2
Bump shelljs from 0.8.4 to 0.8.5 in /bindings/node ( #881 )
...
Bumps [shelljs](https://github.com/shelljs/shelljs ) from 0.8.4 to 0.8.5.
- [Release notes](https://github.com/shelljs/shelljs/releases )
- [Changelog](https://github.com/shelljs/shelljs/blob/master/CHANGELOG.md )
- [Commits](https://github.com/shelljs/shelljs/compare/v0.8.4...v0.8.5 )
---
updated-dependencies:
- dependency-name: shelljs
dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-17 09:26:09 +01:00
076319d542
Aho corasick version for many added tokens. ( #871 )
...
* Aho corasick version.
* Remove test file.
* Compile on `stable`.
2022-01-06 16:04:51 +01:00
152880ab3e
Adding truncation_side within `TruncationParams`. ( #860 )
...
* Add truncation to enable_truncation
* Fix typo
* Adding truncation_side within `TruncationParams`.
* Node serialization of this direction param.
* Update the test.
* Fixing warnings/lint.
* Adding stuff (can't local debug :( )
* Slow loop... ;(
* Stub.py.
Co-authored-by: Niels Rogge <niels.rogge1@gmail.com >
2021-12-28 12:37:06 +01:00
c4c9de23a5
Feature: Handle invalid truncate direction ( #858 )
...
* refacto: TruncateDirection -> TruncationDirection
* feat(node): invalid direction will throw
* feat(python): invalid direction will throw
* Update bindings/node/lib/bindings/raw-encoding.test.ts
* Update bindings/python/tests/bindings/test_encoding.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2021-12-27 14:31:57 +01:00
04368b1998
Truncate Right ( #841 )
...
* feat(tokenizers): add truncate test case
* !feat(tokenizer): truncate right
* refacto(tokenizers): clippy
* feat(bindings): update bindings for truncate()
* fix(tokenizers): remove unsafe code
* refacto(tokenizers): truncate direction
* truncate direction enum
* compute parts ranges beforehand
* 2n space because encoding is dropped at the end of procedure
* update bindings
* add pip install in python bindings' make test
* fix(node): clippy asks to use unwrap_or_else
* fix(node): lint
* refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)>
* refacto(bindings): add match syntax
* refacto(tokenizers): use mem::replace instead of mem::swap
* refacto(tokenizers): assign value the normal way
2021-12-23 13:34:21 +01:00
c1100ec542
Clippy fixes. ( #846 )
...
* Clippy fixes.
* Drop support for Python 3.6
* Remove other 3.6
* Re-enabling caches for build (5h + seems too long and issue seems
solved)
https://github.com/actions/virtual-environments/issues/572
* `npm audit fix`.
* Fix yaml ?
* Pyarrow issue fixed: https://github.com/huggingface/datasets/pull/2268
* Installing dev libraries.
* Install python dev elsewhere ?
* Typo.
* No sudo.
* ...
* Testing the GH again.
* Maybe v2 will fix ?
* Fixing tests on MacOS Python 3.8+
2021-12-15 15:55:48 +01:00
fd316bdc61
Update esaxx-rs to 0.1.7 to fix building on windows
2021-09-02 20:11:27 +02:00
884bfb7970
Prepare node release ( #794 )
...
* Node - Update changelog for release
* Update node release to add v14 & v15
Co-authored-by: Huan (李卓桓) <zixia@zixia.net >
* Node - Update version number
* Node - Update dependencies
* Node - Lint
Co-authored-by: Huan (李卓桓) <zixia@zixia.net >
2021-09-02 09:58:01 -04:00
23cf8c69ae
Bump tar from 4.4.17 to 4.4.19 in /bindings/node ( #792 )
...
Bumps [tar](https://github.com/npm/node-tar ) from 4.4.17 to 4.4.19.
- [Release notes](https://github.com/npm/node-tar/releases )
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md )
- [Commits](https://github.com/npm/node-tar/compare/v4.4.17...v4.4.19 )
---
updated-dependencies:
- dependency-name: tar
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-09-02 08:06:54 -04:00
35c96e5e3f
Add tests for from_pretrained
2021-08-31 09:00:05 -04:00
528c9a532e
Node - Add bindings to Tokenizer.from_pretrained
2021-08-31 09:00:05 -04:00
5982498195
Switch git dependencies in Cargo.toml back to regular versions ( #728 )
...
* Switch git dependencies in Cargo.toml back to regular versions
rayon-cond turned out to be a rustc bug that has been fixed for a while
(see cuviper/rayon-cond#2 ), so we can revert the git dependency.
numpy has released the commit in question as part of 0.12.
* Also update Cargo.lock files
Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com >
2021-08-13 09:32:00 -04:00
e2bf8daa3a
Add SplitDelimiterBehavior to Punctuation constructor ( #657 )
...
Resolves : #642
2021-08-13 09:19:23 -04:00
46bed542fa
Bump path-parse from 1.0.6 to 1.0.7 in /bindings/node ( #774 )
...
Bumps [path-parse](https://github.com/jbgutierrez/path-parse ) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases )
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7 )
---
updated-dependencies:
- dependency-name: path-parse
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:41:25 -04:00
ab3d3bcbfb
Bump tar from 4.4.13 to 4.4.17 in /bindings/node ( #775 )
...
Bumps [tar](https://github.com/npm/node-tar ) from 4.4.13 to 4.4.17.
- [Release notes](https://github.com/npm/node-tar/releases )
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md )
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.17 )
---
updated-dependencies:
- dependency-name: tar
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:31:47 -04:00
5d1b0a9381
Bump glob-parent from 5.1.1 to 5.1.2 in /bindings/node ( #734 )
...
Bumps [glob-parent](https://github.com/gulpjs/glob-parent ) from 5.1.1 to 5.1.2.
- [Release notes](https://github.com/gulpjs/glob-parent/releases )
- [Changelog](https://github.com/gulpjs/glob-parent/blob/main/CHANGELOG.md )
- [Commits](https://github.com/gulpjs/glob-parent/compare/v5.1.1...v5.1.2 )
---
updated-dependencies:
- dependency-name: glob-parent
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:21:00 -04:00
96c122ccf6
Bump ws from 7.3.1 to 7.4.6 in /bindings/node ( #721 )
...
Bumps [ws](https://github.com/websockets/ws ) from 7.3.1 to 7.4.6.
- [Release notes](https://github.com/websockets/ws/releases )
- [Commits](https://github.com/websockets/ws/compare/7.3.1...7.4.6 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:20:36 -04:00
d83772d62c
Fixing tokenizers with 1.53 (updated some dependencies + clippy) ( #764 )
2021-07-21 09:58:38 +02:00
bd19584580
Bump lodash from 4.17.19 to 4.17.21 in /bindings/node ( #701 )
...
Bumps [lodash](https://github.com/lodash/lodash ) from 4.17.19 to 4.17.21.
- [Release notes](https://github.com/lodash/lodash/releases )
- [Commits](https://github.com/lodash/lodash/compare/4.17.19...4.17.21 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-05-20 14:22:02 -04:00
8f639b42ea
Bump hosted-git-info from 2.8.8 to 2.8.9 in /bindings/node ( #702 )
...
Bumps [hosted-git-info](https://github.com/npm/hosted-git-info ) from 2.8.8 to 2.8.9.
- [Release notes](https://github.com/npm/hosted-git-info/releases )
- [Changelog](https://github.com/npm/hosted-git-info/blob/v2.8.9/CHANGELOG.md )
- [Commits](https://github.com/npm/hosted-git-info/compare/v2.8.8...v2.8.9 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-05-20 14:21:52 -04:00
7574349223
Bump y18n from 4.0.0 to 4.0.3 in /bindings/node ( #708 )
...
Bumps [y18n](https://github.com/yargs/y18n ) from 4.0.0 to 4.0.3.
- [Release notes](https://github.com/yargs/y18n/releases )
- [Changelog](https://github.com/yargs/y18n/blob/y18n-v4.0.3/CHANGELOG.md )
- [Commits](https://github.com/yargs/y18n/compare/v4.0.0...y18n-v4.0.3 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-05-20 14:21:40 -04:00
3cf957e6f8
Bump handlebars from 4.7.6 to 4.7.7 in /bindings/node ( #700 )
...
Bumps [handlebars](https://github.com/wycats/handlebars.js ) from 4.7.6 to 4.7.7.
- [Release notes](https://github.com/wycats/handlebars.js/releases )
- [Changelog](https://github.com/handlebars-lang/handlebars.js/blob/master/release-notes.md )
- [Commits](https://github.com/wycats/handlebars.js/compare/v4.7.6...v4.7.7 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-05-20 14:21:28 -04:00
2e2e7558f7
Add CTC Decoder for Wave2Vec models ( #693 )
...
* Rust - add a CTCDecoder as a separate mod
* Adding bindings to Node + Python.
* Clippy update.
* Stub.
* Fixing roberta.json URLs.
* Moving test files to hf.co.
* Update cargo check and clippy to 1.52.
* Inner ':' actually is used for domains in sphinx.
Making `domain` work correctly was just too much work so I went the easy
way and have global roles for the custom rust extension.
* Update struct naming and docs
* Update changelog
Co-authored-by: Thomaub <github.thomaub@gmail.com >
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com >
2021-05-20 09:30:09 -04:00
e1627654b4
Fix Clippy warnings for Rust 1.51
2021-04-05 16:05:48 -04:00
53ab5a470c
Allow unnecessary_wraps for node bindings
2021-03-16 12:32:06 -04:00
817c5ad317
Fix clippy warnings for rust 1.49
2021-01-06 15:03:33 -05:00
ae6534f12d
Bump ini from 1.3.5 to 1.3.8 in /bindings/node ( #561 )
...
Bumps [ini](https://github.com/isaacs/ini ) from 1.3.5 to 1.3.8.
- [Release notes](https://github.com/isaacs/ini/releases )
- [Commits](https://github.com/isaacs/ini/compare/v1.3.5...v1.3.8 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2020-12-15 11:50:40 -05:00
49bd055519
Node - Update bindings with train_from_files
2020-11-28 12:29:35 -05:00
dd399d2ad0
Split Pre-Tokenizer ( #542 )
...
* start playing around
* make a first version
* refactor
* apply make format
* add python bindings
* add some python binding tests
* correct pre-tokenizers
* update auto-generated bindings
* lint python bindings
* add code node
* add split to docs
* refactor python binding a bit
* cargo fmt
* clippy and fmt in node
* quick updates and fixes
* Oops
* Update node typings
* Update changelog
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com >
2020-11-27 17:07:03 -05:00
13e07da2c8
Node - Add WordLevelTrainer
2020-11-20 13:30:44 -05:00