d19bc63c67
Merge pull request #1212 from huggingface/fix-node-release
...
Fix node release
2023-04-06 16:25:29 +02:00
a714aac6f6
revert changes
2023-04-06 14:07:46 +00:00
ceb73dbd29
publish npm
2023-04-06 13:35:29 +00:00
42b110587c
Fix conda release ( #1211 )
...
* runs-on: ubuntu-latest
* update jobs
* remove uploads
* upload conda package, remove python-release CI
* revert changes
* revert whitespace removal
2023-04-06 12:30:14 +02:00
fbd8d6188e
update for testing
2023-04-06 10:29:42 +00:00
37372b67fa
Merge pull request #1207 from huggingface/v0.13.3
...
New release
2023-04-05 09:58:19 +02:00
ce244bd094
remove rc1
2023-04-04 16:19:42 +02:00
a05be6b8d1
Merge pull request #1205 from huggingface/new_version
...
New version 0.13.3
2023-04-04 15:03:38 +02:00
1cb44bd180
New version 0.13.3
2023-04-04 14:14:17 +02:00
3aaf4946b3
Add `content` to Strip decoder to allow decoding mid tokens. ( #1199 )
...
* Add `content` to Strip decoder to allow decoding mid tokens.
* Stub.
* Clippy.
2023-03-24 10:14:49 +01:00
8a6a8dc9d5
Fixing decoder strip because of char boundaries. ( #1197 )
2023-03-24 01:57:39 +01:00
e4aea890d5
Adding 2 new decoders: ( #1196 )
...
* Adding 2 new decoders:
- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n char from left or right
Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.
- Note: Added a new dependency for better parsing of decoders.
This is needed because untagged enums can match anything; `MustBe`
ensures there is no ambiguity between Fuse and ByteFallback.
Since both are new, the chance of backward incompatibility is low.
* Fixing pickling/unpickling (using default args.).
* Stub.
* Black.
* Fixing node.
2023-03-24 00:50:54 +01:00
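The decoder chain named in this commit message, `Sequence(Replace("_", " "), Fuse(), Strip(1, 0))`, can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper names `replace`, `fuse`, `strip` and `sequence` are hypothetical stand-ins, and the semantics are assumed from the commit message alone (Fuse concatenates all tokens into one string, Strip removes n characters from the left or right).

```python
from typing import Callable, List

# A decoder step maps a list of token strings to a new list of token strings.
DecoderStep = Callable[[List[str]], List[str]]


def replace(pattern: str, content: str) -> DecoderStep:
    """Hypothetical stand-in: replace `pattern` inside every token."""
    return lambda tokens: [t.replace(pattern, content) for t in tokens]


def fuse() -> DecoderStep:
    """Hypothetical stand-in: concatenate all tokens into a single string."""
    return lambda tokens: ["".join(tokens)]


def strip(left: int, right: int) -> DecoderStep:
    """Hypothetical stand-in: drop `left` leading and `right` trailing characters per token."""
    def step(tokens: List[str]) -> List[str]:
        # max() keeps the slice empty instead of wrapping around on short tokens.
        return [t[left:max(left, len(t) - right)] for t in tokens]
    return step


def sequence(*steps: DecoderStep) -> Callable[[List[str]], str]:
    """Apply each step in order, then join whatever tokens remain."""
    def decode(tokens: List[str]) -> str:
        for step in steps:
            tokens = step(tokens)
        return "".join(tokens)
    return decode


tokens = ["_Hello", "_world"]

with_fuse = sequence(replace("_", " "), fuse(), strip(1, 0))
without_fuse = sequence(replace("_", " "), strip(1, 0))

print(with_fuse(tokens))     # Hello world
print(without_fuse(tokens))  # Helloworld
```

The second call shows why Fuse sits before Strip in the chain: without it, Strip acts on every token separately and removes the inner spaces as well as the leading one.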
d2c8190a0f
Creating `normalizers.Prepend` (To be used instead of `Metaspace`). ( #1194 )
...
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).
* Linting + stub.
* Fixing pickling/unpickling by setting a default.
* Black.
2023-03-24 00:33:31 +01:00
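As a rough illustration of how a prepend step combined with a replace step can stand in for `Metaspace`-style preprocessing, here is a minimal pure-Python sketch. The helpers `prepend`, `replace` and `normalize` are hypothetical, and the ordering (prepend first, then replace spaces) is an assumption made for illustration rather than something this commit states.

```python
from typing import Callable, List

# A normalizer step maps one string to another.
NormalizerStep = Callable[[str], str]


def prepend(prefix: str) -> NormalizerStep:
    """Hypothetical stand-in: put `prefix` in front of the input text."""
    return lambda text: prefix + text


def replace(pattern: str, content: str) -> NormalizerStep:
    """Hypothetical stand-in: replace every occurrence of `pattern`."""
    return lambda text: text.replace(pattern, content)


def normalize(text: str, steps: List[NormalizerStep]) -> str:
    """Run the steps in order over the text."""
    for step in steps:
        text = step(text)
    return text


# Assumed ordering: mark the start of the text, then turn spaces into markers.
steps = [prepend("_"), replace(" ", "_")]
normalized = normalize("Hello world", steps)
print(normalized)  # _Hello_world
```

After these two steps every word carries a leading marker, including the first one, which is the property a later decoder needs in order to undo the transformation.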
250d46c676
Adding `Replace` to decoder (to undo the Replace Normalizer for ( #1195 )
...
Metaspace split).
2023-03-23 23:43:47 +01:00
178e294a6a
Merge pull request #1192 from huggingface/faster-datasets-train-example
...
Faster `datasets` train example
2023-03-23 17:19:05 +01:00
73637a0004
Adding ByteFallback support for `tokenizers`. ( #1183 )
...
* Adding ByteFallback support for `tokenizers`.
Two items added:
- A flag `byte_fallback` for the `BPE` model. This will be in charge
of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
back into string whenever possible. Showing � when the byte decoding
fails (behavior checked against LlamaTokenizer in `transformers`).
* Update rustdoc.
* Clippy + Add BPE(byte_fallback) into bindings.
* Stupid file.
* Test artifacts removed.
* Update stub.
* Fix.
* Bad file.
* CRITICAL FIX: wrapper order because of untagged....
* Remove prints.
* Fixing <16 byte fallback.
2023-03-23 16:04:32 +01:00
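The two halves described in this commit, emitting `<0x61>`-style tokens for unknown input and turning them back into text, can be sketched in pure Python. This is an illustration, not the library's code: the function names and the `vocab` set are hypothetical, and emitting one � per byte of an undecodable run is an assumption.

```python
import re
from typing import List, Set

# Byte tokens look like <0x61>: exactly two hex digits, so bytes below 16 are zero-padded.
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def encode_with_fallback(chars: str, vocab: Set[str]) -> List[str]:
    """Keep known characters; spell unknown ones as one token per UTF-8 byte."""
    tokens: List[str] = []
    for ch in chars:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def _flush(pending: bytearray, out: List[str]) -> None:
    """Decode a run of collected bytes, or emit one replacement char per byte (assumed)."""
    if not pending:
        return
    try:
        out.append(pending.decode("utf-8"))
    except UnicodeDecodeError:
        out.extend("\ufffd" for _ in pending)
    pending.clear()


def decode_with_fallback(tokens: List[str]) -> str:
    """Merge consecutive byte tokens back into text; pass other tokens through."""
    out: List[str] = []
    pending = bytearray()
    for token in tokens:
        match = BYTE_TOKEN.match(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            _flush(pending, out)
            out.append(token)
    _flush(pending, out)
    return "".join(out)


encoded = encode_with_fallback("ab\n", vocab={"b"})
print(encoded)                        # ['<0x61>', 'b', '<0x0A>']
print(repr(decode_with_fallback(encoded)))  # 'ab\n'
```

A lone continuation or lead byte, such as `["<0xE5>"]`, cannot be decoded as UTF-8 and comes back as �, matching the failure behavior the commit describes.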
e76f900bc0
Faster `datasets` train example
...
Using .iter() is much faster than accessing using row ids
2023-03-23 11:24:30 +01:00
b8fbea00a9
Bump dirs from 3.0 to 4.0 ( #1142 )
2023-03-21 10:32:02 +01:00
5ecd329503
Fixing infinite loop in UnigramTrainer. ( #1182 )
...
* Fixing infinite loop in UnigramTrainer.
* Newer clippy.
2023-03-15 14:59:01 +01:00
9c0e700212
Bump webpack in /tokenizers/examples/unstable_wasm/www ( #1181 )
...
Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.76.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.75.0...v5.76.0)
---
updated-dependencies:
- dependency-name: webpack
dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-03-15 10:54:26 +01:00
5c18ec5ff5
pyo3 v0.18 migration ( #1173 )
...
* pyo3 v0.18 migration
* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
3138657565
Using clippy 1.67 ( #1167 )
2023-03-02 12:28:39 +01:00
ac552ff8b9
Update model.rs ( #1166 )
2023-02-28 17:35:57 +01:00
fa66caf0ab
Improved version. ( #1154 )
...
* Improved version.
* Clippy.
2023-01-23 16:35:19 +01:00
d09241fba1
Prevent using `from_pretrained` on invalid ids (better error message). ( #1153 )
2023-01-23 15:38:14 +01:00
b861d48b06
Making `Tokenizer` clone. ( #1152 )
2023-01-23 10:12:35 +01:00
1fcd90b0b7
Update info on environment variable for threading ( #1150 )
...
* Update env var name for threading
* Update env var name for threading
2023-01-22 21:24:41 +01:00
33a57e6418
Made dirs optional ( #1148 )
2023-01-18 09:29:15 +01:00
daf8aebd76
Adding python 3.8 for M1 ( #1147 )
2023-01-16 16:40:46 +01:00
5a94a2b6e7
Add missing build targets ( #1145 )
...
* M1 3.11 was not out neither windows amd64.
* python@v4.
* Actually upload.
* Update needs.
* Preparing the actual PR.
2023-01-15 10:18:08 +01:00
fe4ae7dc38
Bump json5 from 2.2.0 to 2.2.3 in /bindings/node ( #1140 )
...
Bumps [json5](https://github.com/json5/json5) from 2.2.0 to 2.2.3.
- [Release notes](https://github.com/json5/json5/releases)
- [Changelog](https://github.com/json5/json5/blob/main/CHANGELOG.md)
- [Commits](https://github.com/json5/json5/compare/v2.2.0...v2.2.3)
---
updated-dependencies:
- dependency-name: json5
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-03 11:50:51 +01:00
c3fedd96b3
Bump json5, copy-webpack-plugin, webpack and webpack-cli ( #1139 )
...
Removes [json5](https://github.com/json5/json5). It's no longer used after updating ancestor dependencies [json5](https://github.com/json5/json5), [copy-webpack-plugin](https://github.com/webpack-contrib/copy-webpack-plugin), [webpack](https://github.com/webpack/webpack) and [webpack-cli](https://github.com/webpack/webpack-cli). These dependencies need to be updated together.
Removes `json5`
Updates `copy-webpack-plugin` from 5.1.2 to 11.0.0
- [Release notes](https://github.com/webpack-contrib/copy-webpack-plugin/releases)
- [Changelog](https://github.com/webpack-contrib/copy-webpack-plugin/blob/master/CHANGELOG.md)
- [Commits](https://github.com/webpack-contrib/copy-webpack-plugin/compare/v5.1.2...v11.0.0)
Updates `webpack` from 4.46.0 to 5.75.0
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v4.46.0...v5.75.0)
Updates `webpack-cli` from 3.3.12 to 5.0.1
- [Release notes](https://github.com/webpack/webpack-cli/releases)
- [Changelog](https://github.com/webpack/webpack-cli/blob/master/CHANGELOG.md)
- [Commits](https://github.com/webpack/webpack-cli/compare/v3.3.12...webpack-cli@5.0.1)
---
updated-dependencies:
- dependency-name: json5
dependency-type: indirect
- dependency-name: copy-webpack-plugin
dependency-type: direct:development
- dependency-name: webpack
dependency-type: direct:development
- dependency-name: webpack-cli
dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-03 10:22:49 +01:00
9b155b5723
[FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. ( #1136 )
...
* [fix] Use unk_token
In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used.
* [fix] If unk_token is None, this case is also considered.
* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* [FIX] In CharBPETokenizer, Use unk_token.
In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used.
* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-12-27 11:13:52 +01:00
60a00dda44
Fix one char super tiny typo ( #1137 )
...
* Update pipeline.mdx
* Update pipeline.rst
2022-12-26 11:13:38 +01:00
4d520c9664
Ignore Cargo.lock for subfolders ( #1131 )
2022-12-25 11:35:47 +01:00
fbad581128
Bump derive_builder from 0.9 to 0.12 ( #1129 )
2022-12-23 23:37:16 +01:00
2bed678958
Fix broken links in docs ( #1133 )
2022-12-23 23:35:18 +01:00
3e7476de86
Wrap rustdoc html entity in code block ( #1130 )
2022-12-23 23:30:45 +01:00
03ce27d2fa
Bump cached-path from 0.5 to 0.6 ( #1127 )
2022-12-21 18:10:48 +01:00
5886179eee
Bump decode-uri-component in /tokenizers/examples/unstable_wasm/www ( #1125 )
...
Bumps [decode-uri-component](https://github.com/SamVerschueren/decode-uri-component) from 0.2.0 to 0.2.2.
- [Release notes](https://github.com/SamVerschueren/decode-uri-component/releases)
- [Commits](https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.2)
---
updated-dependencies:
- dependency-name: decode-uri-component
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-19 14:24:24 +01:00
a408b44429
Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node ( #1126 )
...
Bumps [minimatch](https://github.com/isaacs/minimatch) from 3.0.4 to 3.1.2.
- [Release notes](https://github.com/isaacs/minimatch/releases)
- [Changelog](https://github.com/isaacs/minimatch/blob/main/changelog.md)
- [Commits](https://github.com/isaacs/minimatch/compare/v3.0.4...v3.1.2)
---
updated-dependencies:
- dependency-name: minimatch
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-19 14:09:24 +01:00
bfa842e063
Adding stale bot ? ( #1123 )
...
* Adding stale bot ?
* Clippy.
2022-12-19 13:50:48 +01:00
1649d74536
Fixing conda ssl location ( #1124 )
...
* Fixing conda build ?
* Reduce the scope to speedup testing.
* Reduce more.
* Trying to link to conda lib.
* Trying to enable `pkg-config` on the conda env.
* Really publish.
* Update conda builds.
* Remove 3.11
* Putting releases back onto release track.
2022-12-19 13:50:36 +01:00
9a25b2cb8e
[FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. ( #1120 )
...
* [fix] Use unk_token
In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used.
* [fix] If unk_token is None, this case is also considered.
* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-12-19 13:40:04 +01:00
102dfe87a3
Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node ( #1116 )
...
Bumps [decode-uri-component](https://github.com/SamVerschueren/decode-uri-component) from 0.2.0 to 0.2.2.
- [Release notes](https://github.com/SamVerschueren/decode-uri-component/releases)
- [Commits](https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.2)
---
updated-dependencies:
- dependency-name: decode-uri-component
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-05 18:09:38 +01:00
67080e163a
Include license file in Rust crate ( #1115 )
...
* Include license file in Rust crate
* Ignore security warning.
* Also for python.
* Upgrading ubuntu version.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-11-30 23:17:56 +01:00
c74e9e62f6
Bump loader-utils in /tokenizers/examples/unstable_wasm/www ( #1108 )
...
Bumps [loader-utils](https://github.com/webpack/loader-utils) from 1.4.0 to 1.4.2.
- [Release notes](https://github.com/webpack/loader-utils/releases)
- [Changelog](https://github.com/webpack/loader-utils/blob/v1.4.2/CHANGELOG.md)
- [Commits](https://github.com/webpack/loader-utils/compare/v1.4.0...v1.4.2)
---
updated-dependencies:
- dependency-name: loader-utils
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-16 12:01:25 +01:00
e9529cb02f
Merge pull request #1107 from huggingface/revert-1101-update_doc_pr_actions
...
Revert "Update pr docs actions"
2022-11-16 11:41:51 +01:00
ffcf5a4136
Revert "Update pr docs actions ( #1101 )"
...
This reverts commit 99c06c82e0.
2022-11-16 11:41:38 +01:00
bbae829a72
Adding rust audit. ( #1099 )
...
* Adding rust audit.
* Update clap version + derive_builder (they clashed).
* Ignoring specific CVE which can be ignored
https://github.com/Azure/iot-identity-service/issues/481
* Updating python lock.
* Revert `derive-builder` update.
* Adding back help msg.
2022-11-09 12:59:36 +01:00