363 Commits

Author SHA1 Message Date
f1faec1756 Fix typos in strings and comments (#1770) 2025-05-27 08:17:36 +02:00
4383a25787 Update the release builds following 0.21.1. (#1746)
* Update the release builds following 0.21.1.

* Clippy fix.
2025-03-13 13:01:41 +01:00
24d29f498d Update dev version and pyproject.toml (#1693)
* update pyproject.toml

* update py dev version
2024-11-27 16:01:48 +01:00
1bf2a66b80 v0.20.4-dev0 2024-11-27 10:07:49 +01:00
eb4cc86d4e Bump cross-spawn from 6.0.5 to 6.0.6 in /bindings/node (#1687)
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 6.0.5 to 6.0.6.
- [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/v6.0.6/CHANGELOG.md)
- [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v6.0.5...v6.0.6)

---
updated-dependencies:
- dependency-name: cross-spawn
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-25 10:04:06 +01:00
7c36735389 v0.20.2-dev.0 version 2024-11-04 18:36:40 +01:00
51826532d4 push new dev version 2024-10-10 12:00:16 +02:00
81c471cf17 update dev version 0.20.0 2024-08-08 18:11:50 +02:00
bded212356 Support None to reset pre_tokenizers and normalizers, and index sequences (#1590)
* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* add a small test

* styling

* style merge

* fix merge test

* fmt

* nits

* update tset
2024-08-07 12:52:35 +02:00
71c2a8d01a update dev version so 0.19.1 2024-04-17 23:17:12 +02:00
949d9e3e0e Bumping all versions 3 times (ty transformers :) ) (#1498) 2024-04-16 15:58:36 +02:00
6e58f838b3 version = "0.16.0-dev.0" 2024-04-02 09:51:14 +02:00
09069717e9 Refactor metaspace (#1476)
* version = "0.15.3-dev-0”

Improve performances of meta space, but also just fix it.

(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]

* well what do we have

* nit

* be BC with non legacy

* unrelated change for clippy

* fix test

* splitting is a must for word_ids

* fmt and lint

* Fixing everything (hopefully better).

* Fixing node.

* Including yarn.lock

* Lint.

* Stubs.

* revert to use split

* fix merge issues

* fix tests

* finish fixing tests

* ruff

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-30 10:27:24 +01:00
d8c4388166 Bump ip from 2.0.0 to 2.0.1 in /bindings/node (#1456)
Bumps [ip](https://github.com/indutny/node-ip) from 2.0.0 to 2.0.1.
- [Commits](https://github.com/indutny/node-ip/compare/v2.0.0...v2.0.1)

---
updated-dependencies:
- dependency-name: ip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 11:29:36 +01:00
7f49f20ab0 version = "0.15.3-dev-0” 2024-02-12 09:48:00 +09:00
8f73fe9515 update dev version to 0.15.2-dev.0 2024-01-22 15:34:57 +01:00
e3bcef288b udpate to version = "0.15.1-dev0" (#1390)
* Apply suggestions from code review
2023-11-15 13:30:58 +01:00
c718c53bb9 Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node (#1370)
Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse) from 7.22.11 to 7.23.2.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse)

---
updated-dependencies:
- dependency-name: "@babel/traverse"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-25 08:14:32 +02:00
4322056e6e Preparing release. (#1355)
* Preparing release.

* Fix new clippy
2023-10-06 12:56:36 +02:00
2c565e42c7 update package version for dev (#1339) 2023-09-07 16:19:24 +02:00
d2010d5165 Move to maturing mimicking move for safetensors. + Rewritten node bindings. (#1331)
* Move to maturing mimicking move for `safetensors`.

* Tmp.

* Fix sdist.

* Wat?

* Clippy 1.72

* Remove if.

* Conda sed.

* Fix doc check workflow.

* Moving to maturin AND removing http + openssl mess (smoothing transition
moving to `huggingface_hub`)

* Fix dep

* Black.

* New node bindings.

* Fix docs + node cache ?

* Yarn.

* Working dir.

* Extension module.

* Put back interpreter.

* Remove cache.

* New attempt

* Multi python.

* Remove FromPretrained.

* Remove traces of `fromPretrained`.

* Drop 3.12 for windows?

* Typo.

* Put back the default feature for ignoring links during simple test.

* Fix ?

* x86_64 -> x64.

* Remove warning for windows bindings.

* Excluse aarch.

* Include/exclude.

* Put back workflows in correct states.
2023-08-28 16:24:14 +02:00
fb292d1eae 0.13.4.rc1 (#1319) 2023-08-14 12:06:43 +02:00
862046ac94 CD backports (#1318)
* CD backports

follow
huggingface/safetensors#317

* fix node bindings?

`cargo check` doesnt work on my local configuration from `tokenizers/bindings/node/native`
i don't think it will be a problem but i have difficulty telling

* backport #315

* safetensors#317 back ports
2023-08-10 18:52:22 +02:00
ea4d3f634c Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node (#1299)
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-21 08:08:10 +02:00
291b2e23ae Fixing clippy warnings on 1.71. (#1296)
* Fixing clippy warnings on 1.71.

* Fix.

* Fmt.

* Python clippy.

* Should really set my env back again.

* Fix.
2023-07-16 15:58:38 +02:00
92bfb9c993 Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node (#1291)
Bumps [tough-cookie](https://github.com/salesforce/tough-cookie) from 4.0.0 to 4.1.3.
- [Release notes](https://github.com/salesforce/tough-cookie/releases)
- [Changelog](https://github.com/salesforce/tough-cookie/blob/master/CHANGELOG.md)
- [Commits](https://github.com/salesforce/tough-cookie/compare/v4.0.0...v4.1.3)

---
updated-dependencies:
- dependency-name: tough-cookie
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-10 09:44:31 +02:00
864135bef1 Add unigram bytefallback (#1217)
* current updates will go red

* cargo fmt

* npm install

* refactor train for unigram to allow bytefallbakc (breaking)

* fmt

* nits

* update

* add a proper test

* fix encode optimised fallback + add trainer arg

* fixes

* fixes

* fix tests

* add test

* fmt

* fix rust test

* update python bindings

* update

* pub is okay and needed

* more fix

* cleanup

* remove useles id

* MissingUnkId error

* nits

* fix offset

* add a test in python

* update src bindings

* remove bytefallback from trainer

* styling

* update pckg

* lint

* fmt

* stup with dev

* update code based on review

* remove unused function

* udpate python test to compare ids

* fix option bool issues

* final fix

* clippy

* fix npm isntall

* update

* update test

* more in depth testing

* Lint

* last attempt to fix node

* update node bindings

* fmt

* Update tokenizers/src/models/unigram/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update based on review

* simpler test

* lint

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-06-26 10:46:59 +02:00
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content.

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way.

If this code were to be actually integrated some works to be done

Documentation describing the behavior and how it should be set.
Set default==0 so it doesnt act unless set
provide ways in the python binding for the user to set max token length

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus.

* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way.

If this code were to be actually integrated some works to be done

Documentation describing the behavior and how it should be set.
Set default==0 so it doesnt act unless set
provide ways in the python binding for the user to set max token length

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

clarify with type usize propagate max_length option

* change max_length into more descriptive name

in the documentation
https://huggingface.co/docs/tokenizers/api/trainers
unigramtrainer uses max_piece_length for similar function.
since BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

add several max_token_length declaration that were missing.
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move default definition of max_token_length to proper location. adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
02ad59edc1 Never gonna run around and desert you 2023-05-12 16:27:06 +02:00
1cb44bd180 New version 0.13.3 2023-04-04 14:14:17 +02:00
3aaf4946b3 Add content to Strip decoder to allow decoding mid tokens. (#1199)
* Add `content` to Strip decoder to allow decoding mid tokens.

* Stub.

* Clippy.
2023-03-24 10:14:49 +01:00
e4aea890d5 Adding 2 new decoders: (#1196)
* Adding 2 new decoders:

- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n char from left or right

Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.

- Note: Added a new dependency from better parsing of decoders.
This is due to untagged enums which can match anything the `MustBe`
ensure there's no issue between Fuse and ByteFallback.
Since both are new the chances for backward incompatibility is low.

* Fixing picking/unpickling (using default args.).

* Stub.

* Black.

* Fixing node.
2023-03-24 00:50:54 +01:00
d2c8190a0f Creating normalizers.Prepend (To be used instead of Metaspace). (#1194)
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).

* Linting + stub.

* Fixing pickling/unpickling by setting a default.

* Black.
2023-03-24 00:33:31 +01:00
250d46c676 Adding Replace to decoder (to undo the Replace Normalizer for (#1195)
Metaspace split).
2023-03-23 23:43:47 +01:00
73637a0004 Adding ByteFallback support for tokenizers. (#1183)
* Adding ByteFallback support for `tokenizers`.

Two items added:

- A flag `byte_fallback` for the `BPE` model. This will be in charge
  of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
  back into string whenever possible. Showing � when the byte decoding
  fails (behavior checked against LlamaTokenizer in `transformers`.

* Update rustdoc.

* Clippy + Add BPE(byte_fallback) into bindings.

* Stupid file.

* Test artifacts removed.

* Update stub.

* Fix.

* Bad file.

* CRITICAL FIX: wrapper order because of untagged....

* Remove prints.

* Fixing <16 byte fallback.
2023-03-23 16:04:32 +01:00
fe4ae7dc38 Bump json5 from 2.2.0 to 2.2.3 in /bindings/node (#1140)
Bumps [json5](https://github.com/json5/json5) from 2.2.0 to 2.2.3.
- [Release notes](https://github.com/json5/json5/releases)
- [Changelog](https://github.com/json5/json5/blob/main/CHANGELOG.md)
- [Commits](https://github.com/json5/json5/compare/v2.2.0...v2.2.3)

---
updated-dependencies:
- dependency-name: json5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-03 11:50:51 +01:00
4d520c9664 Ignore Cargo.lock for subfolders (#1131) 2022-12-25 11:35:47 +01:00
fbad581128 Bump derive_builder from 0.9 to 0.12 (#1129) 2022-12-23 23:37:16 +01:00
a408b44429 Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node (#1126)
Bumps [minimatch](https://github.com/isaacs/minimatch) from 3.0.4 to 3.1.2.
- [Release notes](https://github.com/isaacs/minimatch/releases)
- [Changelog](https://github.com/isaacs/minimatch/blob/main/changelog.md)
- [Commits](https://github.com/isaacs/minimatch/compare/v3.0.4...v3.1.2)

---
updated-dependencies:
- dependency-name: minimatch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-19 14:09:24 +01:00
102dfe87a3 Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node (#1116)
Bumps [decode-uri-component](https://github.com/SamVerschueren/decode-uri-component) from 0.2.0 to 0.2.2.
- [Release notes](https://github.com/SamVerschueren/decode-uri-component/releases)
- [Commits](https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.2)

---
updated-dependencies:
- dependency-name: decode-uri-component
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-05 18:09:38 +01:00
bbae829a72 Adding rust audit. (#1099)
* Adding rust audit.

* Update clap version + derive_builder (they clashed).

* Ignoring specific CVE which can be ignored

https://github.com/Azure/iot-identity-service/issues/481

* Updating python lock.

* Revert `derive-builder` update.

* Adding back help msg.
2022-11-09 12:59:36 +01:00
11bb2e00f2 Add python 3.11 to manylinux buildwheels (#1096)
* Add python 3.11 to manylinux buildwheels

* Fixing clippy.

* Node clippy.

* Python clippy.

* Changelog + version number update.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-11-07 08:45:04 +01:00
96a9e5715c New version. (#1082)
* New version.

The actual release will happen *before* PyO3 0.17.2 because
the tests were ran before than.

* Manylinux2014 necessary now with Rust 1.64.
2022-10-06 15:45:56 +02:00
6e5569a540 Moving versions numbers to dev mode. (#1067) 2022-09-22 18:24:07 +02:00
7bfab48979 Preparing rc1 release. (#1056)
* Preparing rc1 release.

* Fixing test_alignment_methods

* Fixing the overflowing sequence_id issue (LayoutLMv2 tests caught this).

* Adding overly complex overflowing test.
2022-09-12 16:07:06 +02:00
06025e4ca1 Adding Sequence for PostProcessor. (#1052)
* Adding `Sequence` for `PostProcessor`.

* Fixing node? Writing in the dark here, don't have Python2.7

* `undefined` is not accepted.

* Other test.
2022-08-25 14:50:06 +02:00
460bdded80 Modify Processor trait to support chaining. (#1054)
0 modifications yet, everything will consume the vector.
Every test should be green without any modifications.
2022-08-24 19:49:23 +02:00
b1c9bc68b5 Updating code according to clippy. (#1048)
- Adding `Eq` where possible
- Denied the ref deref warnings as it was spamming and solution not
  really better.
2022-08-24 19:45:15 +02:00
eb2213842b Update README.md (#1019)
* Update README.md

Add reference to normalizer blog post

* Update lib.rs

* Fixing PR + clippy on node.

* Update readme to match docstring.

* Other clippy warning.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-07-19 09:54:29 +02:00