Commit Graph

1672 Commits

SHA1 Message Date
348ed70e58 [doc build] Use secrets (#1273) 2023-06-09 12:58:27 +02:00
5d70f15bfb Update README.md - Broken link (#1272)
* Update README.md - Broken link

fixed "python documentation" link

* Update README.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-08 10:20:11 +02:00
f85e8467e4 Update Cargo.toml (#1266)
`cargo update` yields the following:
Updating regex v1.8.1 -> v1.8.3
Updating regex-syntax v0.7.1 -> v0.7.2
2023-06-07 09:57:18 +02:00
cb8d4de599 fix documentation regarding regex (#1264)
* fix documentation regarding regex

Split() in pre_tokenizers.rs and the normalizers take a regex that must be built with the tokenizer-specific regex module.
Clarify this in the documentation (a short example follows this commit's change list).

* Update __init__.pyi

fixed __init__.pyi

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 6e8bdfcddf67bcdd8e3b1a78685fd5ef8f6a153c.

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 897b0c0de471ad7cb6269b8456347c4e5cff2aaf.

* Revert "Update __init__.pyi"

This reverts commit fbe82310b7728ee7cdb6f8b38fbc2388f9d95771.

* add codeblocks the right way

* add codeblocks with stub.py

Ran `setup.py install` to build, and then ran `stub.py`.
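
To illustrate the requirement being documented, a minimal sketch against the Python binding (pattern and input are made up):

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# A plain str pattern is matched literally; an actual pattern must be
# wrapped in tokenizers.Regex -- a pattern compiled with Python's `re`
# module is not accepted.
splitter = Split(Regex(r"\s+"), behavior="removed")
print(splitter.pre_tokenize_str("hello   world"))
# [('hello', (0, 5)), ('world', (8, 13))]
```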

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-07 09:41:28 +02:00
c7102c4c0f Fixing broken link. (#1268) 2023-06-06 11:10:28 +02:00
cb819724ef Update trainer.rs (#1257)
Implement skipping of empty sentences, as sentencepiece does; refer to:
https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L373
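
A minimal sketch of the idea in Python (names are illustrative; the actual change is in the Rust unigram trainer):

```python
def load_sentences(corpus):
    # Skip empty sentences so they never reach frequency counting,
    # mirroring the linked sentencepiece check.
    return [(sentence, count) for sentence, count in corpus if sentence]
```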
2023-05-25 12:24:29 +02:00
fc76ad4f07 Parallelize unigram trainer (#976)
* Parallelize unigram trainer

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>

* Rm unused lifetime

---------

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
2023-05-22 15:36:03 +02:00
a03330607b Update all GH Actions with dependency on actions/checkout from v[1,2] to v3 to notably improve performance (retrieve only the commit being checked-out) (#1256) 2023-05-22 14:50:00 +02:00
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content (a usage sketch follows the change list).

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
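
The change itself is internal to the Rust crate and the Node binding (the title's "borrowed content"); for orientation, the user-facing pair looks like this through the Python binding (model name is only an example):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # example model
ids = tokenizer.encode("Hello, world!").ids
print(tokenizer.decode(ids))          # "hello, world!"
print(tokenizer.decode_batch([ids]))  # ["hello, world!"]
```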

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Documentation describing the behavior and how it should be set.
Set default == 0 so it doesn't act unless set.
Provide ways in the Python binding for the user to set the max token length (a usage sketch follows this change list).

I was trying to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

Clarify with type usize; propagate the max_length option.

* change max_length into a more descriptive name

In the documentation
https://huggingface.co/docs/tokenizers/api/trainers
UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

Add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer

Add an explanation for the variable shadowing.

* Update trainer.rs

Move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.
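
A usage sketch of the resulting option, assuming the Python binding created above exposes it as `max_token_length` (corpus and sizes are made up):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# max_token_length refuses merges that would produce tokens longer than
# the limit; 16 is SentencePiece's default for the analogous setting.
trainer = BpeTrainer(vocab_size=1000, max_token_length=16)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.train_from_iterator(["a tiny training corpus ..."], trainer=trainer)
assert all(len(token) <= 16 for token in tokenizer.get_vocab())
```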

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
daf3fcc976 Revert main hiccup. 2023-05-15 18:01:29 +02:00
b58227c7f1 Never gonna make you cry 2023-05-12 16:28:57 +02:00
02ad59edc1 Never gonna run around and desert you 2023-05-12 16:27:06 +02:00
8d07696c38 Never gonna let you down 2023-05-12 16:24:26 +02:00
4518b0f7f2 fix unigram.rs test_sample() (#1244)
87230bb59b/tokenizers/tests/unigram.rs (LL71C1-L71C53)

When running cargo test --release, the above line causes an error.

referring to 87230bb59b/tokenizers/src/models/unigram/lattice.rs (L138)

It seems that Lattice::from should take only 3 arguments.
If I had to guess, it should be Lattice::from("ABC", 0, 2).
This change makes `cargo test --release` pass without error.
2023-05-10 17:04:34 +02:00
87230bb59b use LTO for release and benchmark builds (#1157) 2023-05-09 16:15:57 +02:00
15085ef905 Fixing padding_left sequence_ids. (#1233) 2023-05-04 15:57:20 +02:00
ef5f50605d Printing warning to stderr. (#1222) 2023-04-19 14:55:24 +02:00
d19bc63c67 Merge pull request #1212 from huggingface/fix-node-release
Fix node release
2023-04-06 16:25:29 +02:00
a714aac6f6 revert changes 2023-04-06 14:07:46 +00:00
ceb73dbd29 publish npm 2023-04-06 13:35:29 +00:00
42b110587c Fix conda release (#1211)
* runs-on: ubuntu-latest

* update jobs

* remove uploads

* upload conda package, remove python-release CI

* revert changes

* revert whitespace removal
2023-04-06 12:30:14 +02:00
fbd8d6188e update for testing 2023-04-06 10:29:42 +00:00
37372b67fa Merge pull request #1207 from huggingface/v0.13.3
New release
2023-04-05 09:58:19 +02:00
ce244bd094 remove rc1 2023-04-04 16:19:42 +02:00
a05be6b8d1 Merge pull request #1205 from huggingface/new_version
New version 0.13.3
2023-04-04 15:03:38 +02:00
1cb44bd180 New version 0.13.3 2023-04-04 14:14:17 +02:00
3aaf4946b3 Add content to Strip decoder to allow decoding mid tokens. (#1199)
* Add `content` to Strip decoder to allow decoding mid tokens.

* Stub.

* Clippy.
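
With `content` exposed, Strip can trim an arbitrary character instead of only whitespace; a small sketch via the Python binding:

```python
from tokenizers import decoders

# Remove one leading "_" from each token; tokens are then joined as-is.
decoder = decoders.Strip(content="_", left=1, right=0)
print(decoder.decode(["_Hello", "_world"]))  # "Helloworld"
```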
2023-03-24 10:14:49 +01:00
8a6a8dc9d5 Fixing decoder strip because of char boundaries. (#1197) 2023-03-24 01:57:39 +01:00
e4aea890d5 Adding 2 new decoders: (#1196)
* Adding 2 new decoders:

- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n chars from the left or right

Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing (a sketch follows the change list below).

- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no confusion between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.

* Fixing pickling/unpickling (using default args.).

* Stub.

* Black.

* Fixing node.
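
A sketch of the combination described above via the Python binding, using the actual Metaspace character "▁" rather than "_":

```python
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.Replace("▁", " "),  # undo the Metaspace replacement
    decoders.Fuse(),             # concatenate all tokens into one string
    decoders.Strip(content=" ", left=1, right=0),  # drop the leading space
])
print(decoder.decode(["▁Hello", "▁world"]))  # "Hello world"
```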
2023-03-24 00:50:54 +01:00
d2c8190a0f Creating normalizers.Prepend (To be used instead of Metaspace). (#1194)
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).

* Linting + stub.

* Fixing pickling/unpickling by setting a default.

* Black.
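
A minimal sketch of the new normalizer via the Python binding:

```python
from tokenizers import normalizers

# Prepend adds a fixed prefix during normalization, e.g. the "▁" that
# Metaspace would otherwise introduce.
normalizer = normalizers.Prepend("▁")
print(normalizer.normalize_str("Hello"))  # "▁Hello"
```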
2023-03-24 00:33:31 +01:00
250d46c676 Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). (#1195)
2023-03-23 23:43:47 +01:00
178e294a6a Merge pull request #1192 from huggingface/faster-datasets-train-example
Faster `datasets` train example
2023-03-23 17:19:05 +01:00
73637a0004 Adding ByteFallback support for tokenizers. (#1183)
* Adding ByteFallback support for `tokenizers`.

Two items added:

- A flag `byte_fallback` for the `BPE` model. This will be in charge
  of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
  back into a string whenever possible, showing � when the byte decoding
  fails (behavior checked against LlamaTokenizer in `transformers`).
  A decoder sketch follows the change list below.

* Update rustdoc.

* Clippy + Add BPE(byte_fallback) into bindings.

* Stupid file.

* Test artifacts removed.

* Update stub.

* Fix.

* Bad file.

* CRITICAL FIX: wrapper order because of untagged....

* Remove prints.

* Fixing <16 byte fallback.
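
A sketch of the decoder side via the Python binding:

```python
from tokenizers import decoders

# Model side, for reference: BPE(byte_fallback=True) emits <0xXX> tokens
# instead of unk. The decoder turns those byte tokens back into text.
decoder = decoders.ByteFallback()
print(decoder.decode(["<0x61>", "<0x62>", "c"]))  # "abc"
print(decoder.decode(["<0xE2>"]))                 # "�" (invalid UTF-8)
```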
2023-03-23 16:04:32 +01:00
e76f900bc0 Faster datasets train example
Using `.iter()` is much faster than accessing rows by id.
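
A sketch of the pattern, assuming a recent `datasets` with `Dataset.iter` (dataset name is only an example):

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, trainers

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    # .iter() streams batches; indexing dataset[i] row by row is far slower.
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

tokenizer = Tokenizer(models.BPE())
tokenizer.train_from_iterator(batch_iterator(), trainer=trainers.BpeTrainer())
```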
2023-03-23 11:24:30 +01:00
b8fbea00a9 Bump dirs from 3.0 to 4.0 (#1142) 2023-03-21 10:32:02 +01:00
5ecd329503 Fixing infinite loop in UnigramTrainer. (#1182)
* Fixing infinite loop in UnigramTrainer.

* Newer clippy.
2023-03-15 14:59:01 +01:00
9c0e700212 Bump webpack in /tokenizers/examples/unstable_wasm/www (#1181)
Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.76.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.75.0...v5.76.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-03-15 10:54:26 +01:00
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues reported by black
2023-03-08 11:27:47 +01:00
3138657565 Using clippy 1.67 (#1167) 2023-03-02 12:28:39 +01:00
ac552ff8b9 Update model.rs (#1166) 2023-02-28 17:35:57 +01:00
fa66caf0ab Improved version. (#1154)
* Improved version.

* Clippy.
2023-01-23 16:35:19 +01:00
d09241fba1 Prevent using from_pretrained on invalid ids (better error message). (#1153) 2023-01-23 15:38:14 +01:00
b861d48b06 Making Tokenizer clone. (#1152) 2023-01-23 10:12:35 +01:00
1fcd90b0b7 Update info on environment variable for threading (#1150)
* Update env var name for threading

* Update env var name for threading
2023-01-22 21:24:41 +01:00
33a57e6418 Made dirs optional (#1148) 2023-01-18 09:29:15 +01:00
daf8aebd76 Adding python 3.8 for M1 (#1147) 2023-01-16 16:40:46 +01:00
5a94a2b6e7 Add missing build targets (#1145)
* M1 3.11 was not out; neither was windows amd64.

* python@v4.

* Actually upload.

* Update needs.

* Preparing the actual PR.
2023-01-15 10:18:08 +01:00
fe4ae7dc38 Bump json5 from 2.2.0 to 2.2.3 in /bindings/node (#1140)
Bumps [json5](https://github.com/json5/json5) from 2.2.0 to 2.2.3.
- [Release notes](https://github.com/json5/json5/releases)
- [Changelog](https://github.com/json5/json5/blob/main/CHANGELOG.md)
- [Commits](https://github.com/json5/json5/compare/v2.2.0...v2.2.3)

---
updated-dependencies:
- dependency-name: json5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-03 11:50:51 +01:00
c3fedd96b3 Bump json5, copy-webpack-plugin, webpack and webpack-cli (#1139)
Removes [json5](https://github.com/json5/json5). It's no longer used after updating ancestor dependencies [json5](https://github.com/json5/json5), [copy-webpack-plugin](https://github.com/webpack-contrib/copy-webpack-plugin), [webpack](https://github.com/webpack/webpack) and [webpack-cli](https://github.com/webpack/webpack-cli). These dependencies need to be updated together.


Removes `json5`

Updates `copy-webpack-plugin` from 5.1.2 to 11.0.0
- [Release notes](https://github.com/webpack-contrib/copy-webpack-plugin/releases)
- [Changelog](https://github.com/webpack-contrib/copy-webpack-plugin/blob/master/CHANGELOG.md)
- [Commits](https://github.com/webpack-contrib/copy-webpack-plugin/compare/v5.1.2...v11.0.0)

Updates `webpack` from 4.46.0 to 5.75.0
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v4.46.0...v5.75.0)

Updates `webpack-cli` from 3.3.12 to 5.0.1
- [Release notes](https://github.com/webpack/webpack-cli/releases)
- [Changelog](https://github.com/webpack/webpack-cli/blob/master/CHANGELOG.md)
- [Commits](https://github.com/webpack/webpack-cli/compare/v3.3.12...webpack-cli@5.0.1)

---
updated-dependencies:
- dependency-name: json5
  dependency-type: indirect
- dependency-name: copy-webpack-plugin
  dependency-type: direct:development
- dependency-name: webpack
  dependency-type: direct:development
- dependency-name: webpack-cli
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-03 10:22:49 +01:00