Commit Graph

1703 Commits

Author SHA1 Message Date
2b72017e17 correctly compute the new id: we take the max of the AddedToken + get_vocab_size 2023-09-01 19:03:33 +00:00
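The id computation described in this commit can be sketched as follows; `next_added_token_id` is a hypothetical helper illustrating the stated rule (max of the existing AddedToken ids and `get_vocab_size`), not the library's actual function:

```python
from typing import List

def next_added_token_id(added_token_ids: List[int], vocab_size: int) -> int:
    # Hypothetical sketch of the rule in the commit message: the next id
    # is the larger of (max existing added-token id + 1) and the base
    # vocabulary size reported by get_vocab_size.
    if not added_token_ids:
        return vocab_size
    return max(max(added_token_ids) + 1, vocab_size)
```

This avoids id collisions whether added tokens sit above the base vocabulary or overlap it.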
db319492f7 clippy 2023-09-01 18:57:39 +00:00
2dca476810 fix some tests 2023-09-01 18:48:50 +00:00
6cca5716af fix one test? 2023-09-01 18:42:30 +00:00
345b4eba96 updates 2023-09-01 18:41:36 +00:00
8e522a38d9 Updating the docs with the new command. (#1333) 2023-08-29 13:15:26 +02:00
d2010d5165 Move to maturin, mimicking the move for safetensors. + Rewritten node bindings. (#1331)
* Move to maturin, mimicking the move for `safetensors`.

* Tmp.

* Fix sdist.

* Wat?

* Clippy 1.72

* Remove if.

* Conda sed.

* Fix doc check workflow.

* Moving to maturin AND removing http + openssl mess (smoothing the transition
to `huggingface_hub`)

* Fix dep

* Black.

* New node bindings.

* Fix docs + node cache ?

* Yarn.

* Working dir.

* Extension module.

* Put back interpreter.

* Remove cache.

* New attempt

* Multi python.

* Remove FromPretrained.

* Remove traces of `fromPretrained`.

* Drop 3.12 for windows?

* Typo.

* Put back the default feature for ignoring links during simple test.

* Fix ?

* x86_64 -> x64.

* Remove warning for windows bindings.

* Exclude aarch.

* Include/exclude.

* Put back workflows in correct states.
2023-08-28 16:24:14 +02:00
f2952020d5 Python 38 arm (#1330) 2023-08-23 16:29:16 +02:00
f08058ab2b Reduce number of different revisions by 1 (#1329) 2023-08-23 15:57:36 +02:00
6c350d88fe Re-using scripts from safetensors. (#1328) 2023-08-23 15:37:38 +02:00
d0bb35d5a6 Merge pull request #1316 from boyleconnor/add-expect-for-no-truncation
Add `expect()` for disabling truncation
2023-08-18 19:30:53 +02:00
540bf2eb01 pyo3: update to 0.19 (#1322)
* Bump pyo3 dependency versions

* Fix deprecation warnings from pyo3

---------

Co-authored-by: Mike Lui <mikelui@meta.com>
2023-08-16 18:40:32 +02:00
9a93c50c25 Fix stride condition. (#1321)
* Release all at once for simplicity.

* rc2
2023-08-14 15:27:55 +02:00
b35d33f981 Release all at once for simplicity. (#1320) 2023-08-14 13:49:45 +02:00
fb292d1eae 0.13.4.rc1 (#1319) 2023-08-14 12:06:43 +02:00
862046ac94 CD backports (#1318)
* CD backports

follow
huggingface/safetensors#317

* fix node bindings?

`cargo check` doesn't work on my local configuration from `tokenizers/bindings/node/native`.
I don't think it will be a problem, but I have difficulty telling.

* backport #315

* safetensors#317 back ports
2023-08-10 18:52:22 +02:00
748556a9ed Fix code style 2023-08-07 15:17:43 -07:00
d47d3e377c Derive clone for TrainerWrapper (#1317) 2023-08-07 15:15:10 +02:00
a0a8ebe03f Add expect() for disabling truncation 2023-08-06 13:25:50 -07:00
efea6c7246 Handle when precompiled charsmap is empty (#1308)
* Handle when precompiled charsmap is empty

* Black

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-31 14:35:24 +02:00
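The guard added in #1308 can be sketched like this; `precompiled_normalize` is an illustrative helper (the charsmap is modeled as a plain character-replacement dict for the sketch), not the library's actual normalizer:

```python
def precompiled_normalize(text: str, charsmap: dict) -> str:
    # Sketch of the fix above: an empty precompiled charsmap is treated
    # as a no-op rather than an error, so normalization simply returns
    # the input unchanged.
    if not charsmap:
        return text
    return "".join(charsmap.get(ch, ch) for ch in text)
```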
c2664ae13f Give error when initializing tokenizer with too high stride (#1306)
* Split `get_n_added_tokens` into separate method

* Modify `TokenizerImpl.with_truncation()` to raise an error if given bad parameters

* Return Python error if `tokenizer.with_truncation()` fails

* Add dummy variable assignment for `no_truncation()` case

* Unrelated fmt fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-28 09:16:44 +02:00
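The validation introduced in #1306 can be sketched as follows; this is a hypothetical Python rendering of the kind of check described (a stride at or above `max_length` leaves no room for new content in each truncated window), not the exact Rust condition:

```python
def with_truncation(max_length: int, stride: int) -> dict:
    # Hypothetical sketch of the validation in #1306: reject parameter
    # combinations where the overlap (stride) would consume the whole
    # window, instead of silently misbehaving later.
    if stride >= max_length:
        raise ValueError(
            f"stride ({stride}) must be strictly less than max_length ({max_length})"
        )
    return {"max_length": max_length, "stride": stride}
```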
bb38f390a6 Single warning for holes. (#1303)
* Single warning for holes.

* Dummy.
2023-07-25 12:57:23 +02:00
d6326b2b88 feat: Added CITATION.cff. (#1302) 2023-07-25 12:16:09 +02:00
ea4d3f634c Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node (#1299)
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-21 08:08:10 +02:00
291b2e23ae Fixing clippy warnings on 1.71. (#1296)
* Fixing clippy warnings on 1.71.

* Fix.

* Fmt.

* Python clippy.

* Should really set my env back again.

* Fix.
2023-07-16 15:58:38 +02:00
4811f769a1 import Tuple from typing (#1295) 2023-07-14 17:39:29 +02:00
150559b61e master -> main (#1292) 2023-07-12 11:51:22 +02:00
92bfb9c993 Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node (#1291)
Bumps [tough-cookie](https://github.com/salesforce/tough-cookie) from 4.0.0 to 4.1.3.
- [Release notes](https://github.com/salesforce/tough-cookie/releases)
- [Changelog](https://github.com/salesforce/tough-cookie/blob/master/CHANGELOG.md)
- [Commits](https://github.com/salesforce/tough-cookie/compare/v4.0.0...v4.1.3)

---
updated-dependencies:
- dependency-name: tough-cookie
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-10 09:44:31 +02:00
26659de473 revise type specification (#1289) 2023-07-06 16:36:48 +02:00
864135bef1 Add unigram bytefallback (#1217)
* current updates will go red

* cargo fmt

* npm install

* refactor train for unigram to allow bytefallback (breaking)

* fmt

* nits

* update

* add a proper test

* fix encode optimised fallback + add trainer arg

* fixes

* fixes

* fix tests

* add test

* fmt

* fix rust test

* update python bindings

* update

* pub is okay and needed

* more fix

* cleanup

* remove useless id

* MissingUnkId error

* nits

* fix offset

* add a test in python

* update src bindings

* remove bytefallback from trainer

* styling

* update pckg

* lint

* fmt

* setup with dev

* update code based on review

* remove unused function

* update python test to compare ids

* fix option bool issues

* final fix

* clippy

* fix npm install

* update

* update test

* more in depth testing

* Lint

* last attempt to fix node

* update node bindings

* fmt

* Update tokenizers/src/models/unigram/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update based on review

* simpler test

* lint

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-06-26 10:46:59 +02:00
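The byte-fallback idea added in #1217 follows the SentencePiece convention of emitting one `<0xXX>` token per UTF-8 byte when a piece is out of vocabulary, so no input maps to a bare unknown token. A minimal sketch (`byte_fallback` is an illustrative helper, not the library API):

```python
def byte_fallback(piece: str, vocab: set) -> list:
    # If the piece is in the vocabulary, keep it as-is; otherwise fall
    # back to one <0xXX> token per UTF-8 byte (SentencePiece convention).
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]
```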
8c9cfb0b68 Improve error for truncation with too high stride (#1275) 2023-06-12 10:38:42 +02:00
348ed70e58 [doc build] Use secrets (#1273) 2023-06-09 12:58:27 +02:00
5d70f15bfb Update README.md - Broken link (#1272)
* Update README.md - Broken link

fixed "python documentation" link

* Update README.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-08 10:20:11 +02:00
f85e8467e4 Update Cargo.toml (#1266)
`cargo update` yields the following:
Updating regex v1.8.1 -> v1.8.3
Updating regex-syntax v0.7.1 -> v0.7.2
2023-06-07 09:57:18 +02:00
cb8d4de599 fix documentation regarding regex (#1264)
* fix documentation regarding regex

Split() in pre_tokenizers.rs and normalizers take a regex that must be built with the tokenizer-specific regex module.
Clarify this in the documentation.

* Update __init__.pyi

fixed __init__.pyi

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 6e8bdfcddf67bcdd8e3b1a78685fd5ef8f6a153c.

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 897b0c0de471ad7cb6269b8456347c4e5cff2aaf.

* Revert "Update __init__.pyi"

This reverts commit fbe82310b7728ee7cdb6f8b38fbc2388f9d95771.

* add codeblocks the right way

* add codeblocks with stub.py

ran setup.py install to build, and then ran stub.py

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-07 09:41:28 +02:00
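The distinction documented in #1264 — a plain string pattern is matched literally, while a `tokenizers.Regex` is compiled as a regular expression — can be sketched with the standard `re` module (illustrative helpers, not the library API):

```python
import re

def split_literal(text: str, pattern: str) -> list:
    # A plain-string pattern is escaped and matched literally.
    return [p for p in re.split(re.escape(pattern), text) if p]

def split_regex(text: str, pattern: str) -> list:
    # A Regex-wrapped pattern is compiled as a real regular expression.
    return [p for p in re.split(pattern, text) if p]
```

With the pattern `r"\d"`, the literal variant finds no backslash-d substring and leaves the text whole, while the regex variant splits on every digit.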
c7102c4c0f Fixing broken link. (#1268) 2023-06-06 11:10:28 +02:00
cb819724ef Update trainer.rs (#1257)
Implement skip for empty sentences.
Refer to:
https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L373
2023-05-25 12:24:29 +02:00
fc76ad4f07 Parallelize unigram trainer (#976)
* Parallelize unigram trainer

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>

* Rm unused lifetime

---------

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
2023-05-22 15:36:03 +02:00
a03330607b Update all GH Actions with dependency on actions/checkout from v[1,2] to v3 to notably improve performance (retrieve only the commit being checked-out) (#1256) 2023-05-22 14:50:00 +02:00
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content.

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Document the behavior and how it should be set.
Set default == 0 so it doesn't act unless set.
Provide ways in the Python binding for the user to set max token length.

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.


* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

clarify with type usize; propagate max_length option

* change max_length into more descriptive name

in the documentation
https://huggingface.co/docs/tokenizers/api/trainers
unigramtrainer uses max_piece_length for a similar function.
since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

add several max_token_length declarations that were missing:
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move default definition of max_token_length to its proper location. Adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
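The skip condition described in #1228 can be sketched as follows; `allowed_merge` is a hypothetical helper mirroring the described behavior (option unset means no limit), not the trainer's actual code:

```python
from typing import Optional

def allowed_merge(left: str, right: str, max_token_length: Optional[int]) -> bool:
    # Sketch of the condition above: with max_token_length unset, every
    # merge is allowed; otherwise merges whose result would exceed the
    # limit are skipped during BPE training.
    if max_token_length is None:
        return True
    return len(left) + len(right) <= max_token_length
```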
daf3fcc976 Revert main hiccup. 2023-05-15 18:01:29 +02:00
b58227c7f1 Never gonna make you cry 2023-05-12 16:28:57 +02:00
02ad59edc1 Never gonna run around and desert you 2023-05-12 16:27:06 +02:00
8d07696c38 Never gonna let you down 2023-05-12 16:24:26 +02:00
4518b0f7f2 fix unigram.rs test_sample() (#1244)
87230bb59b/tokenizers/tests/unigram.rs (LL71C1-L71C53)

When running cargo test --release, the above line causes an error.

referring to 87230bb59b/tokenizers/src/models/unigram/lattice.rs (L138)

It seems that Lattice::from should only take 3 arguments.
If I had to guess, it should be Lattice::from("ABC", 0, 2);
This change makes cargo test --release pass without error.
2023-05-10 17:04:34 +02:00
87230bb59b use LTO for release and benchmark builds (#1157) 2023-05-09 16:15:57 +02:00
15085ef905 Fixing padding_left sequence_ids. (#1233) 2023-05-04 15:57:20 +02:00
ef5f50605d Printing warning to stderr. (#1222) 2023-04-19 14:55:24 +02:00
d19bc63c67 Merge pull request #1212 from huggingface/fix-node-release
Fix node release
2023-04-06 16:25:29 +02:00