706 Commits

Author SHA1 Message Date
fb292d1eae 0.13.4.rc1 (#1319) 2023-08-14 12:06:43 +02:00
efea6c7246 Handle when precompiled charsmap is empty (#1308)
* Handle when precompiled charsmap is empty

* Black

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-31 14:35:24 +02:00
c2664ae13f Give error when initializing tokenizer with too high stride (#1306)
* Split `get_n_added_tokens` into separate method

* Modify `TokenizerImpl.with_truncation()` to raise an error if given bad parameters

* Return Python error if `tokenizer.with_truncation()` fails

* Add dummy variable assignment for `no_truncation()` case

* Unrelated fmt fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-28 09:16:44 +02:00
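
The stride check described in #1306 can be sketched as follows. This is a minimal illustration, not the library's code: the function name `check_truncation` and the exact rule (the stride must be strictly smaller than the max length minus the number of added special tokens) are assumptions made for this sketch.

```python
def check_truncation(max_length: int, stride: int, n_added_tokens: int) -> int:
    """Return the effective max length, or raise if the stride cannot fit.

    Hypothetical helper: the name and the exact rule are assumptions.
    """
    # Room left for real tokens once special tokens are accounted for.
    effective_max_length = max_length - n_added_tokens
    if effective_max_length <= 0:
        raise ValueError(
            f"max_length={max_length} leaves no room after "
            f"{n_added_tokens} added special tokens"
        )
    # A stride this large would make overflowing windows never advance.
    if stride >= effective_max_length:
        raise ValueError(
            f"stride={stride} must be smaller than the effective max length "
            f"{effective_max_length} (= {max_length} - {n_added_tokens})"
        )
    return effective_max_length
```

Failing at configuration time, rather than at encode time, surfaces the bad parameters where they were set.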
291b2e23ae Fixing clippy warnings on 1.71. (#1296)
* Fixing clippy warnings on 1.71.

* Fix.

* Fmt.

* Python clippy.

* Should really set my env back again.

* Fix.
2023-07-16 15:58:38 +02:00
4811f769a1 import Tuple from typing (#1295) 2023-07-14 17:39:29 +02:00
26659de473 revise type specification (#1289) 2023-07-06 16:36:48 +02:00
864135bef1 Add unigram bytefallback (#1217)
* current updates will go red

* cargo fmt

* npm install

* refactor train for unigram to allow bytefallback (breaking)

* fmt

* nits

* update

* add a proper test

* fix encode optimised fallback + add trainer arg

* fixes

* fixes

* fix tests

* add test

* fmt

* fix rust test

* update python bindings

* update

* pub is okay and needed

* more fix

* cleanup

* remove useless id

* MissingUnkId error

* nits

* fix offset

* add a test in python

* update src bindings

* remove bytefallback from trainer

* styling

* update pckg

* lint

* fmt

* stup with dev

* update code based on review

* remove unused function

* update python test to compare ids

* fix option bool issues

* final fix

* clippy

* fix npm install

* update

* update test

* more in depth testing

* Lint

* last attempt to fix node

* update node bindings

* fmt

* Update tokenizers/src/models/unigram/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update based on review

* simpler test

* lint

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-06-26 10:46:59 +02:00
cb8d4de599 fix documentation regarding regex (#1264)
* fix documentation regarding regex

Split() in pre_tokenizers.rs and the normalizers take a regex that must be built with the tokenizer-specific regex module.
Clarify this in the documentation.

* Update __init__.pyi

fixed __init__.pyi

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update bindings/python/py_src/tokenizers/__init__.pyi

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 6e8bdfcddf67bcdd8e3b1a78685fd5ef8f6a153c.

* Revert "Update bindings/python/py_src/tokenizers/__init__.pyi"

This reverts commit 897b0c0de471ad7cb6269b8456347c4e5cff2aaf.

* Revert "Update __init__.pyi"

This reverts commit fbe82310b7728ee7cdb6f8b38fbc2388f9d95771.

* add codeblocks the right way

* add codeblocks with stub.py

ran setup.py install to build, and then ran stub.py

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2023-06-07 09:41:28 +02:00
b4fcc9ce6e Makes decode and decode_batch work on borrowed content. (#1251)
* Makes `decode` and `decode_batch` work on borrowed content.

* Make `decode_batch` work with borrowed content.

* Fix lint.

* Attempt to map it into Node.

* Second attempt.

* Step by step.

* One more step.

* Fix lint.

* Please ...

* Removing collect.

* Revert "Removing collect."

This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-17 11:18:15 +02:00
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Documentation describing the behavior and how it should be set.
Set default==0 so it doesn't act unless set.
Provide ways in the Python binding for the user to set max token length.

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

clarify with type usize propagate max_length option

* change max_length into more descriptive name

In the documentation
https://huggingface.co/docs/tokenizers/api/trainers
UnigramTrainer uses max_piece_length for a similar function.
Since the underlying concept in BPE is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

add several max_token_length declaration that were missing.
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move default definition of max_token_length to proper location. adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
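
The idea behind `max_token_length` in #1228 can be illustrated with a toy BPE trainer. This is a sketch under stated assumptions, not the code in `trainer.rs`: the function `train_bpe`, its word-count input, and the choice to measure length in characters are all hypothetical. The only point it shows is that a candidate pair is skipped when the merged token would exceed the limit.

```python
from collections import Counter

def train_bpe(word_counts, num_merges, max_token_length=None):
    """Toy BPE trainer (hypothetical). Returns the list of learned merges."""
    # Start from single characters for every word.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                # Skip pairs whose merged token would be too long.
                if max_token_length is not None and len(a) + len(b) > max_token_length:
                    continue
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges

words = {"aaaaaa": 5, "abab": 3}
limited = train_bpe(words, 10, max_token_length=2)
unlimited = train_bpe(words, 10)
```

With the limit set to 2, training stops once every remaining pair would produce a token longer than two characters; without it, merges keep growing until whole words are single tokens.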
ef5f50605d Printing warning to stderr. (#1222) 2023-04-19 14:55:24 +02:00
ce244bd094 remove rc1 2023-04-04 16:19:42 +02:00
1cb44bd180 New version 0.13.3 2023-04-04 14:14:17 +02:00
3aaf4946b3 Add content to Strip decoder to allow decoding mid tokens. (#1199)
* Add `content` to Strip decoder to allow decoding mid tokens.

* Stub.

* Clippy.
2023-03-24 10:14:49 +01:00
e4aea890d5 Adding 2 new decoders: (#1196)
* Adding 2 new decoders:

- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n char from left or right

Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.

- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; `MustBe`
ensures there's no issue between Fuse and ByteFallback.
Since both are new, the chance of backward incompatibility is low.

* Fixing pickling/unpickling (using default args.).

* Stub.

* Black.

* Fixing node.
2023-03-24 00:50:54 +01:00
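
The `Sequence(Replace("_", " "), Fuse(), Strip(1, 0))` chain from #1196 can be mimicked in plain Python to show what each step does. This is a sketch, not the library's decoders: the helper functions below are hypothetical stand-ins, and it assumes the `_` in the message stands for the Metaspace marker `▁` (U+2581).

```python
def replace(tokens, pattern, content):
    # Applied to every token independently.
    return [t.replace(pattern, content) for t in tokens]

def fuse(tokens):
    # Concatenate all tokens into a single string.
    return ["".join(tokens)]

def strip(tokens, left, right):
    # Drop `left` characters from the start and `right` from the end.
    return [t[left:max(left, len(t) - right)] for t in tokens]

def decode(tokens):
    """Replace -> Fuse -> Strip, then join whatever tokens remain."""
    steps = [
        lambda ts: replace(ts, "\u2581", " "),
        fuse,
        lambda ts: strip(ts, 1, 0),
    ]
    for step in steps:
        tokens = step(tokens)
    return "".join(tokens)

text = decode(["\u2581Hello", "\u2581world"])
```

Fusing before stripping matters: stripping first would remove the leading space of every token, not just the one at the start of the whole string.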
d2c8190a0f Creating normalizers.Prepend (To be used instead of Metaspace). (#1194)
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).

* Linting + stub.

* Fixing pickling/unpickling by setting a default.

* Black.
2023-03-24 00:33:31 +01:00
250d46c676 Adding Replace to decoder (to undo the Replace Normalizer for (#1195)
Metaspace split).
2023-03-23 23:43:47 +01:00
178e294a6a Merge pull request #1192 from huggingface/faster-datasets-train-example
Faster `datasets` train example
2023-03-23 17:19:05 +01:00
73637a0004 Adding ByteFallback support for tokenizers. (#1183)
* Adding ByteFallback support for `tokenizers`.

Two items added:

- A flag `byte_fallback` for the `BPE` model. This will be in charge
  of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
  back into string whenever possible. Showing � when the byte decoding
  fails (behavior checked against LlamaTokenizer in `transformers`).

* Update rustdoc.

* Clippy + Add BPE(byte_fallback) into bindings.

* Stupid file.

* Test artifacts removed.

* Update stub.

* Fix.

* Bad file.

* CRITICAL FIX: wrapper order because of untagged....

* Remove prints.

* Fixing <16 byte fallback.
2023-03-23 16:04:32 +01:00
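
The two halves of byte fallback described in #1183 can be sketched in plain Python. This is an illustration only: `byte_fallback_tokens` and `byte_fallback_decode` are hypothetical names, and emitting one � per undecodable byte token is an assumption of this sketch rather than a statement about the library.

```python
import re

# Byte tokens look like <0x61>: exactly two hex digits, zero-padded.
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")

def byte_fallback_tokens(text):
    """Model side: spell out an unknown piece as one token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

def byte_fallback_decode(tokens):
    """Decoder side: turn runs of byte tokens back into text when possible."""
    out, pending = [], []

    def flush():
        if pending:
            try:
                out.append(bytes(pending).decode("utf-8"))
            except UnicodeDecodeError:
                # Assumption: one replacement character per bad byte token.
                out.extend("\ufffd" for _ in pending)
            pending.clear()

    for token in tokens:
        match = BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(token)
    flush()
    return out
```

The zero-padded `02X` format is what keeps bytes below 16 (for example a newline, `<0x0A>`) in the same two-digit shape as every other byte token.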
e76f900bc0 Faster datasets train example
Using .iter() is much faster than accessing using row ids
2023-03-23 11:24:30 +01:00
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo v0.18 migration

* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
9b155b5723 [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. (#1136)
* [fix] Use unk_token

In SentencePieceBPETokenizer, when Vocab or  merges is None, unk_token cannot be used.

* [fix] If unk_token is None, this case is also considered.

* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* [FIX] In CharBPETokenizer, Use unk_token.

In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used.

* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* Update bindings/python/py_src/tokenizers/implementations/char_level_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-12-27 11:13:52 +01:00
4d520c9664 Ignore Cargo.lock for subfolders (#1131) 2022-12-25 11:35:47 +01:00
fbad581128 Bump derive_builder from 0.9 to 0.12 (#1129) 2022-12-23 23:37:16 +01:00
9a25b2cb8e [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. (#1120)
* [fix] Use unk_token

In SentencePieceBPETokenizer, when Vocab or  merges is None, unk_token cannot be used.

* [fix] If unk_token is None, this case is also considered.

* Update bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-12-19 13:40:04 +01:00
bbae829a72 Adding rust audit. (#1099)
* Adding rust audit.

* Update clap version + derive_builder (they clashed).

* Ignoring specific CVE which can be ignored

https://github.com/Azure/iot-identity-service/issues/481

* Updating python lock.

* Revert `derive-builder` update.

* Adding back help msg.
2022-11-09 12:59:36 +01:00
b8a4aa6000 Fixing extra wheels memory usage. (#1098) 2022-11-07 09:11:18 +01:00
11bb2e00f2 Add python 3.11 to manylinux buildwheels (#1096)
* Add python 3.11 to manylinux buildwheels

* Fixing clippy.

* Node clippy.

* Python clippy.

* Changelog + version number update.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-11-07 08:45:04 +01:00
96a9e5715c New version. (#1082)
* New version.

The actual release will happen *before* PyO3 0.17.2 because
the tests were run before then.

* Manylinux2014 necessary now with Rust 1.64.
2022-10-06 15:45:56 +02:00
8129dd3309 pyo3: update to 0.17 (#1066)
* python: update bindings to edition 2021

* python: update to pyo3 0.17

* Updating testing.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-10-05 16:59:01 +02:00
6113666624 Updating python formatting. (#1079)
* Updating python formatting.

* Forgot gh action.

* Skipping isort to prevent circular imports.

* Updating stub.

* Removing `isort` (it contradicts `stub.py`).

* Fixing weird stub black/isort disagreement.
2022-10-05 15:29:33 +02:00
5f6e978452 Fixing roberta type id (everything is zero). (#1072)
* Fixing roberta type ids (everything is zero).

* We need to fix type_ids for all sequences even when not changing anything else.

* Fixing tests hopefully better.
2022-09-26 18:00:41 +02:00
6e5569a540 Moving versions numbers to dev mode. (#1067) 2022-09-22 18:24:07 +02:00
63082c4d11 Enabling static interpreter embedding for manylinux. (#1064)
* Removing dead file.

* Checking that we can distribute with static python embedding for manylinux

* Many linux embed interpreter.

* Building wheels manylinux with static embedding

* Better script.

* typo.

* Using a dummy feature?

* default features ?

* Back into order.

* Fixing manylinux ??.

* Local dir.

* Missing star.

* Makedir ?

* Monkey coding this.

* extension module ?

* Building with default features `RustExtension`.

* bdist_wheel + rustextension any better ?

* update rust-py version.

* Forcing extension module.

* No default features.

* Remove py37 out of spite

* Revert "Remove py37 out of spite"

This reverts commit 6ab7facd792b59c2e30be82fe42816d24c32cf0d.

* Really extraneous feature.

* Fix build wheels.

* Putting things back in place.
2022-09-21 12:18:46 +02:00
655f4057b7 Removing python3.6 from manylinux it's not supported anymore. (#1063) 2022-09-19 12:22:02 +02:00
7bfab48979 Preparing rc1 release. (#1056)
* Preparing rc1 release.

* Fixing test_alignment_methods

* Fixing the overflowing sequence_id issue (LayoutLMv2 tests caught this).

* Adding overly complex overflowing test.
2022-09-12 16:07:06 +02:00
06025e4ca1 Adding Sequence for PostProcessor. (#1052)
* Adding `Sequence` for `PostProcessor`.

* Fixing node? Writing in the dark here, don't have Python2.7

* `undefined` is not accepted.

* Other test.
2022-08-25 14:50:06 +02:00
460bdded80 Modify Processor trait to support chaining. (#1054)
0 modifications yet, everything will consume the vector.
Every test should be green without any modifications.
2022-08-24 19:49:23 +02:00
b1c9bc68b5 Updating code according to clippy. (#1048)
- Adding `Eq` where possible
- Denied the ref deref warnings as it was spamming and solution not
  really better.
2022-08-24 19:45:15 +02:00
adf90dcd72 Adding unstable_wasm feature + example to run tokenizers on wasm. (#1009)
* Adding `unstable_wasm` feature + example to run `tokenizers` on wasm.

Co-Authored-By: josephrocca <1167575+josephrocca@users.noreply.github.com>
Co-Authored-By: Matthias Brunel <matthias.brunel@mithrilsecurity.io>

* Adding some serialization tests.

* Updating with comments.

Co-authored-by: josephrocca <1167575+josephrocca@users.noreply.github.com>
Co-authored-by: Matthias Brunel <matthias.brunel@mithrilsecurity.io>
2022-06-10 14:58:02 +02:00
943b5421aa Changing Decoder trait to be more composable. (#938) (#1008)
* Changing `Decoder` trait to be more composable. (#938)

* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.

* Adding `Sequence` Decoder.
2022-06-02 14:43:42 +02:00
519cc13be0 Upgrade pyo3 to 0.16 (#956)
* Upgrade pyo3 to 0.15

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Upgrade pyo3 to 0.16

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Install Python before running cargo clippy

* Fix clippy warnings

* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`

* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`

* Fix some test cases

Why do they change?

* Refactor and add SAFETY comments to `PyArrayUnicode`

Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`

Co-authored-by: messense <messense@icloud.com>
2022-05-05 15:48:40 +02:00
e6cd73a291 .dev0 suffix in python version (#987) 2022-04-22 09:36:18 +02:00
95b5d066d5 Update doc build gh workflow to install rust 2022-04-21 09:20:20 +02:00
c2aa87a256 Add setup.py extras["dev"] 2022-04-19 15:14:44 +02:00
66c9af26f6 Fixing the documentation for ByteLevel in Python (#982)
* Fixing the documentation for `ByteLevel` in Python

* Python stub.py (after rebuilding ofc).
2022-04-14 16:29:50 +02:00
8a9bb28f46 Preparing for 0.12.1 (#978)
* Preparing for 0.12.1

* Updated the changelog.
2022-04-12 17:57:33 +02:00
ec43947786 Revert "Changing Decoder trait to be more composable. (#938)" (#971)
This reverts commit cdabef14c4.
2022-04-04 09:43:28 +02:00
0eb7455fe5 Preparing 0.12 release. (#967)
* Preparing `0.12` release.

* Fix click version: https://github.com/psf/black/issues/2964
2022-03-31 11:06:33 +02:00
a5f644616b Fix the error test for Python 3.10 (error message is different). (#962) 2022-03-23 10:35:58 +01:00