Commit Graph

144 Commits

Author SHA1 Message Date
cc5fb01a2f Decode stream python (#1678)
* Python binding for decode stream

Different API because Python cannot handle lifetimes properly.

* Clippy.
2024-11-15 12:06:22 +01:00
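The streaming-decode API this commit adds can be sketched as follows; a minimal example assuming the binding is exposed as `tokenizers.decoders.DecodeStream` with a `step()` method, and a local `tokenizer.json`:

```python
from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream

tokenizer = Tokenizer.from_file("tokenizer.json")  # any serialized tokenizer
stream = DecodeStream(skip_special_tokens=True)

# Instead of holding a lifetime-bound stream (as in Rust), the Python API
# passes the tokenizer into each step; step() returns the newly decoded
# chunk of text, or None if more ids are needed to form valid text.
for token_id in tokenizer.encode("Hello world").ids:
    chunk = stream.step(tokenizer, token_id)
    if chunk is not None:
        print(chunk, end="")
```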
49dafd707e Fix strip python type (#1602)
* update

* the fix

* Revert "update"

This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.

* add a test and rebase

* style

* oops
2024-08-07 15:36:28 +02:00
bded212356 Support None to reset pre_tokenizers and normalizers, and index sequences (#1590)
* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* add a small test

* styling

* style merge

* fix merge test

* fmt

* nits

* update test
2024-08-07 12:52:35 +02:00
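A hedged sketch of the behavior #1590 describes: assigning `None` resets a previously set component, and `Sequence` containers can be indexed:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC(), normalizers.Lowercase()]
)

print(tokenizer.normalizer[1])  # index into the Sequence (Lowercase)
tokenizer.normalizer = None     # reset: no normalization applied anymore
tokenizer.pre_tokenizer = None  # same reset propagated to pre-tokenizers
```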
ab9c7ded8b Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588)
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.

* Putting it within tokenizers, it needs to be too specific.

* Clippy is our friend.

* Ruff.

* Update the tests.

* Pretty sure this is wrong (#1589)

* Adding support for ellipsis.

* Fmt.

* Ruff.

* Fixing tokenizer.

---------

Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
2024-08-07 12:08:29 +02:00
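What the serde-based `__str__`/`__repr__` buys in practice, sketched (the exact output format is an assumption):

```python
from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.NFC(), normalizers.Lowercase()])
# Before this change, printing a component showed an opaque object;
# now it renders its configuration, roughly like:
# Sequence(normalizers=[NFC(), Lowercase()])
print(repr(norm))
```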
4ea2f235b0 Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555)
* feature dependent test

* nit about 嗎

* update

* actually fix it

* update the test

add it

fix

* stub

* Update tokenizers/src/pre_tokenizers/byte_level.rs

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* skip failing test

* add normalizer to init

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2024-07-15 12:12:03 +02:00
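A minimal sketch of using the normalizer #1555 introduces, assuming it is exported as `normalizers.ByteLevel`:

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
# Apply the byte-level mapping at the normalizer stage so that tokens
# added on top of a byte-level BPE vocabulary decode correctly.
tokenizer.normalizer = normalizers.ByteLevel()
```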
fdd26ba9a3 Enable dropout = 0.0 as an equivalent to none in BPE (#1550)
* enable dropout = 0.0

* typo

* lint

* formatter
2024-06-24 12:36:11 +02:00
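After #1550, both spellings below construct the same model (previously `dropout=0.0` was rejected and only `None` disabled dropout):

```python
from tokenizers.models import BPE

model_a = BPE(dropout=0.0)   # now accepted, equivalent to no dropout
model_b = BPE(dropout=None)  # the previously required spelling
```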
88f51fe7d2 Switch from cached_download to hf_hub_download in tests (#1547) 2024-06-11 15:26:58 +02:00
f2ec3b239b remove enforcement of non special when adding tokens (#1521)
* remove enforcement of non special when adding tokens

* mut no longer needed

* add a small test

* nit

* style

* audit

* ignore cargo audit's own vulnerability

* update

* revert

* remove CVE
2024-04-30 15:53:47 +02:00
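With the enforcement removed, a token can be added as special through `add_tokens()` as well; a small sketch:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
# Previously add_tokens() forced special=False; now the flag set on
# the AddedToken is respected.
tokenizer.add_tokens([AddedToken("<ctrl>", special=True)])
```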
91393ef75e Fixing doc. (#1499)
* Fixing doc.

* SentencePieceUnigram and Convert.py still used sentencepiece

* stub

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-04-17 09:32:40 +02:00
09069717e9 Refactor metaspace (#1476)
* version = "0.15.3-dev-0”

Improve performances of meta space, but also just fix it.

(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]

* well what do we have

* nit

* be BC with non legacy

* unrelated change for clippy

* fix test

* splitting is a must for word_ids

* fmt and lint

* Fixing everything (hopefully better).

* Fixing node.

* Including yarn.lock

* Lint.

* Stubs.

* revert to use split

* fix merge issues

* fix tests

* finish fixing tests

* ruff

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-30 10:27:24 +01:00
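The refactor replaces Metaspace's legacy handling with an explicit prepend scheme; a hedged usage sketch:

```python
from tokenizers import pre_tokenizers

# "first" prepends ▁ only to the first split, matching sentencepiece;
# "always" and "never" are the other accepted values.
pre = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme="first")
print(pre.pre_tokenize_str("Hey friend!"))
```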
29fef1e7aa [remove black] And use ruff (#1436)
* nits

* Fixing deps.

* Ruff update.

* Import order matters.

* Fix.

* Revert ruff fix.

* Visualizer.

* Putting back the imports.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-12 11:24:21 +01:00
6a77d4859b Encode special tokens (#1437)
* add doc in the code

* add option to skip special tokens

* nits

* add api dummy for now

* Fmt.

* Fix fmt.

* Fix the stub.

* add a test

* add a test in python

* style it

* nits

* add getter and setters

* stub

* update python test

* fmt

* last nit

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-01-19 12:43:43 +01:00
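A sketch of the option #1437 adds, assuming the getter/setter is exposed as `encode_special_tokens` on the Python `Tokenizer`:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")
tokenizer.encode_special_tokens = True  # split specials like regular text
# "<|endoftext|>" is now tokenized into pieces instead of matching the
# single special token id.
print(tokenizer.encode("<|endoftext|>").ids)
```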
11462596d1 Faster HF dataset iteration in docs (#1414)
* Faster HF dataset iteration in docs

* Nit
2023-12-14 16:12:56 +01:00
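The pattern the docs change promotes, sketched (assumes the `datasets` package; slicing by ranges is far faster than row-by-row iteration):

```python
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    # Slicing a Dataset returns a dict of columns for the whole batch.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
```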
f55822baea [pre_tokenizers] Fix sentencepiece based Metaspace (#1357)
* nits

* allow for legacy behaviour without making any breaking changes

* add a todo

* set to legacy by default

* skip legacy serialization

* push correct update

* lint

* add deserialization test

* add a python test as well

* updates

* fix serialization tests

* nits

* python styling of the tests

* better tests

* fix offsets

* fix imports

* fmt

* update metaspace

* remove TODO

* use enum

* fix some tests

* nits

* use enum

* update tests

* styling

* remove impl from for PrependScheme

* use simple getters and setters

* lint

* update tests

* add test new == new_with_prepend_scheme

* revert a change

* use setters and getters

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* nits

* use copy rather than ref

* nits format

* more nits

* allow option string

* enforce First Never Always camel cased

* nits

* refactor

* update test as well

* fmt

* nits

* properly error out

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* suggestion changes

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-11-14 18:05:07 +01:00
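Per the "use simple getters and setters" item above, the scheme should also be adjustable on an existing instance; a hedged sketch:

```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Metaspace()
pre.prepend_scheme = "never"  # plain attribute, validated on assignment
```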
26fdfc2bc3 style 2023-09-05 16:42:45 +00:00
08af8ea9c3 make tests happy 2023-09-05 15:37:09 +00:00
f1da83f358 add support for get_added_tokens_decoder 2023-09-05 14:49:29 +00:00
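A small sketch of the new accessor, which maps token id to `AddedToken` for every token added on top of the base vocabulary:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")
# e.g. {50256: AddedToken("<|endoftext|>", ...)} for GPT-2
print(tokenizer.get_added_tokens_decoder())
```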
93b37f36dc styling 2023-09-04 20:54:55 +00:00
058e34b421 make special editable as well 2023-09-04 20:54:29 +00:00
d4008b0d7a clippy 2023-09-04 19:11:05 +00:00
b117ac7f16 updates 2023-09-04 19:10:22 +00:00
a53dff9bc5 make content writable in python 2023-09-04 18:18:21 +00:00
d9829cdc6e fix more tests 2023-09-04 17:22:27 +00:00
864135bef1 Add unigram bytefallback (#1217)
* current updates will go red

* cargo fmt

* npm install

* refactor train for unigram to allow bytefallback (breaking)

* fmt

* nits

* update

* add a proper test

* fix encode optimised fallback + add trainer arg

* fixes

* fixes

* fix tests

* add test

* fmt

* fix rust test

* update python bindings

* update

* pub is okay and needed

* more fix

* cleanup

* remove useless id

* MissingUnkId error

* nits

* fix offset

* add a test in python

* update src bindings

* remove bytefallback from trainer

* styling

* update package

* lint

* fmt

* stub with dev

* update code based on review

* remove unused function

* update python test to compare ids

* fix option bool issues

* final fix

* clippy

* fix npm install

* update

* update test

* more in depth testing

* Lint

* last attempt to fix node

* update node bindings

* fmt

* Update tokenizers/src/models/unigram/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update based on review

* simpler test

* lint

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-06-26 10:46:59 +02:00
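A hedged sketch of the flag on the Python side; the toy vocab is an assumption (a real byte-fallback vocab also contains the 256 `<0x..>` byte tokens):

```python
from tokenizers.models import Unigram

vocab = [("<unk>", 0.0), ("▁", -1.0), ("▁hello", -2.0)]
# With byte_fallback=True, characters missing from the vocab are encoded
# as <0x..> byte tokens instead of collapsing to <unk>.
model = Unigram(vocab, unk_id=0, byte_fallback=True)
```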
cefc41e8ec implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the unigram trainer, but in a different way.

If this code were to be actually integrated, some work remains to be done:

Documentation describing the behavior and how it should be set.
Set default == 0 so it doesn't act unless set.
Provide ways in the python binding for the user to set the max token length.

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and, to be honest, it's very difficult, and regexes can be really slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

clarify with type usize; propagate the max_length option

* change max_length into more descriptive name

in the documentation
https://huggingface.co/docs/tokenizers/api/trainers
UnigramTrainer uses max_piece_length for a similar function.
Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

add several max_token_length declarations that were missing.
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move the default definition of max_token_length to the proper location. Adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 10:08:19 +02:00
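The resulting knob, sketched: a cap on the character length of tokens produced by merges, mirroring the unigram trainer's `max_piece_length`:

```python
from tokenizers import trainers

# Merges that would create a token longer than 16 characters are skipped;
# leaving the argument unset keeps the old unbounded behavior.
trainer = trainers.BpeTrainer(vocab_size=1000, max_token_length=16)
```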
3aaf4946b3 Add content to Strip decoder to allow decoding mid tokens. (#1199)
* Add `content` to Strip decoder to allow decoding mid tokens.

* Stub.

* Clippy.
2023-03-24 10:14:49 +01:00
e4aea890d5 Adding 2 new decoders: (#1196)
* Adding 2 new decoders:

- Fuse will simply concatenate all tokens into 1 string
- Strip will remove n char from left or right

Sequence(Replace("_", " "), Fuse(), Strip(1, 0)) should be what we want
for the `Metaspace` thing.

- Note: Added a new dependency for better parsing of decoders.
This is due to untagged enums, which can match anything; the `MustBe`
ensures there's no issue between Fuse and ByteFallback.
Since both are new, the chances of backward incompatibility are low.

* Fixing pickling/unpickling (using default args).

* Stub.

* Black.

* Fixing node.
2023-03-24 00:50:54 +01:00
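The `Sequence(Replace, Fuse, Strip)` pipeline mentioned above, rendered in Python against the current decoder signatures (an assumption for `Strip`, whose `content` argument landed in #1199):

```python
from tokenizers import decoders

dec = decoders.Sequence([
    decoders.Replace("▁", " "),  # undo the Metaspace replacement
    decoders.Fuse(),             # concatenate all tokens into one string
    decoders.Strip(" ", 1, 0),   # strip the single leading space
])
print(dec.decode(["▁Hello", "▁world"]))  # "Hello world"
```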
d2c8190a0f Creating normalizers.Prepend (To be used instead of Metaspace). (#1194)
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).

* Linting + stub.

* Fixing pickling/unpickling by setting a default.

* Black.
2023-03-24 00:33:31 +01:00
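A minimal sketch of the new normalizer:

```python
from tokenizers import normalizers

norm = normalizers.Prepend("▁")
print(norm.normalize_str("Hello"))  # "▁Hello"
```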
250d46c676 Adding Replace to decoder (to undo the Replace Normalizer for (#1195)
Metaspace split).
2023-03-23 23:43:47 +01:00
73637a0004 Adding ByteFallback support for tokenizers. (#1183)
* Adding ByteFallback support for `tokenizers`.

Two items added:

- A flag `byte_fallback` for the `BPE` model. This will be in charge
  of using `<0x61>` instead of unk on unknown tokens.
- A ByteFallback decoder, which will be in charge of putting everything
  back into strings whenever possible, showing � when the byte decoding
  fails (behavior checked against LlamaTokenizer in `transformers`).

* Update rustdoc.

* Clippy + Add BPE(byte_fallback) into bindings.

* Stupid file.

* Test artifacts removed.

* Update stub.

* Fix.

* Bad file.

* CRITICAL FIX: wrapper order because of untagged....

* Remove prints.

* Fixing <16 byte fallback.
2023-03-23 16:04:32 +01:00
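The two pieces together, sketched:

```python
from tokenizers import decoders
from tokenizers.models import BPE

model = BPE(byte_fallback=True)  # emit <0x61>-style tokens instead of unk
dec = decoders.ByteFallback()    # stitch bytes back into strings
print(dec.decode(["<0x61>", "<0x62>"]))  # "ab"; invalid bytes render as �
```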
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
6113666624 Updating python formatting. (#1079)
* Updating python formatting.

* Forgot gh action.

* Skipping isort to prevent circular imports.

* Updating stub.

* Removing `isort` (it contradicts `stub.py`).

* Fixing weird stub black/isort disagreement.
2022-10-05 15:29:33 +02:00
06025e4ca1 Adding Sequence for PostProcessor. (#1052)
* Adding `Sequence` for `PostProcessor`.

* Fixing node? Writing in the dark here, don't have Python2.7

* `undefined` is not accepted.

* Other test.
2022-08-25 14:50:06 +02:00
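A hedged sketch of chaining post-processors with the new `Sequence`:

```python
from tokenizers import processors

post = processors.Sequence([
    processors.ByteLevel(trim_offsets=True),
    processors.TemplateProcessing(
        single="<s> $A </s>",
        special_tokens=[("<s>", 0), ("</s>", 1)],
    ),
])
```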
943b5421aa Changing Decoder trait to be more composable. (#938) (#1008)
* Changing `Decoder` trait to be more composable. (#938)

* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.

* Adding `Sequence` Decoder.
2022-06-02 14:43:42 +02:00
ec43947786 Revert "Changing Decoder trait to be more composable. (#938)" (#971)
This reverts commit cdabef14c4.
2022-04-04 09:43:28 +02:00
a5f644616b Fix the error test for Python 3.10 (error message is different). (#962) 2022-03-23 10:35:58 +01:00
1bb9884f45 Fixing the vocab size of the trained Unigram model (#952)
* Fixing the vocab size of the trained Unigram model

* add test for the vocab size of the trained Unigram model

* Revert "add test for the vocab size of the trained Unigram model"

This reverts commit fb8955c831b357d1037548ceaa8789734d544646.

* Fixing the vocab size of the trained Unigram model

* format codes

* get the position of vocab-size calculation out of loop
2022-03-18 18:13:17 +01:00
cdabef14c4 Changing Decoder trait to be more composable. (#938)
* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.
2022-03-17 10:32:09 +01:00
4b6055d4fb Adding pickling support for trainers (#949)
* TMP.

* Adding support for pickling Python trainers.

* Remove not warranted files + missed naming updates.

* Stubbing.

* Making sure serialized format is written in python tests.
2022-03-14 12:18:11 +01:00
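What the pickling support enables, sketched (trainers now round-trip like the other components):

```python
import pickle

from tokenizers import trainers

trainer = trainers.BpeTrainer(vocab_size=5000)
restored = pickle.loads(pickle.dumps(trainer))  # previously raised
```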
1a84958cc8 Fixing bad deserialization following inclusion of a default for Punctuation. (#884)
* Fixing bad deserialization following inclusion of a default for
`Punctuation`.

* don't remove the type now...

* Adding slow test to run on all the tokenizers of the hub.

* `PartialEq` everywhere.

* Forcing `type` to exist on the `pre_tokenizers`.
2022-01-17 22:28:25 +01:00
1054e243e2 Fix invalid continuing subword prefix. (#864)
* Creating failing test for invalid continuing subword prefix.

* Test in rust + the associated fix.

* Clippy.

* Black.
2022-01-04 14:25:35 +01:00
152880ab3e Adding truncation_side within TruncationParams. (#860)
* Add truncation to enable_truncation

* Fix typo

* Adding truncation_side within `TruncationParams`.

* Node serialization of this direction param.

* Update the test.

* Fixing warnings/lint.

* Adding stuff (can't local debug :( )

* Slow loop... ;(

* Stub.py.

Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>
2021-12-28 12:37:06 +01:00
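On the Python side the new parameter surfaces as `direction` in `enable_truncation`; a minimal sketch:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# "left" keeps the tail of over-long sequences; "right" is the default.
tokenizer.enable_truncation(max_length=8, direction="left")
```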
c4c9de23a5 Feature: Handle invalid truncate direction (#858)
* refactor: TruncateDirection -> TruncationDirection

* feat(node): invalid direction will throw

* feat(python): invalid direction will throw

* Update bindings/node/lib/bindings/raw-encoding.test.ts

* Update bindings/python/tests/bindings/test_encoding.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2021-12-27 14:31:57 +01:00
c1100ec542 Clippy fixes. (#846)
* Clippy fixes.

* Drop support for Python 3.6

* Remove other 3.6

* Re-enabling caches for build (5h+ seems too long and the issue seems
solved)

https://github.com/actions/virtual-environments/issues/572

* `npm audit fix`.

* Fix yaml ?

* Pyarrow issue fixed: https://github.com/huggingface/datasets/pull/2268

* Installing dev libraries.

* Install python dev elsewhere ?

* Typo.

* No sudo.

* ...

* Testing the GH again.

* Maybe v2 will fix ?

* Fixing tests on MacOS Python 3.8+
2021-12-15 15:55:48 +01:00
35c96e5e3f Add tests for from_pretrained 2021-08-31 09:00:05 -04:00
e2bf8daa3a Add SplitDelimiterBehavior to Punctuation constructor (#657)
Resolves: #642
2021-08-13 09:19:23 -04:00
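A small usage sketch of the new constructor argument:

```python
from tokenizers import pre_tokenizers

# behavior is a SplitDelimiterBehavior name: "isolated" (the default),
# "removed", "merged_with_previous", "merged_with_next", or "contiguous".
pre = pre_tokenizers.Punctuation(behavior="isolated")
print(pre.pre_tokenize_str("Hey, you!"))
```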
da4c7b10e4 Add a way to specify the unknown token in SentencePieceUnigramTokenizer python implem (#762)
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`

* add test that verify that an exception is raised for the missing unknown token

* style

* add test tokens
2021-08-12 09:42:44 -04:00
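A hedged sketch of the new argument on the training helpers (toy corpus; assumes `unk_token` is accepted by `train_from_iterator`):

```python
from tokenizers import SentencePieceUnigramTokenizer

tok = SentencePieceUnigramTokenizer()
tok.train_from_iterator(
    ["a toy corpus", "with a few sentences"],
    vocab_size=100,
    unk_token="<unk>",  # previously the unknown token was not settable
)
```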
2e2e7558f7 Add CTC Decoder for Wave2Vec models (#693)
* Rust - add a CTCDecoder as a separate mod

* Adding bindings to Node + Python.

* Clippy update.

* Stub.

* Fixing roberta.json URLs.

* Moving test files to hf.co.

* Update cargo check and clippy to 1.52.

* Inner ':' actually is used for domains in sphinx.

Making `domain` work correctly was just too much work, so I went the easy
way and used global roles for the custom rust extension.

* Update struct naming and docs

* Update changelog

Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:30:09 -04:00
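A sketch of the decoder for CTC output (Wav2Vec2-style): collapse repeats, drop pad tokens, and map the word delimiter to spaces:

```python
from tokenizers import decoders

dec = decoders.CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)
frames = ["h", "e", "l", "l", "<pad>", "l", "o", "|", "h", "i"]
print(dec.decode(frames))  # "hello hi"
```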
57200144ca Python - Fix ByteLevel instantiation from state (#621) 2021-02-04 10:16:05 -05:00
6a29dbc070 Doc - Hotfix training from iterators tutorial 2021-02-03 15:50:09 -05:00