Commit Graph

1866 Commits

Author SHA1 Message Date
2204066e78 Bump body-parser and express in /tokenizers/examples/unstable_wasm/www (#1629)
Bumps [body-parser](https://github.com/expressjs/body-parser) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `body-parser` from 1.20.0 to 1.20.3
- [Release notes](https://github.com/expressjs/body-parser/releases)
- [Changelog](https://github.com/expressjs/body-parser/blob/master/HISTORY.md)
- [Commits](https://github.com/expressjs/body-parser/compare/1.20.0...1.20.3)

Updates `express` from 4.18.1 to 4.21.0
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.0/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.18.1...4.21.0)

---
updated-dependencies:
- dependency-name: body-parser
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-01 14:16:41 +02:00
3fb1371c1c [ignore_merges] Fix offsets (#1640)
* Fix the default offset create

* update the tests

* clippy
2024-10-01 09:22:20 +02:00
b4a38c4f63 Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows (#1626)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-30 16:38:28 +02:00
14a07b06e4 fix filelink (#1610) 2024-08-12 07:35:33 +02:00
75aef5b75b Update README.md (#1608) 2024-08-09 10:40:21 +02:00
81c471cf17 update dev version 0.20.0 2024-08-08 18:11:50 +02:00
85cc05a32f Fix CI (#1607) 2024-08-08 17:09:30 +02:00
bfd9cdeefb Perf improvement 16% by removing offsets. (#1587)
* [Breaking Change] Perf improvement 16% by removing offsets.

Offsets are always calculated in Python land.
By skipping that calculation, we save 16% of the runtime.

This is not the total extent of it because offsets are
still calculated in bytes.

* Required features.

* Remove clippy error.

* Make it non breaking and still show perf improvement.

* Even faster without offsets.

* Update doc.

* Fmt.

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fmt.

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-08 14:56:13 +02:00
bd27fa56d6 add deserialize for pre tokenizers (#1603)
* add deserialize

* copy from the decoder

* fmt

* clippy

* fix rust tests

* fmt

* don't change the test
2024-08-08 08:38:09 +02:00
56c9c70440 Tests + Deserialization improvement for normalizers. (#1604) 2024-08-08 08:38:02 +02:00
49dafd707e Fix strip python type (#1602)
* update

* the fix

* Revert "update"

This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.

* add a test and rebase

* style

* oops
2024-08-07 15:36:28 +02:00
bded212356 Support None to reset pre_tokenizers and normalizers, and index sequences (#1590)
* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* add a small test

* styling

* style merge

* fix merge test

* fmt

* nits

* update test
2024-08-07 12:52:35 +02:00
eea8e1ae6f Fix doc about split (#1591)
* update doc

* add example

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* stub

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-07 12:35:01 +02:00
6a5fce9fa0 Merges cannot handle tokens containing spaces. (#909)
* Merges cannot handle tokens containing spaces.

This fixes this while keeping backward support.
We don't want to merge that blindly.

* Update the tests.

* Fixing clippy.

* Add a test with spaces in the token/merge.
2024-08-07 12:34:53 +02:00
ab9c7ded8b Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588)
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.

* Putting it within tokenizers, it needs to be too specific.

* Clippy is our friend.

* Ruff.

* Update the tests.

* Pretty sure this is wrong (#1589)

* Adding support for ellipsis.

* Fmt.

* Ruff.

* Fixing tokenizer.

---------

Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
2024-08-07 12:08:29 +02:00
7a30bca2f3 Updating error messages. (#1599) 2024-08-06 16:42:56 +02:00
8f2cc90249 Add test normalizers (#1600)
* update

* update tests so they pass

* fmt
2024-08-06 16:08:18 +02:00
fe41687ca8 Better serialization error (#1595)
* Updating the deserialization error for models.

* Update tokenizers/src/models/mod.rs

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-06 13:39:11 +02:00
2d27761f60 Adding a few tests for decoder deserialization. 2024-08-06 13:36:36 +02:00
adc82cb49a Add-legacy-tests (#1597)
* add tests

* decoder as well

* check error

* propagate

* lint

* refine the test

* lint

* revert decoder changes

* one more?

* fmt

* Update tokenizers/src/pre_tokenizers/mod.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* fix commit

* simplify err

* fmt

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-06 13:08:12 +02:00
99a48dcb46 Clippy. 2024-08-06 10:48:39 +02:00
5fb8a2320c Legacy test. 2024-08-06 10:48:39 +02:00
388014fd6b Adding some serialization testing around the wrapper. 2024-08-06 10:48:39 +02:00
7b80359dd2 Fixing release CI strict (taken from safetensors). 2024-08-06 09:11:30 +02:00
a010f6b75c Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc Using serde (serde_pyo3) to get __str__ and __repr__ easily. 2024-08-02 18:41:54 +02:00
7415e28536 Enabling the option to use fancy_regex instead of onig. 2024-08-01 15:53:17 +02:00
9e0c791f2b Small performance fixup (negligible but obviously better). 2024-08-01 15:52:39 +02:00
1df498a186 Fixing benchmark2. 2024-08-01 15:52:39 +02:00
c6f2c0b057 Fixing the benchmark. (#1583) 2024-08-01 10:36:53 +02:00
35f338a7b8 Add benchmark vs tiktoken (#1582)
* Adding a simple tiktoken benchmark.

* Adding 1 large fused document case.
2024-07-31 17:09:23 +02:00
aface7a968 bump spm_precompiled to 0.1.3 (#1571) 2024-07-31 15:38:04 +02:00
a3ad85b3e8 Fix clippy + feature test management. (#1580)
* Fix clippy + feature test management.

* That example was local oops.

* Clippy fix.

* Readme indentation.

* README update.
2024-07-26 12:16:30 +02:00
4ea2f235b0 Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555)
* feature dependent test

* nit about 嗎

* update

* actually fix it

* update the test

add it

fix

* stub

* Update tokenizers/src/pre_tokenizers/byte_level.rs

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* skip failing test

* add normalizer to init

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2024-07-15 12:12:03 +02:00
f2a44dc5d1 Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … (#1569)
* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"

This reverts commit 25aee8b88c.

* don't remove audit

* deprecate id_to_token

* use simple id to token

* don't break id_to_token since we are deprecating anyways?
2024-07-12 07:29:40 +02:00
fdd26ba9a3 Enable dropout = 0.0 as an equivalent to none in BPE (#1550)
* enable dropout = 0.0

* typo

* lint

* formatter
2024-06-24 12:36:11 +02:00
9441f7e8f7 make sure we don't warn on empty tokens (#1554)
* make sure we don't warn on empty tokens

* Testing the log is actually hard 😓

* empty
2024-06-20 14:33:21 +02:00
3e736bbccb Fix clippy 2024-06-20 09:39:19 +02:00
1ff56c0c70 Fix 'dictionnary' typo (#1511) 2024-06-11 15:43:47 +02:00
88f51fe7d2 Switch from cached_download to hf_hub_download in tests (#1547) 2024-06-11 15:26:58 +02:00
418c35c09e feat(ci): add trufflehog secrets detection (#1551)
* feat(ci): add trufflehog secrets detection

* fix(ci): remove unnecessary permissions
2024-06-10 16:10:23 +02:00
8d28dbefd1 Fixing for clippy 1.78 (#1548) 2024-06-06 13:18:59 +02:00
bfefcf676d Make USED_PARALLELISM atomic (#1532) 2024-06-06 13:02:26 +02:00
25aee8b88c [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)
* [BREAKING CHANGE] Ignore added_tokens (both special and not) in the
decoder

Causes issues with `ByteLevel` messing up some `AddedTokens` with some
utf-8 range used in the bytelevel mapping.

This commit tests the extent of the damage of ignoring the decoder for
those tokens.

* Format.

* Installing cargo audit.

* Minor fix.

* Fixing "bug" in node/python.

* Autoformat.

* Clippy.

* Only prefix space when there's no decoder.
2024-05-06 11:49:38 +02:00
f2ec3b239b remove enforcement of non special when adding tokens (#1521)
* remove enforcement of non special when adding tokens

* mut no longer needed

* add a small test

* nit

* style

* audit

* ignore cargo audit's own vulnerability

* update

* revert

* remove CVE
2024-04-30 15:53:47 +02:00
71c2a8d01a update dev version to 0.19.1 2024-04-17 23:17:12 +02:00
7733bc25d6 add serialization for ignore_merges (#1504)
* add serialization for `ignore_merges`

* add serialization tests

* deserialize without `ignore_merges`
2024-04-17 21:56:48 +02:00
91393ef75e Fixing doc. (#1499)
* Fixing doc.

* SentencePieceUnigram and Convert.py still used sentencepiece

* stub

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-04-17 09:32:40 +02:00
949d9e3e0e Bumping all versions 3 times (ty transformers :) ) (#1498) 2024-04-16 15:58:36 +02:00
e0defa7355 Remove 3.13 (potential undefined behavior.) (#1497) 2024-04-16 15:56:47 +02:00