Commit Graph

1842 Commits

SHA1 Message Date
a010f6b75c Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc Using serde (serde_pyo3) to get __str__ and __repr__ easily. 2024-08-02 18:41:54 +02:00
7415e28536 Enabling the option to use fancy_regex instead of onig. 2024-08-01 15:53:17 +02:00
9e0c791f2b Small performance fixup (negligible but obviously better). 2024-08-01 15:52:39 +02:00
1df498a186 Fixing benchmark2. 2024-08-01 15:52:39 +02:00
c6f2c0b057 Fixing the benchmark. (#1583) 2024-08-01 10:36:53 +02:00
35f338a7b8 Add benchmark vs tiktoken (#1582)
* Adding a simple tiktoken benchmark.

* Adding 1 large fused document case.
2024-07-31 17:09:23 +02:00
aface7a968 bump spm_precompiled to 0.1.3 (#1571) 2024-07-31 15:38:04 +02:00
a3ad85b3e8 Fix clippy + feature test management. (#1580)
* Fix clippy + feature test management.

* That example was local oops.

* Clippy fix.

* Readme indentation.

* README update.
2024-07-26 12:16:30 +02:00
4ea2f235b0 Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555)
* feature dependent test

* nit about 嗎

* update

* actually fix it

* update the test

add it

fix

* stub

* Update tokenizers/src/pre_tokenizers/byte_level.rs

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* skip failing test

* add normalizer to init

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2024-07-15 12:12:03 +02:00
f2a44dc5d1 Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … (#1569)
* Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)"

This reverts commit 25aee8b88c.

* don't remove audit

* deprecate id_to_token

* use simple id to token

* don't break id_to_token since we are deprecating anyways?
2024-07-12 07:29:40 +02:00
fdd26ba9a3 Enable dropout = 0.0 as an equivalent to none in BPE (#1550)
* enable dropout = 0.0

* typo

* lint

* formatter

* enable dropout = 0.0

* formatter
2024-06-24 12:36:11 +02:00
9441f7e8f7 make sure we don't warn on empty tokens (#1554)
* make sure we don't warn on empty tokens

* Testing the log is actually hard 😓

* empty
2024-06-20 14:33:21 +02:00
3e736bbccb Fix clippy 2024-06-20 09:39:19 +02:00
1ff56c0c70 Fix 'dictionnary' typo (#1511) 2024-06-11 15:43:47 +02:00
88f51fe7d2 Switch from cached_download to hf_hub_download in tests (#1547) 2024-06-11 15:26:58 +02:00
418c35c09e feat(ci): add trufflehog secrets detection (#1551)
* feat(ci): add trufflehog secrets detection

* fix(ci): remove unnecessary permissions
2024-06-10 16:10:23 +02:00
8d28dbefd1 Fixing for clippy 1.78 (#1548) 2024-06-06 13:18:59 +02:00
bfefcf676d Make USED_PARALLELISM atomic (#1532) 2024-06-06 13:02:26 +02:00
25aee8b88c [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder (#1513)
* [BREAKING CHANGE] Ignore added_tokens (both special and not) in the
decoder

Causes issues with `ByteLevel` messing up some `AddedTokens` with some
utf-8 range used in the bytelevel mapping.

This commit tests the extent of the damage of ignoring the decoder for
those tokens.

* Format.

* Installing cargo audit.

* Minor fix.

* Fixing "bug" in node/python.

* Autoformat.

* Clippy.

* Only prefix space when there's no decoder.
2024-05-06 11:49:38 +02:00
f2ec3b239b remove enforcement of non special when adding tokens (#1521)
* remove enforcement of non special when adding tokens

* mut no longer needed

* add a small test

* nit

* style

* audit

* ignore cargo audit's own vulnerability

* update

* revert

* remove CVE
2024-04-30 15:53:47 +02:00
71c2a8d01a update dev version to 0.19.1 2024-04-17 23:17:12 +02:00
7733bc25d6 add serialization for ignore_merges (#1504)
* add serialization for `ignore_merges`

* add serialization tests

* deserialize without `ignore_merges`
2024-04-17 21:56:48 +02:00
91393ef75e Fixing doc. (#1499)
* Fixing doc.

* SentencePieceUnigram and Convert.py still used sentencepiece

* stub

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-04-17 09:32:40 +02:00
949d9e3e0e Bumping all versions 3 times (ty transformers :) ) (#1498) 2024-04-16 15:58:36 +02:00
e0defa7355 Remove 3.13 (potential undefined behavior.) (#1497) 2024-04-16 15:56:47 +02:00
d5a8cc7a49 PyO3 0.21. (#1494)
* PyO3 0.21.

* Upgraded everything.

* Rustfmt.
2024-04-16 13:49:52 +02:00
914576f7ed Add more support for tiktoken based tokenizers (#1493)
* first commit

* update

* clippy

* lint

* clippy and lint

* fmt

* revert print

* 😈

* style

* add a test

* more fmt

* Use ignore_merges

* stub

* fix

* update

* Update tokenizers/src/models/bpe/model.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* update

* rust lint

* don't repeat yourself

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-04-15 17:26:36 +02:00
6e58f838b3 version = "0.16.0-dev.0" 2024-04-02 09:51:14 +02:00
09069717e9 Refactor metaspace (#1476)
* version = "0.15.3-dev-0"

Improve the performance of metaspace, and also fix it.

(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]

* well what do we have

* nit

* be BC with non legacy

* unrelated change for clippy

* fix test

* splitting is a must for word_ids

* fmt and lint

* Fixing everything (hopefully better).

* Fixing node.

* Including yarn.lock

* Lint.

* Stubs.

* revert to use split

* fix merge issues

* fix tests

* finish fixing tests

* ruff

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-30 10:27:24 +01:00
6153126b22 Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. (#1443)
* Fixes.

* Fixes.
2024-03-30 00:29:54 +01:00
d8c4388166 Bump ip from 2.0.0 to 2.0.1 in /bindings/node (#1456)
Bumps [ip](https://github.com/indutny/node-ip) from 2.0.0 to 2.0.1.
- [Commits](https://github.com/indutny/node-ip/compare/v2.0.0...v2.0.1)

---
updated-dependencies:
- dependency-name: ip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 11:29:36 +01:00
29fef1e7aa [remove black] And use ruff (#1436)
* nits

* Fixing deps.

* Ruff update.

* Import order matters.

* Fix.

* Revert ruff fix.

* Visualizer.

* Putting back the imports.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-12 11:24:21 +01:00
72a1973cd1 chore: Remove CLI - this was originally intended for local development (#1442) 2024-02-13 04:05:43 +01:00
7f49f20ab0 version = "0.15.3-dev-0" 2024-02-12 09:48:00 +09:00
c893204c45 Efficient Replace normalizer (#1413)
* new Replace work

* clean up

* clean up

* typo

* cargo fmt

* Clippy.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-02-06 14:36:44 +01:00
4a8105c366 Convert word counts to u64 (#1433)
* Convert word counts to u64

* More spots needed to compile
2024-02-06 03:39:12 +01:00
67fe59c88d chore: Update dependencies to latest supported versions (#1441) 2024-01-22 17:54:37 +01:00
8f73fe9515 update dev version to 0.15.2-dev.0 2024-01-22 15:34:57 +01:00
accd0650b8 Update release for python3.12 windows (#1438)
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-01-19 15:56:47 +01:00
6a77d4859b Encode special tokens (#1437)
* add doc in the code

* add option to skip special tokens

* nits

* add api dummy for now

* Fmt.

* Fix fmt.

* Fix the stub.

* add a test

* add a test in python

* style it

* nits

* add getter and setters

* stub

* update python test

* fmt

* last nit

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-01-19 12:43:43 +01:00
888dd4bc65 pyo3: update to 0.20 (#1386)
Co-authored-by: Mike Lui <mikelui@meta.com>
2024-01-11 17:03:13 +01:00
8939d4e26d Bump follow-redirects in /tokenizers/examples/unstable_wasm/www (#1430)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.1 to 1.15.4.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.1...v1.15.4)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-10 12:04:48 +01:00
43b31a83c7 Fix make bench. (#1428) 2024-01-08 09:53:51 +01:00
f1c23b8680 Add quick doc to byte_level.rs (#1420)
* Add quick doc to byte_level.rs

* Address PR comments
2024-01-03 10:25:07 +01:00
11462596d1 Faster HF dataset iteration in docs (#1414)
* Faster HF dataset iteration in docs

* Nit
2023-12-14 16:12:56 +01:00
8edec536a7 Fix doc links in readme (#1367)
* Fix doc links in readme

* even better?
2023-12-09 12:14:54 +01:00
8f9b945c75 Stale bot. (#1404) 2023-12-05 14:11:37 +01:00
daf361676b Derive Clone on Tokenizer, add Encoding.into_tokens() method (#1381)
* Add `into_tokens()` method

* derive clone

* Update tokenizers/src/tokenizer/encoding.rs

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-11-20 09:56:29 +01:00
e3bcef288b update to version = "0.15.1-dev0" (#1390)
* Apply suggestions from code review
2023-11-15 13:30:58 +01:00