* version = "0.15.3-dev-0”
Improve the performance of Metaspace, but also just fix it.
```
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]

(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
```
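The dummy script itself is not part of this log; a minimal sketch of the kind of timing loop that would print a list of seconds like the ones above (the checkpoint name and input sizes are assumptions, not the script's actual contents):

```python
import time

from tokenizers import Tokenizer

# Hypothetical sketch: ../scripts/gemma-dummy.py is not shown here, so the
# checkpoint and the input sizes below are assumptions for illustration only.
tok = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

timings = []
for n in (10, 100, 1_000, 10_000, 100_000):
    text = "Hey " * n
    start = time.perf_counter()
    tok.encode(text)
    timings.append(time.perf_counter() - start)
print(timings)  # list of seconds growing with input size, as in the runs above
```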
* well what do we have
* nit
* be BC with non-legacy
* unrelated change for clippy
* fix test
* splitting is a must for word_ids (see the sketch after this list)
* fmt and lint
* Fixing everything (hopefully better).
* Fixing node.
* Including yarn.lock
* Lint.
* Stubs.
* revert to use split
* fix merge issues
* fix tests
* finish fixing tests
* ruff
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
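To make "splitting is a must for word_ids" concrete, here is a hedged sketch of the pre-tokenizer behaviour (the `split` switch and its setter are assumed from the changes above; outputs are illustrative):

```python
from tokenizers.pre_tokenizers import Metaspace

# Sketch only: parameter and attribute names are assumed from this release.
pre = Metaspace(replacement="▁", prepend_scheme="always")

pre.split = True
print(pre.pre_tokenize_str("Hey my friend"))
# e.g. [('▁Hey', (0, 3)), ('▁my', (3, 6)), ('▁friend', (6, 13))]
# one pre-token per word, so word_ids can still be mapped back to words

pre.split = False
print(pre.pre_tokenize_str("Hey my friend"))
# e.g. [('▁Hey▁my▁friend', (0, 13))]
# a single pre-token: word boundaries are lost and word_ids become meaningless
```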
* add doc in the code
* add option to skip special tokens (see the example after this list)
* nits
* add api dummy for now
* Fmt.
* Fix fmt.
* Fix the stub.
* add a test
* add a test in python
* style it
* nits
* add getter and setters
* stub
* update python test
* fmt
* last nit
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
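The log above does not spell out exactly where the new skip-special-tokens option surfaces; as a hedged point of reference, the same idea on the Python decoding side looks like this (the checkpoint name is an assumption):

```python
from tokenizers import Tokenizer

# Any tokenizer that registers special tokens behaves the same way here.
tok = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

ids = tok.encode("Hey.").ids
print(tok.decode(ids, skip_special_tokens=False))  # keeps any special-token markers such as "<s>"
print(tok.decode(ids, skip_special_tokens=True))   # drops them from the decoded text
```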
* nits
* allow for legacy behaviour without making any breaking changes
* add a todo
* set to legacy by default
* skip legacy serialization
* push correct update
* lint
* add deserialization test
* add a python test as well
* updates
* fix serialization tests
* nits
* python styling of the tests
* better tests
* fix offsets
* fix imports
* fmt
* update metaspace
* remove TODO
* use enum
* fix some tests
* nits
* use enum
* update tests
* styling
* remove impl From for PrependScheme
* use simple getters and setters
* lint
* update tests
* add test new == new_with_prepend_scheme
* revert a change
* use setters and getters
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* nits
* use copy rather than ref
* nits format
* more nits
* allow option string
* enforce camel-cased First, Never, Always (see the sketch after this list)
* nits
* refactor
* update test as well
* fmt
* nits
* properly error out
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* suggestion changes
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
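A hedged sketch of the resulting Python surface for the prepend scheme, assuming the lowercase strings "first" / "never" / "always" map onto the camel-cased Rust enum variants:

```python
from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace()               # defaults assumed: replacement="▁", prepend_scheme="always"
print(pre.prepend_scheme)       # -> "always"

pre.prepend_scheme = "first"    # simple setter; an unknown string should error out cleanly

print(Metaspace(prepend_scheme="never").pre_tokenize_str("Hey you"))
# e.g. [('Hey', (0, 3)), ('▁you', (3, 7))] -- nothing prepended to the leading word
```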
* Fixing the progressbar.
* Upgrade deps.
* Update cargo audit
* Ssh this action.
* Fixing esaxx by using the slower Rust version.
* Trying the new esaxx version.
* Publish.
* Get cache again.