tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-22 16:25:30 +00:00

Author	SHA1	Message	Date
sftse	6ea758872d	Unsound call of `set_var` (#1664) * refactor: lift cloning to caller * refactor: do not elide lifetimes as in Rust 2018 * fix: unsound use of env::set_var, was breaking stdlib change to make unsafe It is generally not safe to set env variables. The correct way to set a config value that needs to be overridden is to hold a copy internal to the library and only read from the environment.	2024-10-25 15:44:30 +02:00
Arthur	49dafd707e	Fix strip python type (#1602) * update * the fix * Revert "update" This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9. * add a test and rebase * style * oups	2024-08-07 15:36:28 +02:00
Arthur	bded212356	Support `None` to reset pre_tokenizers and normalizers, and index sequences (#1590) * initial commit * support None * fix clippy * cleanup * clean? * propagate to pre_tokenizer * fix test * fix rust tests * fix node * propagate to decoder and post processor * fix calls * lint * fmt * node be happy I am fixing you * initial commit * support None * fix clippy * cleanup * clean? * propagate to pre_tokenizer * fix test * fix rust tests * fix node * propagate to decoder and post processor * fix calls * lint * fmt * node be happy I am fixing you * add a small test * styling * style merge * fix merge test * fmt * nits * update tset	2024-08-07 12:52:35 +02:00
Nicolas Patry	ab9c7ded8b	Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588) * Using serde (serde_pyo3) to get __str__ and __repr__ easily. * Putting it within tokenizers, it needs to be too specific. * Clippy is our friend. * Ruff. * Update the tests. * Pretty sure this is wrong (#1589) * Adding support for ellipsis. * Fmt. * Ruff. * Fixing tokenizer. --------- Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>	2024-08-07 12:08:29 +02:00
Nicolas Patry	a010f6b75c	Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily." This reverts commit `86138337fc`.	2024-08-02 18:42:57 +02:00
Nicolas Patry	86138337fc	Using serde (serde_pyo3) to get __str__ and __repr__ easily.	2024-08-02 18:41:54 +02:00
Arthur	4ea2f235b0	Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555) * feature dependent test * nit about 嗎 * update * actuallyfix it * update the test add it fix * stub * Update tokenizers/src/pre_tokenizers/byte_level.rs Co-authored-by: Luc Georges <McPatate@users.noreply.github.com> * skip failing test * add normalizer to init --------- Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>	2024-07-15 12:12:03 +02:00
Nicolas Patry	d5a8cc7a49	PyO3 0.21. (#1494) * PyO3 0.21. * Upgraded everything. * Rustfmt.	2024-04-16 13:49:52 +02:00
Michael Lui	540bf2eb01	pyo3: update to 0.19 (#1322) * Bump pyo3 dependency versions * Fix deprecation warnings from pyo3 --------- Co-authored-by: Mike Lui <mikelui@meta.com>	2023-08-16 18:40:32 +02:00
Nicolas Patry	d2c8190a0f	Creating `normalizers.Prepend` (To be used instead of `Metaspace`). (#1194) * Creating `normalizers.Prepend` (To be used instead of `Metaspace`). * Linting + stub. * Fixing pickling/unpickling by setting a default. * Black.	2023-03-24 00:33:31 +01:00
mert-kurttutan	5c18ec5ff5	pyo3 v0.18 migration (#1173) * pyo v0.18 migration * Fix formatting issues of black	2023-03-08 11:27:47 +01:00
Cameron	11bb2e00f2	Add python 3.11 to manylinux buildwheels (#1096) * Add python 3.11 to manylinux buildwheels * Fixing clippy. * Node clippy. * Python clippy. * Changelog + version number update. Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-11-07 08:45:04 +01:00
David Hewitt	8129dd3309	pyo3: update to 0.17 (#1066) * python: update bindings to edition 2021 * python: update to pyo3 0.17 * Updating testing. Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-10-05 16:59:01 +02:00
h-vetinari	519cc13be0	Upgrade pyo3 to 0.16 (#956) * Upgrade pyo3 to 0.15 Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com> * Upgrade pyo3 to 0.16 Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com> * Install Python before running cargo clippy * Fix clippy warnings * Use `PyArray_Check` instead of downcasting to `PyArray1<u8>` * Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features` * Fix some test cases Why do they change? * Refactor and add SAFETY comments to `PyArrayUnicode` Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData` Co-authored-by: messense <messense@icloud.com>	2022-05-05 15:48:40 +02:00
Nicolas Patry	256a71c1f2	Clippy 1.54. (#773)	2021-08-11 14:43:49 +02:00
Anthony MOI	56a9196030	Fix clippy warnings	2021-03-16 12:32:06 -04:00
Anthony MOI	db22cb6315	Python - Fix Normalizer.normalize with PyNormalizedStringRefMut	2021-02-03 15:48:53 -05:00
Anthony MOI	817c5ad317	Fix clippy warnings for rust 1.49	2021-01-06 15:03:33 -05:00
Anthony MOI	5c35fafc44	Python - Decoders can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	2feccdbbfa	Python - PyStrip can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	7512d5e4ce	Python - PyBertNormalizer can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	c22cfc31f9	Python - PyNormalizer & PyPreTokenizer use a RwLock	2020-11-27 17:35:34 -05:00
Anthony MOI	5842b3db73	Python - Improve normalizers docs	2020-11-23 11:52:51 -05:00
Nicolas Patry	352c92ad33	Automatically stubbing the `pyi` files while keeping inspecting ability (#509) * First pass on automatic stubbing our python files. * And now modifying all rust docs to be visible in Pyi files. * Better assert fail message. * Fixing github workflow. * Removing types not exported anymore. * Fixing `Tokenizer` signature. * Disabling auto __init__.py. * Re-enabling some types. * Don't overwrite non automated __init__.py * Automated most __init__.py * Restubbing after rebase. * Fixing env for tests. * Install blakc in the env. * Use PY35 target in stub.py Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-17 15:13:00 -05:00
Nicolas Patry	88556790e7	Fixing a bug where long tokenizer files would be incorrectly deserialized (#459) * Fixing a bug where long tokenizer files would be incorrectly deserialized - Add a bunch of tests to check deserialization behaviour - One tests also confirms current Single deserialization of Sequence. * Better test locations for Windows + no file dependency in Python binding Rust side. * Adressing @n1t0 comments.	2020-10-13 18:44:24 +02:00
Anthony MOI	8308508577	Python - Update bindings for Replace Normalizer	2020-09-24 08:05:57 -04:00
Anthony MOI	b6e7a6e2f7	Python - Update PyNormalizer interface	2020-09-23 15:50:01 -04:00
Anthony MOI	8d04b22278	Python - Add support for custom Normalizer	2020-09-23 15:50:01 -04:00
Anthony MOI	940f8bd8fa	Update PyO3 (#426)	2020-09-22 12:00:20 -04:00
Nicolas Patry	aea22a4004	Adding node bindings. - simplify normalizer. - simplify python bindings.	2020-09-18 12:24:39 +02:00
Nicolas Patry	792d618006	Adding a new "Replace" normalizer that takes a string and replaces it with another String (for now).	2020-09-18 12:24:39 +02:00
Nicolas Patry	75464734df	Adding a new normalizer that strips accents by removing combining (#416) * Adding a new normalizer that strips accents by removing combining characters in unicode strings. * Adding Node bindings + better normalizer impl. * Doc comment -> Regular comment.	2020-09-17 09:49:41 +02:00
Nicolas Patry	330876ae02	Improvements on spm parity: (#401) * Removing all pre_tokenizer logic from Unigram algorithm. * Improving a lot the parity check. - We can now detect a lot more errors - Special cases have been added temporarily. * Adding 2 new normalizers that mimick spm defaut's behavior. * Adding `encoding_optimized` version of the `encode` algorithm. - Removes Lattice allocation. - Changes trie `common_prefix_search` to return an iterator to avoid allocation of the full results. * Trie<char> -> Trie<u8> Another improvement on speed. * [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100% compliant with arbitrary models. * Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`. - It will be used for direct compatiblity with `Spm` and replace all their custom rules by using directly the normalizer spec embedded within spm files, removing all need for any rules for us. - We need `nom` dependency to parse the binary format of `spm`. - We need to add `sentencepiece_model_pb2.py` file to be able to read the proto file. - We reimplemented their `Darts::DoubleArray` compact trie format. * Fixing a bug with Precompiled normalizer. * Fixing some edge cases (now in tests) with this weird precompiled normalizer. It seems a very handy crafted trie does not prevent from shooting oneself in the foot. Sorry future reader. * Keep API stable for this PR (change of the API should come later #409). - Removed sentencepiece_model_pb2 from binding and add instructions to make `from_spm` work. * Adding model check in `from_spm`. * Adressing @n1t0's comments. * Adding a check to make sure alignments stay correct. Also added a bit more documentation on how Precompiled works. * Extracting `Precompiled` into it's own `spm_precompiled` crate. * Using ranges in `do_nmt`.	2020-09-15 22:21:02 +02:00
Nicolas Patry	df827d538f	Adding clippy as a linter within the Python binding. (#388) * Adding clippy as a linter within the Python binding. * Missing clippy (dropped commit ??)	2020-09-04 09:09:02 -04:00
Nicolas Patry	52082b5476	New clippy comments?	2020-09-02 16:32:50 +02:00
Anthony MOI	504d8c85d8	Remove Tokenizer::normalize This is actually a legacy function that doesn't really make sense now, and is getting really difficult to keep. So we remove it.	2020-08-19 12:42:12 -04:00
Sebastian Puetz	16f75d9efc	Ensure serialization works in all expected ways.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	08b8c48127	Remove Container from Normalizers, replace with Arc. * prefix the Python types in Rust with Py * remove unsound Container wrappers, replace with Arc	2020-08-04 15:59:33 -04:00
Anthony MOI	7a95ffc4fa	BertNormalizer has same behavior than original implem	2020-07-06 13:55:18 -04:00
Anthony MOI	c5bba91bf4	Python - Test and fix classes pickling	2020-05-27 13:46:37 -04:00
Anthony MOI	6a70162d78	Python - Make all relevant classes pickable	2020-05-27 13:46:37 -04:00
Anthony MOI	be7b345bcd	Require Send for all parts of the tokenizer (#222)	2020-04-08 13:35:06 -04:00
Andre Bogus	550413f00a	add Send + Sync on all traits, remove elsewhere	2020-04-08 18:43:23 +02:00
Bjarte Johansen	2dc48e56ac	Python - Update pyo3 version * Use __new__ instead of static method as model constructors	2020-04-06 21:20:16 +02:00
Morgan Funtowicz	afe9cfe96e	Strip should inherits from Normalizer on Python binding. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-03-31 20:20:09 +02:00
Anthony MOI	f263d7651f	Python - RustFmt	2020-02-18 15:07:34 -05:00
Funtowicz Morgan	bb8321ac0d	Add Strip normalizer (#140) * WIP strip. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Rust StripNormalizer Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Allow to specify strip direction Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Renamed StripNormalizer to Strip Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added Python binding. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Makes Strip python compatible with pythonic constructor. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Run RustFmt Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Clippy next ofc. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Move lstrip and rstrip on NormalizedString Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * implment strip() for normalizer + unittests. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Add some more unittests on edge cases. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * clippy and fmt. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Simplify strip and fix offsets * Python - Update strip bindings with default values Co-authored-by: MOI Anthony <xn1t0x@gmail.com>	2020-02-17 11:26:40 +01:00
Bjarte Johansen	0e5d81b400	Implement __new__ on Normalizers __new__ allows Normalizers to be initialized as normal python objects. This also means that Normalizers are given the correct class name.	2020-02-10 10:43:19 +01:00
Anthony MOI	5bc1e2ee05	Add Lowercase Normalizer	2020-01-07 19:40:19 -05:00
Anthony MOI	185b6f0b8b	Add Sequence Normalizer	2020-01-06 21:03:05 -05:00

53 Commits