tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-03 11:18:29 +00:00

Author	SHA1	Message	Date
Kaito Sugimoto	1bb9884f45	Fixing the vocab size of the trained Unigram model (#952 ) * Fixing the vocab size of the trained Unigram model * add test for the vocab size of the trained Unigram model * Revert "add test for the vocab size of the trained Unigram model" This reverts commit fb8955c831b357d1037548ceaa8789734d544646. * Fixing the vocab size of the trained Unigram model * format codes * get the position of vocab-size calculation out of loop	2022-03-18 18:13:17 +01:00
Nicolas Patry	daa4dd2288	Making the regex in ByteLevel optional. (#939 ) * Making the regex in ByteLevel optional. * Changed the stub. * Beter stub. * Typo fix. * Remove bad comments.	2022-03-18 09:03:20 +01:00
Nicolas Patry	cdabef14c4	Changing `Decoder` trait to be more composable. (#938 ) * Changing `Decoder` trait to be more composable. Fix #872 * Fixing Python side. * Fixing test. * Updating cleanup signature, removing turbofish.	2022-03-17 10:32:09 +01:00
Nicolas Patry	4b6055d4fb	Adding pickling support for trainers (#949 ) * TMP. * Adding support for pickling Python trainers. * Remove not warranted files + missed naming updates. * Stubbing. * Making sure serialized format is written in python tests.	2022-03-14 12:18:11 +01:00
dctelus	71ae5421eb	Python - add initial_alphabet to spm unigram trainer (#942 ) * Python - add initial_alphabet to spm unigram trainer * Python - use optional instead of mutable defaults in spm unigram trainer	2022-03-09 09:54:03 +01:00
dctelus	98249dfb0f	Python - add doctype to length in implementations spm unigram (#943 )	2022-03-08 11:59:07 +01:00
dctelus	4a8f5db067	Python - Add length to train_from_iterator in implementations (#937 )	2022-03-04 14:11:58 +01:00
Luc Georges	845da6d8e8	Feat/m1 manual build (#936 ) * feat(bindings): move target compilation flags to correct config file * feat(bindings): m1 build 'script' * feat(ci): for loop in bdist_wheel script for m1 Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-03-02 14:44:13 +01:00
Nicolas Patry	a4a68de98a	Workarounds publishing issues: - Upgrade package-lock.json (cannot find VS code attempt) - Use published `macro_rules_attribute` so `cargo publish` works.	2022-02-28 11:16:46 +01:00
Nicolas Patry	ffaee13994	Preparing for 0.11.6 release.	2022-02-28 10:20:49 +01:00
Nicolas Patry	2fecdc10dd	Update the CHANGELOG.	2022-02-16 13:07:31 +01:00
Nicolas Patry	5679323bbc	Minor version bump.	2022-02-16 12:51:11 +01:00
Thomas Wang	88d718207a	tokenizer.save has the wrong arguments compared to documentation (#901 ) * tokenizer.save has the wrong arguments compared to documentation * Fixing doc of `save` function. Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2022-02-15 17:55:55 +01:00
JC Louis	448054f3c7	fix python3.10 build (#895 )	2022-01-28 17:51:51 +01:00
Nicolas Patry	a8e07d734f	Changelog.	2022-01-17 22:31:54 +01:00
Nicolas Patry	9b85424520	Version bump.	2022-01-17 22:30:25 +01:00
Nicolas Patry	1a84958cc8	Fixing bad deserialization following inclusion of a default for `Punctuation`. (#884 ) * Fixing bad deserialization following inclusion of a default for `Punctuation`. * don't remove the type now... * Adding slow test to run on all the tokenizers of the hub. * `PartialEq` everywhere. * Forcing `type` to exist on the `pre_tokenizers`.	2022-01-17 22:28:25 +01:00
Nicolas Patry	c2fd765087	Update Cargo.lock for Python.	2022-01-17 10:32:46 +01:00
Nicolas Patry	a4cf53f6a7	Update CHANGELOG.	2022-01-17 09:56:56 +01:00
Nicolas Patry	ab9a2f3100	Update versions.	2022-01-17 09:40:01 +01:00
JC Louis	cabbecb96c	add python3.10 release (#877 ) * add missing python3.9 classifier * add python3.10 release * run tests on 3.10 * Revert "run tests on 3.10" This reverts commit ceed64249e54b6ec622b06c59bf47da7c6dfc1b0.	2022-01-12 09:42:13 +01:00
Nicolas Patry	076319d542	Aho corasick version for many added tokens. (#871 ) * Aho corasick version. * Remove test file. * Compile on `stable`.	2022-01-06 16:04:51 +01:00
Nicolas Patry	8e0d66a254	New python version.	2022-01-04 14:58:02 +01:00
Nicolas Patry	6972e49f1d	Fix the clippy warnings. (#869 )	2022-01-04 14:32:07 +01:00
Nicolas Patry	1054e243e2	Fix invalid continuing subwrd prefix. (#864 ) * Creating failing test for invalid continuing subwrd prefix. * Test in rust + the associated fix. * Clippy. * Black.	2022-01-04 14:25:35 +01:00
Nicolas Patry	4122a33f09	Fixing missing `direction` in TruncationParams. (#868 )	2022-01-04 14:21:46 +01:00
Nicolas Patry	7069988ffe	Update to 0.11.1	2021-12-28 13:59:31 +01:00
Nicolas Patry	152880ab3e	Adding truncation_side within `TruncationParams`. (#860 ) * Add truncation to enable_truncation * Fix typo * Adding truncation_side within `TruncationParams`. * Node serialization of this direction param. * Update the test. * Fixing warnings/lint. * Adding stuff (can't local debug :( ) * Slow loop... ;( * Stub.py. Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>	2021-12-28 12:37:06 +01:00
Luc Georges	c4c9de23a5	Feature: Handle invalid truncate direction (#858 ) * refacto: TruncateDirection -> TruncationDirection * feat(node): invalid direction will throw * feat(python): invalid direction will throw * Update bindings/node/lib/bindings/raw-encoding.test.ts * Update bindings/python/tests/bindings/test_encoding.py Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2021-12-27 14:31:57 +01:00
Nicolas Patry	943f4ef469	Preparing for 0.11.0 Re-release. (#856 ) * Starting from master again. Upgrade libssl everywhere on quay Extra is ubuntu based (running the quay in a container). making only extra run + attempt to fix ssl update. Extra with newer openssl versions. `-y`. Use checkoint@v2 + remove `-` from environment name. Debugging back the conda release.. Attempt to use `base` env. 3.7 requires `activate-environement: true. MacOS and windows don't run on manylinux. Remove yum on windows/macOs. Miniconda doesn't like manylinux2014 anymore ? Attempting different approach for manylinux + conda. Use wget. Extra bracet. Executing $filename Activate the env. Activate the env on eevery step that requires it. Openssl-devel. Activating env for extracting version ? Retest all workflows. Manylinux2010 requires checkout@v1 Run on tag for extra and conda again. openssl-devel. * Putting back into deploy state. * Adding links in CHANGELOG. * Remove clippy from changelog.	2021-12-23 16:43:48 +01:00
Luc Georges	04368b1998	Truncate Right (#841 ) * feat(tokenizers): add truncate test case * !feat(tokenizer): truncate right * refacto(tokenizers): clippy * feat(bindings): update bindings for truncate() * fix(tokenizers): remove unsafe code * refacto(tokenizers): truncate direction * truncate direction enum * compute parts ranges beforehand * 2n space because encoding is dropped at the end of procedure * update bindings * add pip install in python bindings' make test * fix(node): clippy asks to use unwrap_or_else * fix(node): lint * refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)> * refacto(bindings): add match syntax * refacto(tokenizers): use mem::replace instead of mem::swap * refacto(tokenizers): assign value the normal way	2021-12-23 13:34:21 +01:00
Nicolas Patry	c1100ec542	Clippy fixes. (#846 ) * Clippy fixes. * Drop support for Python 3.6 * Remove other 3.6 * Re-enabling caches for build (5h + seems too long and issue seems solved) https://github.com/actions/virtual-environments/issues/572 * `npm audit fix`. * Fix yaml ? * Pyarrow issue fixed: https://github.com/huggingface/datasets/pull/2268 * Installing dev libraries. * Install python dev elsewhere ? * Typo. * No sudo. * ... * Testing the GH again. * Maybe v2 will fix ? * Fixing tests on MacOS Python 3.8+	2021-12-15 15:55:48 +01:00
Anthony MOI	1dc19e0dd4	Fix Python README example	2021-10-07 16:56:48 +02:00
Anthony MOI	b0ee27847f	Python - Prepare for release 0.11.0 (#799 )	2021-09-08 03:15:47 -04:00
Anthony MOI	b8b584d4e5	Python - Pretty json saving defaults to true (#793 ) * Python - Pretty json saving defaults to true * Update changelog	2021-09-02 08:43:54 -04:00
Anthony Moi	e68aecc442	Python - Update Cargo.lock	2021-09-02 14:04:35 +02:00
Anthony Moi	35c96e5e3f	Add tests for from_pretrained	2021-08-31 09:00:05 -04:00
Anthony Moi	ad7090a5c7	Improve READMEs for from_pretrained	2021-08-31 09:00:05 -04:00
Anthony Moi	a4d0f3dd18	Update docs for from_pretrained	2021-08-31 09:00:05 -04:00
Anthony Moi	6f9e867330	Better export for FromPretrainedParameters	2021-08-31 09:00:05 -04:00
Anthony Moi	e44fdee4a1	Python - Add bindings to Tokenizer.from_pretrained	2021-08-31 09:00:05 -04:00
Geoffrey Thomas	5982498195	Switch git dependencies in Cargo.toml back to regular versions (#728 ) * Switch git dependencies in Cargo.toml back to regular versions rayon-cond turned out to be a rustc bug that has been fixed for a while (see cuviper/rayon-cond#2), so we can revert the git dependency. numpy has released the commit in question as part of 0.12. * Also update Cargo.lock files Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>	2021-08-13 09:32:00 -04:00
Vlad Artamonov	e2bf8daa3a	Add SplitDelimiterBehavior to Punctuation constructor (#657 ) Resolves: #642	2021-08-13 09:19:23 -04:00
kingyiusuen	c1100dcbe3	Fix typo in documentation (#743 ) * Doc - Fix typo (And instance of -> An instance of) * Add missing text_signature for WordLevel.from_file Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>	2021-08-13 08:08:23 -04:00
Sylvain Gugger	6616e699f7	Expand documentation of UnigramTrainer (#770 ) * Expand documentation of UnigramTrainer * Put doc at the source * Add signature * make style Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>	2021-08-12 10:12:26 -04:00
SaulLu	da4c7b10e4	Add a way to specify the unknown token in `SentencePieceUnigramTokenizer` python implem (#762 ) * add a way to specify the unknown token in `SentencePieceUnigramTokenizer` * add test that verify that an exception is raised for the missing unknown token * style * add test tokens	2021-08-12 09:42:44 -04:00
Nicolas Patry	256a71c1f2	Clippy 1.54. (#773 )	2021-08-11 14:43:49 +02:00
Nicolas Patry	d83772d62c	Fixing tokenizers with 1.53 (updated some dependencies + clippy) (#764 )	2021-07-21 09:58:38 +02:00
Anthony MOI	755e5f5c1e	Remove support for Python 3.5 (#714 ) * Python - remove support for python 3.5 * revert ci * revert build-wheels.sh * Update CHANGELOG.md	2021-05-24 17:31:01 -04:00
Anthony MOI	3a002c1aa8	Python - prepare for release 0.10.3	2021-05-24 16:59:10 -04:00

655 Commits