tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-09-03 15:59:25 +00:00

Author	SHA1	Message	Date
Nicolas Patry	152880ab3e	Adding truncation_side within `TruncationParams`. (#860) * Add truncation to enable_truncation * Fix typo * Adding truncation_side within `TruncationParams`. * Node serialization of this direction param. * Update the test. * Fixing warnings/lint. * Adding stuff (can't local debug :( ) * Slow loop... ;( * Stub.py. Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>	2021-12-28 12:37:06 +01:00
Luc Georges	04368b1998	Truncate Right (#841) * feat(tokenizers): add truncate test case * !feat(tokenizer): truncate right * refacto(tokenizers): clippy * feat(bindings): update bindings for truncate() * fix(tokenizers): remove unsafe code * refacto(tokenizers): truncate direction * truncate direction enum * compute parts ranges beforehand * 2n space because encoding is dropped at the end of procedure * update bindings * add pip install in python bindings' make test * fix(node): clippy asks to use unwrap_or_else * fix(node): lint * refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)> * refacto(bindings): add match syntax * refacto(tokenizers): use mem::replace instead of mem::swap * refacto(tokenizers): assign value the normal way	2021-12-23 13:34:21 +01:00
Anthony MOI	884bfb7970	Prepare node release (#794) * Node - Update changelog for release * Update node release to add v14 & v15 Co-authored-by: Huan (李卓桓) <zixia@zixia.net> * Node - Update version number * Node - Update dependencies * Node - Lint Co-authored-by: Huan (李卓桓) <zixia@zixia.net>	2021-09-02 09:58:01 -04:00
Anthony MOI	d3d9f2c76b	words -> word_ids & sequences -> sequence_ids	2020-11-09 16:02:07 -05:00
Anthony MOI	57d162b269	Add an Encoding.sequences to allow masking	2020-11-06 10:41:56 -05:00
Anthony MOI	385d25720a	Simplify the API for Encoding.token_to_XXX	2020-11-06 10:41:56 -05:00
Anthony MOI	a79cc55e08	Node - Encoding mappings handle sequence_id	2020-11-06 10:41:56 -05:00
Nicolas Patry	95cc8c47ad	Changed rust api for merges, that is now Vec<(String, String)>	2020-09-24 08:57:02 +02:00
Nicolas Patry	26cafe0d6c	Fixing eslint.	2020-09-10 14:00:53 -04:00
Anthony MOI	a16d71abd0	Node - Update bindings	2020-08-19 12:42:12 -04:00
Pierric Cistac	e9a2e63a67	Node - Fix new linting errors	2020-07-24 15:44:39 -04:00
Pierric Cistac	a03eba2fe9	Node - Typings proposal	2020-05-27 13:12:47 -04:00
Anthony MOI	b5247f41f1	Node - Update base tokenizer	2020-05-12 18:08:26 -04:00
Anthony MOI	4aecd82d07	Node - Improve mappings on Encoding	2020-04-16 14:23:37 -04:00
Pierric Cistac	38d53a7b84	Node - Expose more bindings	2020-04-13 16:48:32 -04:00
Anthony MOI	3ad1360210	Word indices are None for special tokens	2020-04-09 09:52:02 -04:00
Pierric Cistac	e9667a7b83	Node - `tokenizer.postProcess` bindings	2020-03-26 15:42:45 -04:00
Pierric Cistac	0408567f23	Node - Merge encodings	2020-03-26 15:42:45 -04:00
Pierric Cistac	70552812fe	Node - Bindings for tokenized encoding	2020-03-26 15:42:45 -04:00
Pierric Cistac	ce3cf78ea5	Node - Bindings for Encoding mappings	2020-03-26 15:42:45 -04:00
Pierric Cistac	7dd2400214	Node - Remove `addSpecialTokens` from `BertWordPieceTokenizer`	2020-03-26 15:10:08 -04:00
Pierric Cistac	d25eb075c8	Node - Finalize AddedToken support	2020-03-25 12:36:03 -04:00
Pierric Cistac	f53a885fdd	Node - Expand AddedToken supported use	2020-03-25 11:12:29 -04:00
Pierric Cistac	2aeae555e2	Node - Expose `normalize` on tokenizer	2020-03-18 17:10:26 -04:00
Pierric Cistac	25ef729a5a	Node - Update bindings	2020-03-18 15:13:29 -04:00
Pierric Cistac	3abf615a51	Node - Update bindings	2020-03-10 18:22:36 -04:00
Anthony MOI	523e173ddf	Merge pull request #188 from huggingface/fix-byte-level Fix byte level BPE offsets	2020-03-10 14:37:47 -04:00
Pierric Cistac	7764d3d770	Node - Fix bindings	2020-03-10 14:31:42 -04:00
Anthony MOI	45f3eaaf72	Update bindings and typings	2020-03-10 12:28:24 -04:00
Anthony MOI	efbbfea558	Update ByteLevel PostProcessor	2020-03-10 12:05:04 -04:00
Anthony MOI	aa62c951dc	Node - Update bindings	2020-03-09 22:45:33 -04:00
Pierric Cistac	4510ea5ce3	node: type errors	2020-03-06 18:01:11 -05:00
Pierric Cistac	a44eb2b5cd	node: update bytelevel bindings	2020-03-06 17:44:45 -05:00
Pierric Cistac	dae345cc6d	node: add `continuingSubwordPrefix` to wordpiece model	2020-03-06 16:30:36 -05:00
Pierric Cistac	578eddcdf9	node: expose `decode` / `decodeBatch` in `BaseTokenizer`	2020-03-06 16:27:27 -05:00
Pierric Cistac	ffcd5c63bf	node: make `BPE.fromfiles` async	2020-03-06 16:27:27 -05:00
Pierric Cistac	fe49512d37	node: make `WordPiece.fromFiles` async	2020-03-06 16:06:06 -05:00
Pierric Cistac	917996841d	node: "proxy" raw Encoding with getters	2020-02-26 18:15:16 -05:00
Pierric Cistac	f836f2109b	build ts	2020-01-14 15:20:34 -05:00
Anthony MOI	a95d0e6ba1	Node - Fix import	2020-01-10 16:11:44 -05:00
Pierric Cistac	24c08b2530	fix sentencepiece tokenizer name	2020-01-10 16:03:47 -05:00
Pierric Cistac	80f6d58177	big big big	2020-01-10 14:49:13 -05:00
Pierric Cistac	6b0935d5de	first implementations draft	2020-01-10 11:53:30 -05:00

43 Commits