tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-07 05:08:24 +00:00

Author	SHA1	Message	Date
Anthony MOI	a9be177185	Update CHANGELOGs	2020-03-10 13:12:34 -04:00
Anthony MOI	28f022058c	Keep default values as true	2020-03-10 12:58:53 -04:00
Anthony MOI	45f3eaaf72	Update bindings and typings	2020-03-10 12:28:24 -04:00
Anthony MOI	efbbfea558	Update ByteLevel PostProcessor	2020-03-10 12:05:04 -04:00
Anthony MOI	7e9003ccb7	Python - Update bindings	2020-03-09 18:37:03 -04:00
Anthony MOI	86d2e90ad2	Update CHANGELOGs	2020-03-06 17:44:44 -05:00
Anthony MOI	d778ed5e0a	Python - Update README and implementation	2020-03-06 17:44:44 -05:00
Anthony MOI	52180a9179	Python - Add ByteLevel PostProcessor	2020-03-06 17:44:44 -05:00
Anthony MOI	b60eef5245	Python - Make style	2020-03-06 17:44:44 -05:00
Anthony MOI	d8e7a830b2	Update CHANGELOGs	2020-03-06 17:44:34 -05:00
Anthony MOI	b2e5f54b6f	Python - Fix ByteLevelBPETokenizer implementation	2020-03-06 17:44:03 -05:00
Anthony MOI	f1460fadb9	Python - Update docs and implementations	2020-03-06 17:44:03 -05:00
Anthony MOI	2393506dc7	Python - Add ByteLevel Normalizer	2020-03-06 17:44:03 -05:00
Anthony MOI	47cef0e13a	Python - Fix BPE and WordPiece builders usage	2020-03-06 12:20:39 -05:00
Anthony MOI	4b596e19dd	Rust - Improve training progress for multiple files	2020-03-03 11:04:24 -05:00
Anthony MOI	8e791791d1	Python - prepare for release	2020-03-02 14:56:42 -05:00
Anthony MOI	4deeb9511f	Update CHANGELOGs	2020-03-02 14:37:17 -05:00
Anthony MOI	f8f0702d98	Fix LongestFirst truncation strategy	2020-02-29 16:26:13 -05:00
Anthony MOI	657f8b6c15	Rust & Python - Update CHANGELOGs	2020-02-26 11:30:44 -05:00
Anthony MOI	3b10d640d5	Rust & Python - Update CHANGELOGs	2020-02-26 10:51:40 -05:00
Anthony MOI	2425fe877d	Python - Update CHANGELOG	2020-02-26 09:31:17 -05:00
Anthony MOI	61b4c9c30a	Python - Add missing tokens to BertWordPieceTokenizer	2020-02-26 09:21:54 -05:00
Anthony MOI	440e8e9bd9	Python - Bump version for release	2020-02-24 16:08:49 -05:00
Anthony MOI	be08d9574c	Python - Add Changelog	2020-02-24 10:12:50 -05:00
Anthony MOI	999088ef94	Python - Bump version for release	2020-02-24 09:56:08 -05:00
Morgan Funtowicz	817b760ab9	Make name parameter Optional[str] on BaseTokenizer Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-22 14:57:43 +01:00
Morgan Funtowicz	d274a7691d	Avoid breaking changes and let parameter name be Optional. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-22 14:56:59 +01:00
Morgan Funtowicz	0fc8be9d69	Formatting for python binding. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-22 00:17:44 +01:00
Morgan Funtowicz	f88a6b40ac	Make parameter name on Model.save() optional. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-22 00:01:32 +01:00
Anthony MOI	11dd6c8bae	Python - Bump version for release	2020-02-18 18:49:11 -05:00
Anthony MOI	41929462c7	Python - Add classifiers	2020-02-18 18:48:21 -05:00
Anthony MOI	d8a73c89a7	Python - Add Encoding length	2020-02-18 18:24:13 -05:00
Anthony MOI	d48fdbe057	Python - Only add special tokens when in-vocabulary	2020-02-18 17:27:27 -05:00
Anthony MOI	5daf1eea86	Python - Replace last BPETokenizer occurrences	2020-02-18 16:25:59 -05:00
Anthony MOI	f263d7651f	Python - RustFmt	2020-02-18 15:07:34 -05:00
Anthony MOI	8e9fae6be4	Python - Add `check-style` to Makefile	2020-02-18 11:11:07 -05:00
Anthony MOI	81be207819	Python - Black auto formatting	2020-02-18 10:45:36 -05:00
Anthony MOI	4706151c32	Python - Add Makefile with Black formatting	2020-02-18 10:45:10 -05:00
Anthony MOI	1509f747af	Python - Uniformize implementations parameters	2020-02-18 10:27:10 -05:00
MOI Anthony	3512bd3400	Merge pull request #149 from colinclement/master Allow dropout option in ByteLevelBPETokenizer	2020-02-18 09:59:40 -05:00
Morgan Funtowicz	891dd4adb8	Fix invalid num_added_tokens method call in BaseTokenizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-17 15:32:34 +01:00
Funtowicz Morgan	bb8321ac0d	Add Strip normalizer (#140) * WIP strip. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Rust StripNormalizer Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Allow to specify strip direction Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Renamed StripNormalizer to Strip Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added Python binding. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Makes Strip python compatible with pythonic constructor. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Run RustFmt Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Clippy next ofc. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Move lstrip and rstrip on NormalizedString Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * implement strip() for normalizer + unittests. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Add some more unittests on edge cases. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * clippy and fmt. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Simplify strip and fix offsets * Python - Update strip bindings with default values Co-authored-by: MOI Anthony <xn1t0x@gmail.com>	2020-02-17 11:26:40 +01:00
Colin Clement	e591cfce7b	pass through dropout option in ByteLevelBPETokenizer	2020-02-15 01:58:55 +00:00
MOI Anthony	3cac26cdb2	Merge pull request #147 from huggingface/wordpiece-cleanup Wordpiece Decoder cleanup	2020-02-14 13:12:15 -05:00
Funtowicz Morgan	c4bac6aeeb	Expose num_added_tokens on Python side (#146) * Expose num_added_tokens on Python side without the need to pass an Encoding to added_tokens. This allows to compute the max sentence length for single/pair inputs without actually the need to have an Encoding structure. As the number of added tokens is fixed and static during compilation it allows more flexible usage of the method. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Renamed num_added_tokens to num_special_tokens_to_add. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-14 10:55:20 +00:00
Anthony MOI	1907b74d1c	Python - Bindings for Wordpiece decoder's cleanup	2020-02-13 17:50:37 -05:00
Anthony MOI	5bd93ee822	Python - hotfix BertWordPieceTokenizer decoder	2020-02-13 16:31:00 -05:00
Anthony MOI	bbbd97c7e1	Python - Bump version for release	2020-02-11 08:15:11 -05:00
Anthony MOI	08ce105195	Python - Hotfix WordPieceTrainer constructor	2020-02-11 08:13:57 -05:00
Anthony MOI	c1ddfdac8c	Python - bump version for release	2020-02-10 23:23:27 -05:00

300 Commits