tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-22 16:25:30 +00:00

Author	SHA1	Message	Date
Anthony MOI	d3c3f5a700	Python - Expose ByteLevel alphabet	2020-01-02 18:06:06 -05:00
Anthony MOI	f0f9aefd07	ByteLevel exposes its alphabet	2020-01-02 17:51:26 -05:00
Anthony MOI	7b12b3cca5	BpeTrainer handles initial alphabet	2020-01-02 15:01:22 -05:00
Anthony MOI	c8a5d2e32a	NormalizedString - Fix removal around edges	2020-01-02 14:17:14 -05:00
Anthony MOI	66b6211705	NormalizedString - Fix added chars at beginning	2020-01-02 14:17:14 -05:00
epwalsh	894ea1f8f0	utilize ::new() in ::default()	2020-01-02 10:56:41 -08:00
Evan Pete Walsh	8ae0f2efdb	set capacity on BPE cache, change Mutex to RwLock, create BpeBuilder (#24) * set capacity on BPE cache, create BpeBuilder * add doc comment * switch from Mutex to RwLock * vocab_and_merges	2020-01-02 09:26:50 -08:00
Evan Pete Walsh	e3cf6a7b00	refactor benchmarks (#25) * refactor benchmarks * fix * fix CI	2020-01-01 17:07:36 -08:00
epwalsh	138c48d92e	add benchmark on many batches	2020-01-01 16:20:19 -08:00
epwalsh	b09511f5cf	add better single threaded GPT2 benchmark	2020-01-01 15:48:53 -08:00
Anthony MOI	722b61230d	BPE handles UNK token	2020-01-01 14:49:03 -05:00
MOI Anthony	75713ce809	Merge pull request #23 from huggingface/cache Avoid creating unnecessary vectors when accessing cache	2020-01-01 14:47:28 -05:00
epwalsh	65471b4f2c	Merge branch 'master' into cache	2020-01-01 14:10:20 -05:00
epwalsh	9a10acc981	don't create unnecessary vectors when accessing cache	2020-01-01 14:06:31 -05:00
Anthony MOI	a5c5e5840f	Oops - Fix trainer	2020-01-01 13:36:42 -05:00
Anthony MOI	a7a5f9a67f	BpeTrainer handles special tokens and limiting alphabet	2020-01-01 12:54:58 -05:00
Evan Pete Walsh	ebf22198f3	Add benchmark framework and benches for BPE (GPT2) (#4) * add benchmarks * fix bench * refactor BPE benchmarks * fix * remove un-needed gitignore * update Cargo.lock * fix * small fix * improve benchmarks * move setup to Makefile * benchmark BPE encode batch * refactor batch benchmark	2020-01-01 07:35:57 -08:00
Anthony MOI	84c7a8623a	Remove all printed logs	2020-01-01 01:45:24 -05:00
Anthony MOI	47e4b00e05	BpeTrainer shows some progress	2020-01-01 01:28:17 -05:00
Anthony MOI	f3aef0e4e6	Fix BPE saving (u32 => String)	2019-12-31 23:15:10 -05:00
Anthony MOI	90dfdc715d	Expose Tokenizer parts	2019-12-31 22:57:47 -05:00
Anthony MOI	90df088054	Fix ByteLevel PreTokenizer. I broke it with my last changes. We cannot take a slice of a string by indexing on bytes obviously.	2019-12-31 15:09:51 -05:00
Anthony MOI	f28ca58fd9	[Fix #17] BPE & WordPiece models saving	2019-12-31 13:56:28 -05:00
MOI Anthony	2125e4d422	Merge pull request #21 from huggingface/dropout Implement dropout for BPE	2019-12-30 19:39:29 -05:00
epwalsh	b21a5496a7	no cache when dropout	2019-12-30 15:58:16 -08:00
epwalsh	a642807fde	fix clippy warnings	2019-12-30 14:23:32 -08:00
epwalsh	fdb8ffca27	fix comment	2019-12-30 14:18:08 -08:00
epwalsh	b28c3fd04c	add doc	2019-12-30 14:15:26 -08:00
epwalsh	0be9e5a7f0	implement dropout for BPE	2019-12-30 14:14:26 -08:00
MOI Anthony	5194daa0ce	Merge pull request #20 from huggingface/docs Clean up Rust docs	2019-12-30 14:17:14 -05:00
epwalsh	d163bbadae	remove redundant headers, other small cleanups	2019-12-30 10:46:56 -08:00
Anthony MOI	225a886382	Python - Expose Whitespace PreTokenizer	2019-12-30 13:10:33 -05:00
Anthony MOI	4677a09626	Python - Expose pad and truncate on Encoding	2019-12-30 12:56:07 -05:00
Anthony MOI	8ddb2de64e	Update unicode-normalization to published crate	2019-12-30 12:18:00 -05:00
MOI Anthony	f5327f977e	Merge pull request #19 from huggingface/handle-offsets Handle offsets	2019-12-30 10:46:30 -05:00
Anthony MOI	06d515d41b	Python - Add ability to retrieve a range of string	2019-12-29 01:37:03 -05:00
Anthony MOI	049029dc42	Python - Restore methods on Encoding	2019-12-29 01:26:42 -05:00
Anthony MOI	708a63514a	Add ability to retrieve ranges or NormalizedString	2019-12-29 01:22:16 -05:00
Anthony MOI	9c574ad1b7	Python - Fix some import warnings	2019-12-29 00:43:32 -05:00
Anthony MOI	3779bf3e19	Python - Update example	2019-12-29 00:38:37 -05:00
Anthony MOI	3dcf9f763c	Python - Update pre tokenizers with offsets	2019-12-29 00:37:58 -05:00
Anthony MOI	3f79d9d5e0	Python - Add normalizers bindings & BertNormalizer	2019-12-29 00:36:09 -05:00
Anthony MOI	81be029881	Fix - Handle errors during normalization	2019-12-29 00:24:01 -05:00
Anthony MOI	79b96dccd0	Fix lowercase/uppercase normalization. Since each character being lowercased or uppercased can actually generate one or more characters, we need to keep track of the offsets being updated in the process.	2019-12-29 00:19:49 -05:00
Anthony MOI	22ffa716a1	BertPreTokenizer pre tokenize only (with offsets)	2019-12-29 00:12:24 -05:00
Anthony MOI	cda9fae992	Add BertNormalizer with offsets tracking	2019-12-29 00:10:45 -05:00
Anthony MOI	ad9cc52d83	ByteLevel PreTokenizer handles offsets	2019-12-29 00:08:42 -05:00
Anthony MOI	35a8dfdd55	Whitespace PreTokenizer handles offsets	2019-12-28 15:50:42 -05:00
Anthony MOI	be00a1e45e	Improve clarity for BertProcessing	2019-12-28 15:45:51 -05:00
Anthony MOI	d7af007539	BertProcessor handles NormalizedString merging	2019-12-28 15:30:57 -05:00

1870 Commits