tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-22 16:25:30 +00:00

Author	SHA1	Message	Date
Anthony MOI	f4df7f5e2a	Update Tokenizer with NormalizedString & Encoding	2019-12-28 15:28:44 -05:00
Anthony MOI	4afcb1ef96	PreTokenizers handle offsets	2019-12-28 15:28:21 -05:00
Anthony MOI	8c40c89836	Encoding uses NormalizedString	2019-12-28 15:25:50 -05:00
Anthony MOI	162829b7a9	Introduce NormalizedString	2019-12-28 15:24:09 -05:00
Anthony MOI	96ef467bbf	Use forked unicode-normalization	2019-12-28 15:22:52 -05:00
Anthony MOI	a4beecf944	WordPiece handles offsets	2019-12-28 15:22:03 -05:00
Anthony MOI	5682627223	BPE handles offsets	2019-12-28 15:21:50 -05:00
Anthony MOI	5d9848ad6c	Models handles offsets	2019-12-28 15:21:29 -05:00
Anthony MOI	839239d3b4	Bump version	2019-12-27 10:43:34 -05:00
Anthony MOI	bddf7ba737	Python - Fix building from wheels	2019-12-27 10:39:19 -05:00
Anthony MOI	ffd28ba558	Bump for release	2019-12-26 14:56:13 -05:00
Anthony MOI	74cc6f6bde	Python - Simplify padding interface	2019-12-26 14:34:13 -05:00
Anthony MOI	d1e59e09bf	Fix a bug when adding special tokens. If we add special tokens that are part of the vocabulary of the model, the tokens aren't added to the tokenizer, which then builds an empty regex. This completely breaks the tokenization.	2019-12-26 14:32:50 -05:00
Anthony MOI	d93d4fc3cd	Python - Simplify truncation interface	2019-12-26 10:35:20 -05:00
Anthony MOI	a7734ffc9f	Python - Update doc and readme for add_prefix_space	2019-12-26 10:34:53 -05:00
Anthony MOI	1879cb0bcb	Python - change with_added_tokens as kwarg	2019-12-25 22:22:35 -05:00
Anthony MOI	905c1eb77e	Python - update some packages	2019-12-25 22:16:43 -05:00
Anthony MOI	597031b973	Python - remove unused variable	2019-12-25 22:16:11 -05:00
Anthony MOI	9d289d357d	Python - change add_prefix_space as kwarg	2019-12-25 22:10:17 -05:00
Anthony MOI	4bc5a7bbe7	Python - fix example	2019-12-24 11:20:40 -05:00
Anthony MOI	cf0e8917cd	Fix whitespace handling in ByteLevel	2019-12-24 11:20:26 -05:00
Evan Pete Walsh	9f1421a04b	remove Cargo.lock (#7)	2019-12-23 21:22:42 -08:00
epwalsh	c0ed873c4d	simplify initialization of BpeTrainer	2019-12-23 20:13:48 -05:00
Anthony MOI	fab1d4cabc	Bump version for release	2019-12-23 17:28:38 -05:00
Anthony MOI	e01d4f2052	Python - Remove misleading __repr__	2019-12-23 17:27:59 -05:00
Anthony MOI	2159123d7c	Fix truncate	2019-12-23 17:27:43 -05:00
MOI Anthony	8fb94be3d0	Merge pull request #6 from huggingface/BPE-tests: Add BPE tests and documentation	2019-12-20 15:34:38 -05:00
Evan Pete Walsh	9a91016877	Merge branch 'master' into BPE-tests	2019-12-20 08:55:41 -08:00
Anthony MOI	2266960ef7	Bump version and update Readme	2019-12-20 10:26:40 -05:00
Anthony MOI	f2b9c30ad9	Handle vocab size with added tokens	2019-12-19 20:19:56 -05:00
Anthony MOI	b7040e0412	Option to skip special tokens while decoding	2019-12-19 20:03:02 -05:00
Anthony MOI	a8d68d516d	Handle special tokens	2019-12-19 19:48:16 -05:00
Anthony MOI	7f032b62df	Include the added tokens while converting tokens and ids	2019-12-19 18:32:37 -05:00
Anthony MOI	076ba297fb	Cannot add new tokens that already exist in the vocab	2019-12-19 18:32:03 -05:00
epwalsh	6d51e7a393	add example / doc test for BPE trainer	2019-12-19 15:28:58 -08:00
epwalsh	69212e17e9	formatting	2019-12-19 15:07:27 -08:00
epwalsh	a16daa78f1	add test for word merge	2019-12-19 14:45:38 -08:00
epwalsh	184b09e3ac	add more tests	2019-12-18 17:40:13 -08:00
epwalsh	1dc0debe36	add initial test	2019-12-18 16:45:11 -08:00
Anthony MOI	9763282d59	Bump version for release	2019-12-17 18:42:34 -05:00
Anthony MOI	4d14b08afe	ByteLevel handles prefix spaces	2019-12-17 18:35:40 -05:00
Anthony MOI	6766585965	Python - Do not expose non working features of Encoding	2019-12-17 17:43:42 -05:00
Anthony MOI	0a3d4a86a9	Python - Update bindings for BertPreTokenizer	2019-12-17 17:40:56 -05:00
Anthony MOI	e54eee7657	BasicPreTokenizer => BertPreTokenizer	2019-12-17 17:37:13 -05:00
Anthony MOI	1b66d87fd3	BasicPreTokenizer handles do_basic_tokenize for Bert	2019-12-17 17:35:26 -05:00
Anthony MOI	3f95248d6d	Python - Truncation & padding bindings	2019-12-17 17:24:53 -05:00
Anthony MOI	5729d3656a	Tokenizer handles Truncation and Padding	2019-12-17 15:15:58 -05:00
Anthony MOI	4c51399b00	An Encoding can be padded	2019-12-17 14:23:37 -05:00
Anthony MOI	08eb163415	Bump version for release	2019-12-16 19:38:33 -05:00
Anthony MOI	d80f752ec9	Python - Add some missing Encoding bindings	2019-12-16 19:38:18 -05:00

1870 Commits