tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-31 12:39:21 +00:00

Author	SHA1	Message	Date
Anthony MOI	57200144ca	Python - Fix ByteLevel instantiation from state (#621)	2021-02-04 10:16:05 -05:00
Anthony MOI	a8f756494e	Improve Model serialization/deserialization (#620)	2021-02-04 09:59:18 -05:00
Anthony MOI	6a29dbc070	Doc - Hotfix training from iterators tutorial	2021-02-03 15:50:09 -05:00
Anthony MOI	db22cb6315	Python - Fix Normalizer.normalize with PyNormalizedStringRefMut	2021-02-03 15:48:53 -05:00
Anthony MOI	355315e8d3	Rust - Fix offsets produced by Precompiled Normalizer	2021-02-03 15:46:45 -05:00
Anthony MOI	96b9972842	Fix SentencePiece tokenizers conversion	2021-02-03 12:44:46 -05:00
Anthony MOI	719bea76b9	Python - Prepare for release 0.10.0	2021-01-12 16:34:04 -05:00
devfon	b9c6bea75e	Add `fuse_unk` option to SentencePieceBPETokenizer (#574) * Add fuse_unk option to SentencePieceBPETokenizer * Fix style Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2021-01-12 16:07:59 -05:00
Anthony MOI	91dae1de15	Doc - Add documentation for training from iterators	2021-01-12 15:51:38 -05:00
Anthony MOI	cca5d43038	Python - Fix breaking change in Model.save	2021-01-11 16:09:19 -05:00
Anthony MOI	49d11b1f69	Python - Add components getter/setters to BaseTokenizer	2021-01-11 16:08:38 -05:00
Anthony MOI	d94fa220b6	Python - Add train_from_iterator to implementations	2021-01-07 09:02:20 -05:00
Anthony MOI	817c5ad317	Fix clippy warnings for rust 1.49	2021-01-06 15:03:33 -05:00
Anthony MOI	5938a12b3f	Python - Improve training with iterators	2021-01-06 11:38:43 -05:00
Anthony MOI	0c6cc39eee	Python - Update CHANGELOG and bump for release	2020-12-08 13:29:35 -05:00
Tal Perry	8916b6bb27	Add a visualization utility to render tokens and annotations in a notebook (#508) * Draft functionality of visualization * Added comments to make code more intelligble * polish the styles * Ensure colors are stable and comment the css * Code clean up * Made visualizer importable and added some docs * Fix styling * implement comments from PR * Fixed the regex for UNK tokens and examples in notebook * Converted docs to google format * Added a notebook showing multiple languages and tokenizers * Added visual indication of chars that are tokenized with >1 token * Reorganize things a bit and fix import * Update docs Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-12-04 10:25:56 -05:00
Anthony MOI	5549fc4837	Python - Update CHANGELOG	2020-11-28 12:42:37 -05:00
Anthony MOI	3a8627ce4d	Improve docs and fix tests around training	2020-11-28 12:29:35 -05:00
Anthony MOI	999067454d	Make sure we first try to extract a string	2020-11-28 12:29:35 -05:00
Anthony MOI	ed9baeabb7	Add example for training with datasets	2020-11-28 12:29:35 -05:00
Anthony MOI	c36ac0bfdf	Improve progress tracking while training	2020-11-28 12:29:35 -05:00
Anthony MOI	75deaecdd0	Also accept iterators of batches in train_from_iterator	2020-11-28 12:29:35 -05:00
Anthony MOI	e0a70f1fb2	Add ability to train from Iterator	2020-11-28 12:29:35 -05:00
Anthony MOI	6e364cb685	Python - Update CHANGELOG and stub files	2020-11-27 17:35:34 -05:00
Anthony MOI	a351d1c604	Python - Trainers can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	3eb7ef6d0a	Python - PreTokenizers can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	5c35fafc44	Python - Decoders can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	091287dcf5	Python - Use macro for getter/setter in models	2020-11-27 17:35:34 -05:00
Anthony MOI	2feccdbbfa	Python - PyStrip can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	7512d5e4ce	Python - PyBertNormalizer can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	78beae8b7d	Python - PyWordLevel can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	760537aad3	Python - PyWordPiece can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	c22cfc31f9	Python - PyNormalizer & PyPreTokenizer use a RwLock	2020-11-27 17:35:34 -05:00
Anthony MOI	76d3b2128b	Python - PyBPE can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	7f3cfebf45	Python - PyModel uses a RwLock to allow modifications	2020-11-27 17:35:34 -05:00
Patrick von Platen	dd399d2ad0	Split Pre-Tokenizer (#542) * start playing around * make a first version * refactor * apply make format * add python bindings * add some python binding tests * correct pre-tokenizers * update auto-generated bindings * lint python bindings * add code node * add split to docs * refactor python binding a bit * cargo fmt * clippy and fmt in node * quick updates and fixes * Oops * Update node typings * Update changelog Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-27 17:07:03 -05:00
Anthony MOI	58e1d8de67	Python - Improve documentation for trainers	2020-11-23 11:52:51 -05:00
Anthony MOI	64441b54b1	Python - Improve documentation for post-processors	2020-11-23 11:52:51 -05:00
Anthony MOI	933a2a9c99	Python - Improve pre-tokenizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	5842b3db73	Python - Improve normalizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	c01c301743	Python - Improve documentation for decoders and remove useless kwargs	2020-11-23 11:52:51 -05:00
Anthony MOI	a50d4b7d25	Python - Improve documentation for models	2020-11-23 11:52:51 -05:00
Nick	dc60d4fc0c	Fix BaseTokenizer enable_truncation docstring	2020-11-23 11:28:26 -05:00
Anthony MOI	2fbd6779f6	Make sure TrainerWrapper can only train the right Model	2020-11-20 13:30:44 -05:00
Anthony MOI	13e07da2c8	Node - Add WordLevelTrainer	2020-11-20 13:30:44 -05:00
Anthony MOI	387b8a1033	Generate pyi, fix tests and clippy warnings	2020-11-20 13:30:44 -05:00
Anthony MOI	5059be1a8d	Test BPE keeping its options after training	2020-11-20 13:30:44 -05:00
Anthony MOI	284a1dbee7	PyModel uses a RwLock to allow modifications	2020-11-20 13:30:44 -05:00
Anthony MOI	54c7210b2f	Train Model in place This let us keep everything that was set on the model except from the vocabulary when trained. For example, this let us keep the configured `unk_token` of BPE when its trained.	2020-11-20 13:30:44 -05:00
Anthony MOI	224862fe0c	Python - Make the trainer optional on Tokenizer.train	2020-11-20 13:30:44 -05:00


591 Commits