591 Commits

Author SHA1 Message Date
57200144ca Python - Fix ByteLevel instantiation from state (#621) 2021-02-04 10:16:05 -05:00
a8f756494e Improve Model serialization/deserialization (#620) 2021-02-04 09:59:18 -05:00
6a29dbc070 Doc - Hotfix training from iterators tutorial 2021-02-03 15:50:09 -05:00
db22cb6315 Python - Fix Normalizer.normalize with PyNormalizedStringRefMut 2021-02-03 15:48:53 -05:00
355315e8d3 Rust - Fix offsets produced by Precompiled Normalizer 2021-02-03 15:46:45 -05:00
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
719bea76b9 Python - Prepare for release 0.10.0 2021-01-12 16:34:04 -05:00
b9c6bea75e Add fuse_unk option to SentencePieceBPETokenizer (#574)
* Add fuse_unk option to SentencePieceBPETokenizer

* Fix style

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-01-12 16:07:59 -05:00
91dae1de15 Doc - Add documentation for training from iterators 2021-01-12 15:51:38 -05:00
cca5d43038 Python - Fix breaking change in Model.save 2021-01-11 16:09:19 -05:00
49d11b1f69 Python - Add components getter/setters to BaseTokenizer 2021-01-11 16:08:38 -05:00
d94fa220b6 Python - Add train_from_iterator to implementations 2021-01-07 09:02:20 -05:00
817c5ad317 Fix clippy warnings for rust 1.49 2021-01-06 15:03:33 -05:00
5938a12b3f Python - Improve training with iterators 2021-01-06 11:38:43 -05:00
0c6cc39eee Python - Update CHANGELOG and bump for release 2020-12-08 13:29:35 -05:00
8916b6bb27 Add a visualization utility to render tokens and annotations in a notebook (#508)
* Draft functionality of visualization

* Added comments to make code more intelligible

* polish the styles

* Ensure colors are stable and comment the css

* Code clean up

* Made visualizer importable and added some docs

* Fix styling

* implement comments from PR

* Fixed the regex for UNK tokens and examples in notebook

* Converted docs to google format

* Added a notebook showing multiple languages and tokenizers

* Added visual indication of chars that are tokenized with >1 token

* Reorganize things a bit and fix import

* Update docs

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-12-04 10:25:56 -05:00
5549fc4837 Python - Update CHANGELOG 2020-11-28 12:42:37 -05:00
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
999067454d Make sure we first try to extract a string 2020-11-28 12:29:35 -05:00
ed9baeabb7 Add example for training with datasets 2020-11-28 12:29:35 -05:00
c36ac0bfdf Improve progress tracking while training 2020-11-28 12:29:35 -05:00
75deaecdd0 Also accept iterators of batches in train_from_iterator 2020-11-28 12:29:35 -05:00
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
6e364cb685 Python - Update CHANGELOG and stub files 2020-11-27 17:35:34 -05:00
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
3eb7ef6d0a Python - PreTokenizers can get/set their attributes 2020-11-27 17:35:34 -05:00
5c35fafc44 Python - Decoders can get/set their attributes 2020-11-27 17:35:34 -05:00
091287dcf5 Python - Use macro for getter/setter in models 2020-11-27 17:35:34 -05:00
2feccdbbfa Python - PyStrip can get/set its attributes 2020-11-27 17:35:34 -05:00
7512d5e4ce Python - PyBertNormalizer can get/set its attributes 2020-11-27 17:35:34 -05:00
78beae8b7d Python - PyWordLevel can get/set its attributes 2020-11-27 17:35:34 -05:00
760537aad3 Python - PyWordPiece can get/set its attributes 2020-11-27 17:35:34 -05:00
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
76d3b2128b Python - PyBPE can get/set its attributes 2020-11-27 17:35:34 -05:00
7f3cfebf45 Python - PyModel uses a RwLock to allow modifications 2020-11-27 17:35:34 -05:00
dd399d2ad0 Split Pre-Tokenizer (#542)
* start playing around

* make a first version

* refactor

* apply make format

* add python bindings

* add some python binding tests

* correct pre-tokenizers

* update auto-generated bindings

* lint python bindings

* add code node

* add split to docs

* refactor python binding a bit

* cargo fmt

* clippy and fmt in node

* quick updates and fixes

* Oops

* Update node typings

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
58e1d8de67 Python - Improve documentation for trainers 2020-11-23 11:52:51 -05:00
64441b54b1 Python - Improve documentation for post-processors 2020-11-23 11:52:51 -05:00
933a2a9c99 Python - Improve pre-tokenizers docs 2020-11-23 11:52:51 -05:00
5842b3db73 Python - Improve normalizers docs 2020-11-23 11:52:51 -05:00
c01c301743 Python - Improve documentation for decoders and remove useless kwargs 2020-11-23 11:52:51 -05:00
a50d4b7d25 Python - Improve documentation for models 2020-11-23 11:52:51 -05:00
dc60d4fc0c Fix BaseTokenizer enable_truncation docstring 2020-11-23 11:28:26 -05:00
2fbd6779f6 Make sure TrainerWrapper can only train the right Model 2020-11-20 13:30:44 -05:00
13e07da2c8 Node - Add WordLevelTrainer 2020-11-20 13:30:44 -05:00
387b8a1033 Generate pyi, fix tests and clippy warnings 2020-11-20 13:30:44 -05:00
5059be1a8d Test BPE keeping its options after training 2020-11-20 13:30:44 -05:00
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except for the vocabulary, when it is trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00