Commit Graph

591 Commits

Author SHA1 Message Date
Anthony MOI
57200144ca Python - Fix ByteLevel instantiation from state (#621) 2021-02-04 10:16:05 -05:00
Anthony MOI
a8f756494e Improve Model serialization/deserialization (#620) 2021-02-04 09:59:18 -05:00
Anthony MOI
6a29dbc070 Doc - Hotfix training from iterators tutorial 2021-02-03 15:50:09 -05:00
Anthony MOI
db22cb6315 Python - Fix Normalizer.normalize with PyNormalizedStringRefMut 2021-02-03 15:48:53 -05:00
Anthony MOI
355315e8d3 Rust - Fix offsets produced by Precompiled Normalizer 2021-02-03 15:46:45 -05:00
Anthony MOI
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
Anthony MOI
719bea76b9 Python - Prepare for release 0.10.0 2021-01-12 16:34:04 -05:00
devfon
b9c6bea75e Add fuse_unk option to SentencePieceBPETokenizer (#574)
* Add fuse_unk option to SentencePieceBPETokenizer

* Fix style

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-01-12 16:07:59 -05:00
Anthony MOI
91dae1de15 Doc - Add documentation for training from iterators 2021-01-12 15:51:38 -05:00
Anthony MOI
cca5d43038 Python - Fix breaking change in Model.save 2021-01-11 16:09:19 -05:00
Anthony MOI
49d11b1f69 Python - Add components getter/setters to BaseTokenizer 2021-01-11 16:08:38 -05:00
Anthony MOI
d94fa220b6 Python - Add train_from_iterator to implementations 2021-01-07 09:02:20 -05:00
Anthony MOI
817c5ad317 Fix clippy warnings for rust 1.49 2021-01-06 15:03:33 -05:00
Anthony MOI
5938a12b3f Python - Improve training with iterators 2021-01-06 11:38:43 -05:00
Anthony MOI
0c6cc39eee Python - Update CHANGELOG and bump for release 2020-12-08 13:29:35 -05:00
Tal Perry
8916b6bb27 Add a visualization utility to render tokens and annotations in a notebook (#508)
* Draft functionality of visualization

* Added comments to make code more intelligible

* polish the styles

* Ensure colors are stable and comment the css

* Code clean up

* Made visualizer importable and added some docs

* Fix styling

* implement comments from PR

* Fixed the regex for UNK tokens and examples in notebook

* Converted docs to google format

* Added a notebook showing multiple languages and tokenizers

* Added visual indication of chars that are tokenized with >1 token

* Reorganize things a bit and fix import

* Update docs

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-12-04 10:25:56 -05:00
Anthony MOI
5549fc4837 Python - Update CHANGELOG 2020-11-28 12:42:37 -05:00
Anthony MOI
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
Anthony MOI
999067454d Make sure we first try to extract a string 2020-11-28 12:29:35 -05:00
Anthony MOI
ed9baeabb7 Add example for training with datasets 2020-11-28 12:29:35 -05:00
Anthony MOI
c36ac0bfdf Improve progress tracking while training 2020-11-28 12:29:35 -05:00
Anthony MOI
75deaecdd0 Also accept iterators of batches in train_from_iterator 2020-11-28 12:29:35 -05:00
Anthony MOI
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
Anthony MOI
6e364cb685 Python - Update CHANGELOG and stub files 2020-11-27 17:35:34 -05:00
Anthony MOI
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
3eb7ef6d0a Python - PreTokenizers can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
5c35fafc44 Python - Decoders can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
091287dcf5 Python - Use macro for getter/setter in models 2020-11-27 17:35:34 -05:00
Anthony MOI
2feccdbbfa Python - PyStrip can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
7512d5e4ce Python - PyBertNormalizer can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
78beae8b7d Python - PyWordLevel can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
760537aad3 Python - PyWordPiece can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
Anthony MOI
76d3b2128b Python - PyBPE can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
7f3cfebf45 Python - PyModel uses a RwLock to allow modifications 2020-11-27 17:35:34 -05:00
Patrick von Platen
dd399d2ad0 Split Pre-Tokenizer (#542)
* start playing around

* make a first version

* refactor

* apply make format

* add python bindings

* add some python binding tests

* correct pre-tokenizers

* update auto-generated bindings

* lint python bindings

* add code node

* add split to docs

* refactor python binding a bit

* cargo fmt

* clippy and fmt in node

* quick updates and fixes

* Oops

* Update node typings

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
Anthony MOI
58e1d8de67 Python - Improve documentation for trainers 2020-11-23 11:52:51 -05:00
Anthony MOI
64441b54b1 Python - Improve documentation for post-processors 2020-11-23 11:52:51 -05:00
Anthony MOI
933a2a9c99 Python - Improve pre-tokenizers docs 2020-11-23 11:52:51 -05:00
Anthony MOI
5842b3db73 Python - Improve normalizers docs 2020-11-23 11:52:51 -05:00
Anthony MOI
c01c301743 Python - Improve documentation for decoders and remove useless kwargs 2020-11-23 11:52:51 -05:00
Anthony MOI
a50d4b7d25 Python - Improve documentation for models 2020-11-23 11:52:51 -05:00
Nick
dc60d4fc0c Fix BaseTokenizer enable_truncation docstring 2020-11-23 11:28:26 -05:00
Anthony MOI
2fbd6779f6 Make sure TrainerWrapper can only train the right Model 2020-11-20 13:30:44 -05:00
Anthony MOI
13e07da2c8 Node - Add WordLevelTrainer 2020-11-20 13:30:44 -05:00
Anthony MOI
387b8a1033 Generate pyi, fix tests and clippy warnings 2020-11-20 13:30:44 -05:00
Anthony MOI
5059be1a8d Test BPE keeping its options after training 2020-11-20 13:30:44 -05:00
Anthony MOI
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
Anthony MOI
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except the vocabulary, when it is trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
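The in-place training described in the commit above can be pictured with a toy sketch. The classes `ToyWordLevel` and `ToyTrainer` below are hypothetical and are not the tokenizers API. They only illustrate the idea of a trainer replacing a model's vocabulary while leaving a previously configured `unk_token` untouched.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToyWordLevel:
    """Hypothetical stand-in for a model; not the tokenizers API."""
    unk_token: str = "[UNK]"
    vocab: dict = field(default_factory=dict)

    def token_to_id(self, token):
        # Unknown tokens fall back to the id of the configured unk_token.
        return self.vocab.get(token, self.vocab.get(self.unk_token))


class ToyTrainer:
    """Hypothetical trainer that mutates the model it is given."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def train(self, model, lines):
        counts = Counter(word for line in lines for word in line.split())
        # Reserve id 0 for the model's own unk_token, then add frequent words.
        words = [w for w, _ in counts.most_common(self.vocab_size - 1)]
        # Only the vocabulary is replaced; other attributes stay as configured.
        model.vocab = {tok: i for i, tok in enumerate([model.unk_token] + words)}


model = ToyWordLevel(unk_token="<unk>")
ToyTrainer(vocab_size=3).train(model, ["a b a", "a c b"])
```

After training, `model.unk_token` is still the value set at construction, and words that did not make it into the vocabulary resolve to its id.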
Anthony MOI
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00