57200144ca
Python - Fix ByteLevel instantiation from state (#621)
2021-02-04 10:16:05 -05:00
a8f756494e
Improve Model serialization/deserialization (#620)
2021-02-04 09:59:18 -05:00
6a29dbc070
Doc - Hotfix training from iterators tutorial
2021-02-03 15:50:09 -05:00
db22cb6315
Python - Fix Normalizer.normalize with PyNormalizedStringRefMut
2021-02-03 15:48:53 -05:00
355315e8d3
Rust - Fix offsets produced by Precompiled Normalizer
2021-02-03 15:46:45 -05:00
96b9972842
Fix SentencePiece tokenizers conversion
2021-02-03 12:44:46 -05:00
719bea76b9
Python - Prepare for release 0.10.0
2021-01-12 16:34:04 -05:00
b9c6bea75e
Add fuse_unk option to SentencePieceBPETokenizer (#574)
...
* Add fuse_unk option to SentencePieceBPETokenizer
* Fix style
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-01-12 16:07:59 -05:00
91dae1de15
Doc - Add documentation for training from iterators
2021-01-12 15:51:38 -05:00
cca5d43038
Python - Fix breaking change in Model.save
2021-01-11 16:09:19 -05:00
49d11b1f69
Python - Add components getter/setters to BaseTokenizer
2021-01-11 16:08:38 -05:00
d94fa220b6
Python - Add train_from_iterator to implementations
2021-01-07 09:02:20 -05:00
817c5ad317
Fix clippy warnings for rust 1.49
2021-01-06 15:03:33 -05:00
5938a12b3f
Python - Improve training with iterators
2021-01-06 11:38:43 -05:00
0c6cc39eee
Python - Update CHANGELOG and bump for release
2020-12-08 13:29:35 -05:00
8916b6bb27
Add a visualization utility to render tokens and annotations in a notebook (#508)
...
* Draft functionality of visualization
* Added comments to make code more intelligible
* polish the styles
* Ensure colors are stable and comment the css
* Code clean up
* Made visualizer importable and added some docs
* Fix styling
* implement comments from PR
* Fixed the regex for UNK tokens and examples in notebook
* Converted docs to google format
* Added a notebook showing multiple languages and tokenizers
* Added visual indication of chars that are tokenized with >1 token
* Reorganize things a bit and fix import
* Update docs
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-12-04 10:25:56 -05:00
5549fc4837
Python - Update CHANGELOG
2020-11-28 12:42:37 -05:00
3a8627ce4d
Improve docs and fix tests around training
2020-11-28 12:29:35 -05:00
999067454d
Make sure we first try to extract a string
2020-11-28 12:29:35 -05:00
ed9baeabb7
Add example for training with datasets
2020-11-28 12:29:35 -05:00
c36ac0bfdf
Improve progress tracking while training
2020-11-28 12:29:35 -05:00
75deaecdd0
Also accept iterators of batches in train_from_iterator
2020-11-28 12:29:35 -05:00
e0a70f1fb2
Add ability to train from Iterator
2020-11-28 12:29:35 -05:00
6e364cb685
Python - Update CHANGELOG and stub files
2020-11-27 17:35:34 -05:00
a351d1c604
Python - Trainers can get/set their attributes
2020-11-27 17:35:34 -05:00
3eb7ef6d0a
Python - PreTokenizers can get/set their attributes
2020-11-27 17:35:34 -05:00
5c35fafc44
Python - Decoders can get/set their attributes
2020-11-27 17:35:34 -05:00
091287dcf5
Python - Use macro for getter/setter in models
2020-11-27 17:35:34 -05:00
2feccdbbfa
Python - PyStrip can get/set its attributes
2020-11-27 17:35:34 -05:00
7512d5e4ce
Python - PyBertNormalizer can get/set its attributes
2020-11-27 17:35:34 -05:00
78beae8b7d
Python - PyWordLevel can get/set its attributes
2020-11-27 17:35:34 -05:00
760537aad3
Python - PyWordPiece can get/set its attributes
2020-11-27 17:35:34 -05:00
c22cfc31f9
Python - PyNormalizer & PyPreTokenizer use a RwLock
2020-11-27 17:35:34 -05:00
76d3b2128b
Python - PyBPE can get/set its attributes
2020-11-27 17:35:34 -05:00
7f3cfebf45
Python - PyModel uses a RwLock to allow modifications
2020-11-27 17:35:34 -05:00
dd399d2ad0
Split Pre-Tokenizer (#542)
...
* start playing around
* make a first version
* refactor
* apply make format
* add python bindings
* add some python binding tests
* correct pre-tokenizers
* update auto-generated bindings
* lint python bindings
* add code node
* add split to docs
* refactor python binding a bit
* cargo fmt
* clippy and fmt in node
* quick updates and fixes
* Oops
* Update node typings
* Update changelog
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
58e1d8de67
Python - Improve documentation for trainers
2020-11-23 11:52:51 -05:00
64441b54b1
Python - Improve documentation for post-processors
2020-11-23 11:52:51 -05:00
933a2a9c99
Python - Improve pre-tokenizers docs
2020-11-23 11:52:51 -05:00
5842b3db73
Python - Improve normalizers docs
2020-11-23 11:52:51 -05:00
c01c301743
Python - Improve documentation for decoders and remove useless kwargs
2020-11-23 11:52:51 -05:00
a50d4b7d25
Python - Improve documentation for models
2020-11-23 11:52:51 -05:00
dc60d4fc0c
Fix BaseTokenizer enable_truncation docstring
2020-11-23 11:28:26 -05:00
2fbd6779f6
Make sure TrainerWrapper can only train the right Model
2020-11-20 13:30:44 -05:00
13e07da2c8
Node - Add WordLevelTrainer
2020-11-20 13:30:44 -05:00
387b8a1033
Generate pyi, fix tests and clippy warnings
2020-11-20 13:30:44 -05:00
5059be1a8d
Test BPE keeping its options after training
2020-11-20 13:30:44 -05:00
284a1dbee7
PyModel uses a RwLock to allow modifications
2020-11-20 13:30:44 -05:00
54c7210b2f
Train Model in place
...
This lets us keep everything that was set on the model, except the vocabulary, when it is trained. For example, it lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
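The idea behind "Train Model in place" can be pictured with a small toy, shown here only as a sketch. `ToyWordLevel` and `train_in_place` are hypothetical names invented for illustration and are not part of the tokenizers API. The assumption is simply that training rewrites the vocabulary on the existing object instead of building a fresh model, so options such as `unk_token` survive.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class ToyWordLevel:
    """Hypothetical stand-in for a model; not the real tokenizers class."""
    unk_token: str = "[UNK]"
    vocab: Dict[str, int] = field(default_factory=dict)


def train_in_place(model: ToyWordLevel, lines: Iterable[str]) -> None:
    """Replace only the vocabulary; every other option is left untouched."""
    counts = Counter(word for line in lines for word in line.split())
    # Id 0 goes to whatever unknown token the caller configured.
    words = [model.unk_token]
    words += [w for w, _ in counts.most_common() if w != model.unk_token]
    model.vocab = {word: idx for idx, word in enumerate(words)}


model = ToyWordLevel(unk_token="<unk>")
train_in_place(model, ["a b a", "b c a"])
print(model.unk_token, model.vocab)
```

Because the same object is mutated, the caller does not need to re-apply its configuration after training.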
224862fe0c
Python - Make the trainer optional on Tokenizer.train
2020-11-20 13:30:44 -05:00