Anthony MOI
f5e9bb89b7
Fix offsets for Precompiled corner case
2021-03-16 15:04:42 -04:00
Anthony MOI
56a9196030
Fix clippy warnings
2021-03-16 12:32:06 -04:00
Anthony MOI
bc8bbf637a
Prepare for python v0.10.1 (#625)
2021-02-08 11:45:56 -05:00
Anthony MOI
d96442cbe8
Python - Prepare for release 0.10.1rc1 (#622)
2021-02-04 10:37:00 -05:00
Anthony MOI
57200144ca
Python - Fix ByteLevel instantiation from state (#621)
2021-02-04 10:16:05 -05:00
Anthony MOI
a8f756494e
Improve Model serialization/deserialization (#620)
2021-02-04 09:59:18 -05:00
Anthony MOI
6a29dbc070
Doc - Hotfix training from iterators tutorial
2021-02-03 15:50:09 -05:00
Anthony MOI
db22cb6315
Python - Fix Normalizer.normalize with PyNormalizedStringRefMut
2021-02-03 15:48:53 -05:00
Anthony MOI
355315e8d3
Rust - Fix offsets produced by Precompiled Normalizer
2021-02-03 15:46:45 -05:00
Anthony MOI
96b9972842
Fix SentencePiece tokenizers conversion
2021-02-03 12:44:46 -05:00
Anthony MOI
719bea76b9
Python - Prepare for release 0.10.0
2021-01-12 16:34:04 -05:00
devfon
b9c6bea75e
Add fuse_unk option to SentencePieceBPETokenizer (#574)
* Add fuse_unk option to SentencePieceBPETokenizer
* Fix style
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-01-12 16:07:59 -05:00
Anthony MOI
91dae1de15
Doc - Add documentation for training from iterators
2021-01-12 15:51:38 -05:00
Anthony MOI
cca5d43038
Python - Fix breaking change in Model.save
2021-01-11 16:09:19 -05:00
Anthony MOI
49d11b1f69
Python - Add components getter/setters to BaseTokenizer
2021-01-11 16:08:38 -05:00
Anthony MOI
d94fa220b6
Python - Add train_from_iterator to implementations
2021-01-07 09:02:20 -05:00
Anthony MOI
817c5ad317
Fix clippy warnings for rust 1.49
2021-01-06 15:03:33 -05:00
Anthony MOI
5938a12b3f
Python - Improve training with iterators
2021-01-06 11:38:43 -05:00
Anthony MOI
0c6cc39eee
Python - Update CHANGELOG and bump for release
2020-12-08 13:29:35 -05:00
Tal Perry
8916b6bb27
Add a visualization utility to render tokens and annotations in a notebook (#508)
* Draft functionality of visualization
* Added comments to make code more intelligible
* polish the styles
* Ensure colors are stable and comment the css
* Code clean up
* Made visualizer importable and added some docs
* Fix styling
* implement comments from PR
* Fixed the regex for UNK tokens and examples in notebook
* Converted docs to google format
* Added a notebook showing multiple languages and tokenizers
* Added visual indication of chars that are tokenized with >1 token
* Reorganize things a bit and fix import
* Update docs
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-12-04 10:25:56 -05:00
Anthony MOI
5549fc4837
Python - Update CHANGELOG
2020-11-28 12:42:37 -05:00
Anthony MOI
3a8627ce4d
Improve docs and fix tests around training
2020-11-28 12:29:35 -05:00
Anthony MOI
999067454d
Make sure we first try to extract a string
2020-11-28 12:29:35 -05:00
Anthony MOI
ed9baeabb7
Add example for training with datasets
2020-11-28 12:29:35 -05:00
Anthony MOI
c36ac0bfdf
Improve progress tracking while training
2020-11-28 12:29:35 -05:00
Anthony MOI
75deaecdd0
Also accept iterators of batches in train_from_iterator
2020-11-28 12:29:35 -05:00
Anthony MOI
e0a70f1fb2
Add ability to train from Iterator
2020-11-28 12:29:35 -05:00
Anthony MOI
6e364cb685
Python - Update CHANGELOG and stub files
2020-11-27 17:35:34 -05:00
Anthony MOI
a351d1c604
Python - Trainers can get/set their attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
3eb7ef6d0a
Python - PreTokenizers can get/set their attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
5c35fafc44
Python - Decoders can get/set their attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
091287dcf5
Python - Use macro for getter/setter in models
2020-11-27 17:35:34 -05:00
Anthony MOI
2feccdbbfa
Python - PyStrip can get/set its attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
7512d5e4ce
Python - PyBertNormalizer can get/set its attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
78beae8b7d
Python - PyWordLevel can get/set its attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
760537aad3
Python - PyWordPiece can get/set its attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
c22cfc31f9
Python - PyNormalizer & PyPreTokenizer use a RwLock
2020-11-27 17:35:34 -05:00
Anthony MOI
76d3b2128b
Python - PyBPE can get/set its attributes
2020-11-27 17:35:34 -05:00
Anthony MOI
7f3cfebf45
Python - PyModel uses a RwLock to allow modifications
2020-11-27 17:35:34 -05:00
Patrick von Platen
dd399d2ad0
Split Pre-Tokenizer (#542)
* start playing around
* make a first version
* refactor
* apply make format
* add python bindings
* add some python binding tests
* correct pre-tokenizers
* update auto-generated bindings
* lint python bindings
* add code node
* add split to docs
* refactor python binding a bit
* cargo fmt
* clippy and fmt in node
* quick updates and fixes
* Oops
* Update node typings
* Update changelog
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
Anthony MOI
58e1d8de67
Python - Improve documentation for trainers
2020-11-23 11:52:51 -05:00
Anthony MOI
64441b54b1
Python - Improve documentation for post-processors
2020-11-23 11:52:51 -05:00
Anthony MOI
933a2a9c99
Python - Improve pre-tokenizers docs
2020-11-23 11:52:51 -05:00
Anthony MOI
5842b3db73
Python - Improve normalizers docs
2020-11-23 11:52:51 -05:00
Anthony MOI
c01c301743
Python - Improve documentation for decoders and remove useless kwargs
2020-11-23 11:52:51 -05:00
Anthony MOI
a50d4b7d25
Python - Improve documentation for models
2020-11-23 11:52:51 -05:00
Nick
dc60d4fc0c
Fix BaseTokenizer enable_truncation docstring
2020-11-23 11:28:26 -05:00
Anthony MOI
2fbd6779f6
Make sure TrainerWrapper can only train the right Model
2020-11-20 13:30:44 -05:00
Anthony MOI
13e07da2c8
Node - Add WordLevelTrainer
2020-11-20 13:30:44 -05:00
Anthony MOI
387b8a1033
Generate pyi, fix tests and clippy warnings
2020-11-20 13:30:44 -05:00