Commit Graph

584 Commits

Author SHA1 Message Date
devfon
b9c6bea75e Add fuse_unk option to SentencePieceBPETokenizer (#574)
* Add fuse_unk option to SentencePieceBPETokenizer

* Fix style

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-01-12 16:07:59 -05:00
Anthony MOI
91dae1de15 Doc - Add documentation for training from iterators 2021-01-12 15:51:38 -05:00
Anthony MOI
cca5d43038 Python - Fix breaking change in Model.save 2021-01-11 16:09:19 -05:00
Anthony MOI
49d11b1f69 Python - Add components getter/setters to BaseTokenizer 2021-01-11 16:08:38 -05:00
Anthony MOI
d94fa220b6 Python - Add train_from_iterator to implementations 2021-01-07 09:02:20 -05:00
Anthony MOI
817c5ad317 Fix clippy warnings for rust 1.49 2021-01-06 15:03:33 -05:00
Anthony MOI
5938a12b3f Python - Improve training with iterators 2021-01-06 11:38:43 -05:00
Anthony MOI
0c6cc39eee Python - Update CHANGELOG and bump for release 2020-12-08 13:29:35 -05:00
Tal Perry
8916b6bb27 Add a visualization utility to render tokens and annotations in a notebook (#508)
* Draft functionality of visualization

* Added comments to make code more intelligible

* polish the styles

* Ensure colors are stable and comment the css

* Code clean up

* Made visualizer importable and added some docs

* Fix styling

* implement comments from PR

* Fixed the regex for UNK tokens and examples in notebook

* Converted docs to google format

* Added a notebook showing multiple languages and tokenizers

* Added visual indication of chars that are tokenized with >1 token

* Reorganize things a bit and fix import

* Update docs

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-12-04 10:25:56 -05:00
Anthony MOI
5549fc4837 Python - Update CHANGELOG 2020-11-28 12:42:37 -05:00
Anthony MOI
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
Anthony MOI
999067454d Make sure we first try to extract a string 2020-11-28 12:29:35 -05:00
Anthony MOI
ed9baeabb7 Add example for training with datasets 2020-11-28 12:29:35 -05:00
Anthony MOI
c36ac0bfdf Improve progress tracking while training 2020-11-28 12:29:35 -05:00
Anthony MOI
75deaecdd0 Also accept iterators of batches in train_from_iterator 2020-11-28 12:29:35 -05:00
Anthony MOI
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
Anthony MOI
6e364cb685 Python - Update CHANGELOG and stub files 2020-11-27 17:35:34 -05:00
Anthony MOI
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
3eb7ef6d0a Python - PreTokenizers can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
5c35fafc44 Python - Decoders can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
091287dcf5 Python - Use macro for getter/setter in models 2020-11-27 17:35:34 -05:00
Anthony MOI
2feccdbbfa Python - PyStrip can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
7512d5e4ce Python - PyBertNormalizer can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
78beae8b7d Python - PyWordLevel can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
760537aad3 Python - PyWordPiece can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
Anthony MOI
76d3b2128b Python - PyBPE can get/set its attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
7f3cfebf45 Python - PyModel uses a RwLock to allow modifications 2020-11-27 17:35:34 -05:00
Patrick von Platen
dd399d2ad0 Split Pre-Tokenizer (#542)
* start playing around

* make a first version

* refactor

* apply make format

* add python bindings

* add some python binding tests

* correct pre-tokenizers

* update auto-generated bindings

* lint python bindings

* add code node

* add split to docs

* refactor python binding a bit

* cargo fmt

* clippy and fmt in node

* quick updates and fixes

* Oops

* Update node typings

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
Anthony MOI
58e1d8de67 Python - Improve documentation for trainers 2020-11-23 11:52:51 -05:00
Anthony MOI
64441b54b1 Python - Improve documentation for post-processors 2020-11-23 11:52:51 -05:00
Anthony MOI
933a2a9c99 Python - Improve pre-tokenizers docs 2020-11-23 11:52:51 -05:00
Anthony MOI
5842b3db73 Python - Improve normalizers docs 2020-11-23 11:52:51 -05:00
Anthony MOI
c01c301743 Python - Improve documentation for decoders and remove useless kwargs 2020-11-23 11:52:51 -05:00
Anthony MOI
a50d4b7d25 Python - Improve documentation for models 2020-11-23 11:52:51 -05:00
Nick
dc60d4fc0c Fix BaseTokenizer enable_truncation docstring 2020-11-23 11:28:26 -05:00
Anthony MOI
2fbd6779f6 Make sure TrainerWrapper can only train the right Model 2020-11-20 13:30:44 -05:00
Anthony MOI
13e07da2c8 Node - Add WordLevelTrainer 2020-11-20 13:30:44 -05:00
Anthony MOI
387b8a1033 Generate pyi, fix tests and clippy warnings 2020-11-20 13:30:44 -05:00
Anthony MOI
5059be1a8d Test BPE keeping its options after training 2020-11-20 13:30:44 -05:00
Anthony MOI
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
Anthony MOI
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except for the vocabulary, when it is trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
Anthony MOI
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00
Anthony MOI
c230183cf6 A Model can return its associated Trainer 2020-11-20 13:30:44 -05:00
Anthony MOI
059d43b265 Add WordLevel trainer 2020-11-20 13:30:44 -05:00
Anthony MOI
cb471d1380 Restore latest stable in rust-toolchain 2020-11-19 16:17:34 -05:00
Anthony MOI
58b618f98e Python - Update __init__.pyi 2020-11-17 15:28:41 -05:00
Nicolas Patry
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
Nicolas Patry
fff856cff7 New PR to fix #270 (not #157). (#516)
* New PR to fix #270 (not #157).

Drastically reduce the number of required compilation flags.
I think it's good enough to merge right now. We disable progress
altogether when the `progressbar` flag is disabled, which is perfectly
fine compared to not being able to build.

Future PRs could include:

- Better encapsulation of `progress` in training call sites (fewer direct
calls to `indicatif` and common code for `setup_progress`, `finalize`
and so on).
- We could have a raw `print` progress bar when the compilation flag is
disabled?
- Having better control of progress bars in bindings would require us to
change a bunch of code around, which might be overkill in the short term.
Either we start by defining a trait for our ProgressBar, and the
bindings can implement the trait with custom `tqdm` and `cli-progress`
(it's not even 100% sure it's doable)
- The easiest way would be to enable some sort of iterator in Rust
so that calls to progress bars can happen in client code, which would
be the most lenient for all platforms. The hard part is that
leveraging parallelism in that setting would probably be hard.

* Remove external visibility of progressbar.

* Remove dead import.
2020-11-11 10:51:27 +01:00
Nicolas Patry
b122737ec6 Moving to manylinux2010 and remove nightly on Windows. (#455)
* Moving to manylinux2010 and remove nightly on Windows.

* Add build for manylinux2014 for powerpc and aarch64 + Python v3.9

* Also add support for IBM mainframe

* try with env variables

* Move extra builds to their own workflow

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-09 23:23:07 -05:00