1056 Commits

Author SHA1 Message Date
49bd055519 Node - Update bindings with train_from_files 2020-11-28 12:29:35 -05:00
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
999067454d Make sure we first try to extract a string 2020-11-28 12:29:35 -05:00
ed9baeabb7 Add example for training with datasets 2020-11-28 12:29:35 -05:00
c36ac0bfdf Improve progress tracking while training 2020-11-28 12:29:35 -05:00
75deaecdd0 Also accept iterators of batches in train_from_iterator 2020-11-28 12:29:35 -05:00
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
6e364cb685 Python - Update CHANGELOG and stub files 2020-11-27 17:35:34 -05:00
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
3eb7ef6d0a Python - PreTokenizers can get/set their attributes 2020-11-27 17:35:34 -05:00
5c35fafc44 Python - Decoders can get/set their attributes 2020-11-27 17:35:34 -05:00
091287dcf5 Python - Use macro for getter/setter in models 2020-11-27 17:35:34 -05:00
2feccdbbfa Python - PyStrip can get/set its attributes 2020-11-27 17:35:34 -05:00
7512d5e4ce Python - PyBertNormalizer can get/set its attributes 2020-11-27 17:35:34 -05:00
78beae8b7d Python - PyWordLevel can get/set its attributes 2020-11-27 17:35:34 -05:00
760537aad3 Python - PyWordPiece can get/set its attributes 2020-11-27 17:35:34 -05:00
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
76d3b2128b Python - PyBPE can get/set its attributes 2020-11-27 17:35:34 -05:00
7f3cfebf45 Python - PyModel uses a RwLock to allow modifications 2020-11-27 17:35:34 -05:00
dd399d2ad0 Split Pre-Tokenizer (#542)
* start playing around

* make a first version

* refactor

* apply make format

* add python bindings

* add some python binding tests

* correct pre-tokenizers

* update auto-generated bindings

* lint python bindings

* add code node

* add split to docs

* refactor python binding a bit

* cargo fmt

* clippy and fmt in node

* quick updates and fixes

* Oops

* Update node typings

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
58e1d8de67 Python - Improve documentation for trainers 2020-11-23 11:52:51 -05:00
64441b54b1 Python - Improve documentation for post-processors 2020-11-23 11:52:51 -05:00
933a2a9c99 Python - Improve pre-tokenizers docs 2020-11-23 11:52:51 -05:00
5842b3db73 Python - Improve normalizers docs 2020-11-23 11:52:51 -05:00
c01c301743 Python - Improve documentation for decoders and remove useless kwargs 2020-11-23 11:52:51 -05:00
a50d4b7d25 Python - Improve documentation for models 2020-11-23 11:52:51 -05:00
dc60d4fc0c Fix BaseTokenizer enable_truncation docstring 2020-11-23 11:28:26 -05:00
2fbd6779f6 Make sure TrainerWrapper can only train the right Model 2020-11-20 13:30:44 -05:00
13e07da2c8 Node - Add WordLevelTrainer 2020-11-20 13:30:44 -05:00
7fc37a03e8 Node - Trainers train the Model in-place 2020-11-20 13:30:44 -05:00
387b8a1033 Generate pyi, fix tests and clippy warnings 2020-11-20 13:30:44 -05:00
5059be1a8d Test BPE keeping its options after training 2020-11-20 13:30:44 -05:00
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except for the vocabulary, when it is trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00
c230183cf6 A Model can return its associated Trainer 2020-11-20 13:30:44 -05:00
059d43b265 Add WordLevel trainer 2020-11-20 13:30:44 -05:00
cb471d1380 Restore latest stable in rust-toolchain 2020-11-19 16:17:34 -05:00
58b618f98e Python - Update __init__.pyi 2020-11-17 15:28:41 -05:00
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatically stubbing our Python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
fff856cff7 New PR to fix #270 (not #157). (#516)
* New PR to fix #270 (not #157).

Drastically reduce the number of required compilation flags.
I think it's good enough for merge right now. We disable progress
altogether when the `progressbar` flag is disabled, which is perfectly
fine compared to not being able to build.

Future PRs could include:

- Better encapsulation of `progress` in training call sites (fewer direct
calls to `indicatif`, and common code for `setup_progress`, `finalize`
and so on).
- We could have a raw `print` progress bar when the compilation flag is
disabled?
- Having better control of progress bars in bindings would require us to
change a bunch of code around, which might be overkill in the short term.
Either we start by defining a trait for our ProgressBar, and the
bindings can implement the trait with custom `tqdm` and `cli-progress`
(it's not even 100% sure it's doable)
- The easiest way would be to enable some sort of iterator in Rust
  so that calling of progress bars can happen in client code, which would
  be the most lenient for all platforms. The hard part is that
  leveraging parallelism in that setting would probably be hard.

* Remove external visibility of progressbar.

* Remove dead import.
2020-11-11 10:51:27 +01:00
b122737ec6 Moving to manylinux2010 and remove nightly on Windows. (#455)
* Moving to manylinux2010 and remove nightly on Windows.

* Add build for manylinux2014 for powerpc and aarch64 + Python v3.9

* Also add support for IBM mainframe

* try with env variables

* Move extra builds to their own workflow

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-09 23:23:07 -05:00
75b41dab0f Python - Update CHANGELOG and bump version for 0.9.4 2020-11-09 16:36:04 -05:00
d3d9f2c76b words -> word_ids & sequences -> sequence_ids 2020-11-09 16:02:07 -05:00
57d162b269 Add an Encoding.sequences to allow masking 2020-11-06 10:41:56 -05:00
385d25720a Simplify the API for Encoding.token_to_XXX 2020-11-06 10:41:56 -05:00
51dbf0b6df Python - Add tests for Encoding 2020-11-06 10:41:56 -05:00
a79cc55e08 Node - Encoding mappings handle sequence_id 2020-11-06 10:41:56 -05:00
dce218ca28 Python - Encoding mappings handle sequence_id 2020-11-06 10:41:56 -05:00
5cccaefcee typo: from_files() renamed to from_file() 2020-11-04 08:15:32 -05:00