Commit Graph

561 Commits

Author SHA1 Message Date
78beae8b7d Python - PyWordLevel can get/set its attributes 2020-11-27 17:35:34 -05:00
760537aad3 Python - PyWordPiece can get/set its attributes 2020-11-27 17:35:34 -05:00
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
76d3b2128b Python - PyBPE can get/set its attributes 2020-11-27 17:35:34 -05:00
7f3cfebf45 Python - PyModel uses a RwLock to allow modifications 2020-11-27 17:35:34 -05:00
dd399d2ad0 Split Pre-Tokenizer (#542)
* start playing around

* make a first version

* refactor

* apply make format

* add python bindings

* add some python binding tests

* correct pre-tokenizers

* update auto-generated bindings

* lint python bindings

* add code node

* add split to docs

* refactor python binding a bit

* cargo fmt

* clippy and fmt in node

* quick updates and fixes

* Oops

* Update node typings

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
58e1d8de67 Python - Improve documentation for trainers 2020-11-23 11:52:51 -05:00
64441b54b1 Python - Improve documentation for post-processors 2020-11-23 11:52:51 -05:00
933a2a9c99 Python - Improve pre-tokenizers docs 2020-11-23 11:52:51 -05:00
5842b3db73 Python - Improve normalizers docs 2020-11-23 11:52:51 -05:00
c01c301743 Python - Improve documentation for decoders and remove useless kwargs 2020-11-23 11:52:51 -05:00
a50d4b7d25 Python - Improve documentation for models 2020-11-23 11:52:51 -05:00
dc60d4fc0c Fix BaseTokenizer enable_truncation docstring 2020-11-23 11:28:26 -05:00
2fbd6779f6 Make sure TrainerWrapper can only train the right Model 2020-11-20 13:30:44 -05:00
13e07da2c8 Node - Add WordLevelTrainer 2020-11-20 13:30:44 -05:00
387b8a1033 Generate pyi, fix tests and clippy warnings 2020-11-20 13:30:44 -05:00
5059be1a8d Test BPE keeping its options after training 2020-11-20 13:30:44 -05:00
284a1dbee7 PyModel uses a RwLock to allow modifications 2020-11-20 13:30:44 -05:00
54c7210b2f Train Model in place
This lets us keep everything that was set on the model, except the vocabulary, when it is trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
2020-11-20 13:30:44 -05:00
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00
c230183cf6 A Model can return its associated Trainer 2020-11-20 13:30:44 -05:00
059d43b265 Add WordLevel trainer 2020-11-20 13:30:44 -05:00
cb471d1380 Restore latest stable in rust-toolchain 2020-11-19 16:17:34 -05:00
58b618f98e Python - Update __init__.pyi 2020-11-17 15:28:41 -05:00
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
fff856cff7 New PR to fix #270 (not #157). (#516)
* New PR to fix #270 (not #157).

Drastically reduce the number of required compilation flags.
I think it's good enough for merge right now. We disable progress
altogether when the `progressbar` flag is disabled which is perfectly
fine compared to not being able to build.

Future PRs could include:

- Better encapsulation of `progress` in training call sites (less direct
calls to `indicatif` and common code for `setup_progress`, `finalize`
and so on).
- We could have a raw `print` progress bar when the compilation flag is
disabled?
- Having better control of progressbars in bindings would require us to
change a bunch of code around which might be overkill in the short term.
Either we start by defining a trait for our ProgressBar, and the
bindings can implement the traits with custom `tqdm` and `cli-progress`
(It's not even 100% sure it's doable)
- The easiest way would be to enable some sort of iterator in Rust
  so that calling of progressbars can happen in client code which would
  be the most lenient for all platforms. The hard part is that
leveraging parallelism in that setting would be hard probably.

* Remove external visibility of progressbar.

* Remove dead import.
2020-11-11 10:51:27 +01:00
b122737ec6 Moving to manylinux2010 and remove nightly on Windows. (#455)
* Moving to manylinux2010 and remove nightly on Windows.

* Add build for manylinux2014 for powerpc and aarch64 + Python v3.9

* Also add support for IBM mainframe

* try with env variables

* Move extra builds to their own workflow

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-09 23:23:07 -05:00
75b41dab0f Python - Update CHANGELOG and bump version for 0.9.4 2020-11-09 16:36:04 -05:00
d3d9f2c76b words -> word_ids & sequences -> sequence_ids 2020-11-09 16:02:07 -05:00
57d162b269 Add an Encoding.sequences to allow masking 2020-11-06 10:41:56 -05:00
385d25720a Simplify the API for Encoding.token_to_XXX 2020-11-06 10:41:56 -05:00
51dbf0b6df Python - Add tests for Encoding 2020-11-06 10:41:56 -05:00
dce218ca28 Python - Encoding mappings handle sequence_id 2020-11-06 10:41:56 -05:00
5cccaefcee typo: from_files() renamed to from_file() 2020-11-04 08:15:32 -05:00
20c7045ba1 Update sentencepiece_unigram.py
Update the URL to `sentencepiece_model_pb2.py` in the error message.
2020-11-04 08:08:11 -05:00
d788a950ac Doc - Fixes some CI fails 2020-11-02 17:07:27 -05:00
324aa2930a Doc - Improve python and node tests 2020-11-02 17:07:27 -05:00
b6ffd9cba0 Doc - Cleanup old tests & node lints 2020-11-02 17:07:27 -05:00
9521603e08 Doc - Update Decoder part of the Pipeline page 2020-11-02 17:07:27 -05:00
8b65c1f4bc Doc - Update Bert example on the Pipeline page 2020-11-02 17:07:27 -05:00
620769fd4b Doc - Update PreTokenizer part of the Pipeline page 2020-11-02 17:07:27 -05:00
13a80050f0 Doc - Update Normalizer part of the Pipeline page 2020-11-02 17:07:27 -05:00
4cf0a0b72c Doc - Quicktour uses python tested code 2020-11-02 17:07:27 -05:00
d2fc0e4836 Doc - Update API Reference for Encoding 2020-11-02 17:07:27 -05:00
a86d49634c Doc - API Reference for most Tokenizer methods/attributes 2020-11-02 17:07:27 -05:00
8c0370657e Doc - Update API Reference on more Tokenizer methods 2020-11-02 17:07:27 -05:00
ddabe130cd Doc - Updated API Reference for AddedToken 2020-11-02 17:07:27 -05:00
79f02bb7f0 Doc - Updated API Reference for encode/encode_batch 2020-11-02 17:07:27 -05:00
3ee54766e3 Doc - Backbone for API Reference 2020-11-02 17:07:27 -05:00
000c19a7a5 Doc - Improve snippets testing 2020-11-02 17:07:27 -05:00