Commit Graph

539 Commits

Author SHA1 Message Date
Anthony MOI
cb471d1380 Restore latest stable in rust-toolchain 2020-11-19 16:17:34 -05:00
Anthony MOI
58b618f98e Python - Update __init__.pyi 2020-11-17 15:28:41 -05:00
Nicolas Patry
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
Nicolas Patry
fff856cff7 New PR to fix #270 (not #157). (#516)
* New PR to fix #270 (not #157).

Reduce drastically the number of required compilation flags.
I think it's good enough for merge right now. We disable progress
altogether when the `progressbar` flag is disabled which is perfectly
fine compared to not being able to build.

Future PR could include.

- Better encapsulation of `progress` in training call sites (less direct
calls to `indicatif` and common code for `setup_progress`, `finalize`
and so on).
- We can have a raw `print` Progress bar when compilation flag is
disabled ?
- Having better control of progressbars in bindings would require us to
change a bunch of code around which might be overkill in the short term.
Either we start by defining a trait for our ProgressBar, and the
bindings can implement the traits with custom `tqdm` and `cli-progress`
(It's not even 100% sure it's doable)
- The easiest way would be to enable some sort of iterator in Rust
  so that calling of progressbars can happen in client code which would
  be the most lenient for all platforms. The hard part is that
leveraging parallelism in that setting would be hard probably.

* Remove external visibility of progressbar.

* Remove dead import.
2020-11-11 10:51:27 +01:00
Nicolas Patry
b122737ec6 Moving to manylinux2010 and remove nightly on Windows. (#455)
* Moving to manylinux2010 and remove nightly on Windows.

* Add build for manylinux2014 for powerpc and aarch64 + Python v3.9

* Also add support for IBM mainframe

* try with env variables

* Move extra builds to their own workflow

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-09 23:23:07 -05:00
Anthony MOI
75b41dab0f Python - Update CHANGELOG and bump version for 0.9.4 2020-11-09 16:36:04 -05:00
Anthony MOI
d3d9f2c76b words -> word_ids & sequences -> sequence_ids 2020-11-09 16:02:07 -05:00
Anthony MOI
57d162b269 Add an Encoding.sequences to allow masking 2020-11-06 10:41:56 -05:00
Anthony MOI
385d25720a Simplify the API for Encoding.token_to_XXX 2020-11-06 10:41:56 -05:00
Anthony MOI
51dbf0b6df Python - Add tests for Encoding 2020-11-06 10:41:56 -05:00
Anthony MOI
dce218ca28 Python - Encoding mappings handle sequence_id 2020-11-06 10:41:56 -05:00
taufique74
5cccaefcee typo: from_files() renamed to from_file() 2020-11-04 08:15:32 -05:00
Mohamed Al Salti
20c7045ba1 Update sentencepiece_unigram.py
Update the URL to `sentencepiece_model_pb2.py` in the error message.
2020-11-04 08:08:11 -05:00
Anthony MOI
d788a950ac Doc - Fixes some CI fails 2020-11-02 17:07:27 -05:00
Anthony MOI
324aa2930a Doc - Improve python and node tests 2020-11-02 17:07:27 -05:00
Anthony MOI
b6ffd9cba0 Doc - Cleanup old tests & node lints 2020-11-02 17:07:27 -05:00
Anthony MOI
9521603e08 Doc - Update Decoder part of the Pipeline page 2020-11-02 17:07:27 -05:00
Anthony MOI
8b65c1f4bc Doc - Update Bert example on the Pipeline page 2020-11-02 17:07:27 -05:00
Anthony MOI
620769fd4b Doc - Update PreTokenizer part of the Pipeline page 2020-11-02 17:07:27 -05:00
Anthony MOI
13a80050f0 Doc - Update Normalizer part of the Pipeline page 2020-11-02 17:07:27 -05:00
Anthony MOI
4cf0a0b72c Doc - Quicktour uses python tested code 2020-11-02 17:07:27 -05:00
Anthony MOI
d2fc0e4836 Doc - Update API Reference for Encoding 2020-11-02 17:07:27 -05:00
Anthony MOI
a86d49634c Doc - API Reference for most Tokenizer methods/attributes 2020-11-02 17:07:27 -05:00
Anthony MOI
8c0370657e Doc - Update API Reference on more Tokenizer methods 2020-11-02 17:07:27 -05:00
Anthony MOI
ddabe130cd Doc - Updated API Reference for AddedToken 2020-11-02 17:07:27 -05:00
Anthony MOI
79f02bb7f0 Doc - Updated API Reference for encode/encode_batch 2020-11-02 17:07:27 -05:00
Anthony MOI
3ee54766e3 Doc - Backbone for API Reference 2020-11-02 17:07:27 -05:00
Anthony MOI
000c19a7a5 Doc - Improve snippets testing 2020-11-02 17:07:27 -05:00
Anthony MOI
e865b7cd7c Customize the doc for each language 2020-11-02 17:07:27 -05:00
Nicolas Patry
655809c718 Attempt to get some documentation going. 2020-11-02 17:07:27 -05:00
taufique74
4929809af0 makes from_file() method static 2020-11-01 13:15:15 -05:00
Anthony MOI
991128f9e1 Node - Fix models init methods & add WordLevel 2020-10-30 13:47:04 -04:00
Anthony MOI
2364d376f7 Python - Update CHANGELOG and bump to 0.9.3 for release 2020-10-26 16:40:24 -04:00
Anthony MOI
466f5303eb Fix UnigramTrainer 2020-10-26 16:31:58 -04:00
Anthony MOI
73b5da917f Unigram - Add special_tokens at the end of training + optional unk 2020-10-26 10:57:29 -04:00
Anthony MOI
1a6f4b5204 Allow initial_alphabet on UnigramTrainer 2020-10-26 10:57:29 -04:00
Timur Ganiev
f7c61c267a Fixed BPE.read_files -> BPE.read_file in SentencePieceBPETokenizer 2020-10-26 10:57:14 -04:00
Anthony MOI
a2289d49b4 Finish exposing the UnicodeScripts PreTokenizer 2020-10-21 11:01:54 -04:00
Nicolas Patry
180371d929 Fixing hanging error while acquiring GIL from custom pretokenizer during training. (#470)
* Fixing hanging error while acquiring GIL from custom pretokenizer
during training.

Fixes #469

* cleanup

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-10-20 14:23:39 -04:00
Anthony MOI
91f602f744 Python - Update CHANGELOG and bump to 0.9.2 for release 2020-10-15 10:14:58 -04:00
Nicolas Patry
2ccd16bf5c Adding a new test for PreTokenizer.custom.
This example is more illustrative of what's doable for custom
PreTokenizer.
2020-10-15 10:07:48 -04:00
Anthony MOI
f94a274702 Python - Update CHANGELOG and bump version for release 2020-10-13 14:45:21 -04:00
Nicolas Patry
88556790e7 Fixing a bug where long tokenizer files would be incorrectly deserialized (#459)
* Fixing a bug where long tokenizer files would be incorrectly
deserialized

- Add a bunch of tests to check deserialization behaviour
- One test also confirms current Single deserialization of Sequence.

* Better test locations for Windows + no file dependency in Python binding
Rust side.

* Addressing @n1t0 comments.
2020-10-13 18:44:24 +02:00
Anthony MOI
3bb794681c Python - Use 1.46.0 for now 2020-10-09 13:40:35 -04:00
Anthony MOI
83e11a8de4 Python - Update dependencies for release 2020-10-09 13:09:35 -04:00
Anthony MOI
4f4ba4a11a Python - Bump version for 0.9.0 release 2020-10-09 13:00:19 -04:00
Nicolas Patry
fbca797b3d Fixing Trainer with u8 instead of chars. (#452)
* Fixing Trainer with u8 instead of chars.

Now check both optimized and unoptimized encoding schemes for Unigram.

* Small fixes.

* Fixing makefile.
2020-10-09 18:57:14 +02:00
Nicolas Patry
dd9fda5d05 Bump rc version. 2020-10-06 11:04:36 +02:00
Anthony MOI
aebf510c5a Python - Update CHANGELOG and bump to 0.9.0.rc1 2020-09-29 10:24:24 -04:00
Anthony MOI
ff57504972 Python - Add some more test for TemplateProcessing 2020-09-29 10:09:10 -04:00