tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-03 11:18:29 +00:00

Author	SHA1	Message	Date
Anthony MOI	cb471d1380	Restore latest stable in rust-toolchain	2020-11-19 16:17:34 -05:00
Anthony MOI	58b618f98e	Python - Update __init__.pyi	2020-11-17 15:28:41 -05:00
Nicolas Patry	352c92ad33	Automatically stubbing the `pyi` files while keeping inspecting ability (#509) * First pass on automatic stubbing our python files. * And now modifying all rust docs to be visible in Pyi files. * Better assert fail message. * Fixing github workflow. * Removing types not exported anymore. * Fixing `Tokenizer` signature. * Disabling auto __init__.py. * Re-enabling some types. * Don't overwrite non automated __init__.py * Automated most __init__.py * Restubbing after rebase. * Fixing env for tests. * Install blakc in the env. * Use PY35 target in stub.py Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-17 15:13:00 -05:00
Nicolas Patry	fff856cff7	New PR to fix #270 (not #157). (#516) * New PR to fix #270 (not #157). Reduce drastically the number of required compilation flags. I think it's good enough for merge right now. We disable progress altogether when the `progressbar` flag is disabled which is perfectly fine compared to not being able to build. Future PR could include. - Better encapsulation of `progress` in training call sites (less direct calls to `indicatif` and common code for `setup_progress`, `finalize` and so on. - We can have a raw `print` Progress bar when compilation flag is disabled ? - Having better control of progressbars in bindings would require use to change a bunch of code around which might be overkill in the short term. Either we start by defining a trait for our ProgressBar, and the bindings can implement the traits with custom `tqdm` and `cli-progress` (It's not even 100% sure it's doable) - The easiest way would be to enable some sort of iterator in Rust so that calling of progressbars can happen in client code which would be the most lenient for all plateforms. The hard part is that leveraging parallelism in that setting would be hard probably. * Remove external visibility of progressbar. * Remove dead import.	2020-11-11 10:51:27 +01:00
Nicolas Patry	b122737ec6	Moving to manylinux2010 and remove nightly on Windows. (#455) * Moving to manylinux2010 and remove nightly on Windows. * Add build for manylinux2014 for powerpc and aarch64 + Python v3.9 * Also add support for IBM mainframe * try with env variables * Move extra builds to their own workflow Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-09 23:23:07 -05:00
Anthony MOI	75b41dab0f	Python - Update CHANGELOG and bump version for 0.9.4	2020-11-09 16:36:04 -05:00
Anthony MOI	d3d9f2c76b	words -> word_ids & sequences -> sequence_ids	2020-11-09 16:02:07 -05:00
Anthony MOI	57d162b269	Add an Encoding.sequences to allow masking	2020-11-06 10:41:56 -05:00
Anthony MOI	385d25720a	Simplify the API for Encoding.token_to_XXX	2020-11-06 10:41:56 -05:00
Anthony MOI	51dbf0b6df	Python - Add tests for Encoding	2020-11-06 10:41:56 -05:00
Anthony MOI	dce218ca28	Python - Encoding mappings handle sequence_id	2020-11-06 10:41:56 -05:00
taufique74	5cccaefcee	typo: from_files() renamed to from_file()	2020-11-04 08:15:32 -05:00
Mohamed Al Salti	20c7045ba1	Update sentencepiece_unigram.py Update the URL to `sentencepiece_model_pb2.py` in the error message.	2020-11-04 08:08:11 -05:00
Anthony MOI	d788a950ac	Doc - Fixes some CI fails	2020-11-02 17:07:27 -05:00
Anthony MOI	324aa2930a	Doc - Improve python and node tests	2020-11-02 17:07:27 -05:00
Anthony MOI	b6ffd9cba0	Doc - Cleanup old tests & node lints	2020-11-02 17:07:27 -05:00
Anthony MOI	9521603e08	Doc - Update Decoder part of the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	8b65c1f4bc	Doc - Update Bert example on the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	620769fd4b	Doc - Update PreTokenizer part of the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	13a80050f0	Doc - Update Normalizer part of the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	4cf0a0b72c	Doc - Quicktour uses python tested code	2020-11-02 17:07:27 -05:00
Anthony MOI	d2fc0e4836	Doc - Update API Reference for Encoding	2020-11-02 17:07:27 -05:00
Anthony MOI	a86d49634c	Doc - API Reference for most Tokenizer methods/attributes	2020-11-02 17:07:27 -05:00
Anthony MOI	8c0370657e	Doc - Update API Reference on more Tokenizer methods	2020-11-02 17:07:27 -05:00
Anthony MOI	ddabe130cd	Doc - Updated API Reference for AddedToken	2020-11-02 17:07:27 -05:00
Anthony MOI	79f02bb7f0	Doc - Updated API Reference for encode/encode_batch	2020-11-02 17:07:27 -05:00
Anthony MOI	3ee54766e3	Doc - Backbone for API Reference	2020-11-02 17:07:27 -05:00
Anthony MOI	000c19a7a5	Doc - Improve snippets testing	2020-11-02 17:07:27 -05:00
Anthony MOI	e865b7cd7c	Customize the doc for each language	2020-11-02 17:07:27 -05:00
Nicolas Patry	655809c718	Attempt to get some documentation going.	2020-11-02 17:07:27 -05:00
taufique74	4929809af0	makes from_file() method static	2020-11-01 13:15:15 -05:00
Anthony MOI	991128f9e1	Node - Fix models init methods & add WordLevel	2020-10-30 13:47:04 -04:00
Anthony MOI	2364d376f7	Python - Update CHANGELOG and bump to 0.9.3 for release	2020-10-26 16:40:24 -04:00
Anthony MOI	466f5303eb	Fix UnigramTrainer	2020-10-26 16:31:58 -04:00
Anthony MOI	73b5da917f	Unigram - Add special_tokens at the end of training + optional unk	2020-10-26 10:57:29 -04:00
Anthony MOI	1a6f4b5204	Allow initial_alphabet on UnigramTrainer	2020-10-26 10:57:29 -04:00
Timur Ganiev	f7c61c267a	Fixed `BPE.read_files` -> `BPE.read_file` in SentencePieceBPETokenizer	2020-10-26 10:57:14 -04:00
Anthony MOI	a2289d49b4	Finish exposing the UnicodeScripts PreTokenizer	2020-10-21 11:01:54 -04:00
Nicolas Patry	180371d929	Fixing hanging error while acquiring GIL from custom pretokenizer during training. (#470) * Fixing hanging error while acquiring GIL from custom pretokenizer during training. Fixes #469 * cleanup Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-10-20 14:23:39 -04:00
Anthony MOI	91f602f744	Python - Update CHANGELOG and bump to 0.9.2 for release	2020-10-15 10:14:58 -04:00
Nicolas Patry	2ccd16bf5c	Adding a new tests for `PreTokenizer.custom`. This example is more illustrative of what's doable for custom PreTokenizer.	2020-10-15 10:07:48 -04:00
Anthony MOI	f94a274702	Python - Update CHANGELOG and bump version for release	2020-10-13 14:45:21 -04:00
Nicolas Patry	88556790e7	Fixing a bug where long tokenizer files would be incorrectly deserialized (#459) * Fixing a bug where long tokenizer files would be incorrectly deserialized - Add a bunch of tests to check deserialization behaviour - One tests also confirms current Single deserialization of Sequence. * Better test locations for Windows + no file dependency in Python binding Rust side. * Adressing @n1t0 comments.	2020-10-13 18:44:24 +02:00
Anthony MOI	3bb794681c	Python - Use 1.46.0 for now	2020-10-09 13:40:35 -04:00
Anthony MOI	83e11a8de4	Python - Update dependencies for release	2020-10-09 13:09:35 -04:00
Anthony MOI	4f4ba4a11a	Python - Bump version for 0.9.0 release	2020-10-09 13:00:19 -04:00
Nicolas Patry	fbca797b3d	Fixing Trainer with u8 instead of chars. (#452) * Fixing Trainer with u8 instead of chars. Now check both optimized and unoptimized encodings schemes for Unigram. * Small fixes. * Fixing makefile.	2020-10-09 18:57:14 +02:00
Nicolas Patry	dd9fda5d05	Bump rc version.	2020-10-06 11:04:36 +02:00
Anthony MOI	aebf510c5a	Python - Update CHANGELOG and bump to 0.9.0.rc1	2020-09-29 10:24:24 -04:00
Anthony MOI	ff57504972	Python - Add some more test for TemplateProcessing	2020-09-29 10:09:10 -04:00

539 Commits