tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-09-02 23:39:14 +00:00

Author	SHA1	Message	Date
Anthony MOI	78beae8b7d	Python - PyWordLevel can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	760537aad3	Python - PyWordPiece can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	c22cfc31f9	Python - PyNormalizer & PyPreTokenizer use a RwLock	2020-11-27 17:35:34 -05:00
Anthony MOI	76d3b2128b	Python - PyBPE can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	7f3cfebf45	Python - PyModel uses a RwLock to allow modifications	2020-11-27 17:35:34 -05:00
Patrick von Platen	dd399d2ad0	Split Pre-Tokenizer (#542) * start playing around * make a first version * refactor * apply make format * add python bindings * add some python binding tests * correct pre-tokenizers * update auto-generated bindings * lint python bindings * add code node * add split to docs * refactor python binding a bit * cargo fmt * clippy and fmt in node * quick updates and fixes * Oops * Update node typings * Update changelog Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-27 17:07:03 -05:00
Anthony MOI	58e1d8de67	Python - Improve documentation for trainers	2020-11-23 11:52:51 -05:00
Anthony MOI	64441b54b1	Python - Improve documentation for post-processors	2020-11-23 11:52:51 -05:00
Anthony MOI	933a2a9c99	Python - Improve pre-tokenizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	5842b3db73	Python - Improve normalizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	c01c301743	Python - Improve documentation for decoders and remove useless kwargs	2020-11-23 11:52:51 -05:00
Anthony MOI	a50d4b7d25	Python - Improve documentation for models	2020-11-23 11:52:51 -05:00
Nick	dc60d4fc0c	Fix BaseTokenizer enable_truncation docstring	2020-11-23 11:28:26 -05:00
Anthony MOI	2fbd6779f6	Make sure TrainerWrapper can only train the right Model	2020-11-20 13:30:44 -05:00
Anthony MOI	13e07da2c8	Node - Add WordLevelTrainer	2020-11-20 13:30:44 -05:00
Anthony MOI	387b8a1033	Generate pyi, fix tests and clippy warnings	2020-11-20 13:30:44 -05:00
Anthony MOI	5059be1a8d	Test BPE keeping its options after training	2020-11-20 13:30:44 -05:00
Anthony MOI	284a1dbee7	PyModel uses a RwLock to allow modifications	2020-11-20 13:30:44 -05:00
Anthony MOI	54c7210b2f	Train Model in place This let us keep everything that was set on the model except from the vocabulary when trained. For example, this let us keep the configured `unk_token` of BPE when its trained.	2020-11-20 13:30:44 -05:00
Anthony MOI	224862fe0c	Python - Make the trainer optional on Tokenizer.train	2020-11-20 13:30:44 -05:00
Anthony MOI	c230183cf6	A Model can return its associated Trainer	2020-11-20 13:30:44 -05:00
Anthony MOI	059d43b265	Add WordLevel trainer	2020-11-20 13:30:44 -05:00
Anthony MOI	cb471d1380	Restore latest stable in rust-toolchain	2020-11-19 16:17:34 -05:00
Anthony MOI	58b618f98e	Python - Update __init__.pyi	2020-11-17 15:28:41 -05:00
Nicolas Patry	352c92ad33	Automatically stubbing the `pyi` files while keeping inspecting ability (#509) * First pass on automatic stubbing our python files. * And now modifying all rust docs to be visible in Pyi files. * Better assert fail message. * Fixing github workflow. * Removing types not exported anymore. * Fixing `Tokenizer` signature. * Disabling auto __init__.py. * Re-enabling some types. * Don't overwrite non automated __init__.py * Automated most __init__.py * Restubbing after rebase. * Fixing env for tests. * Install blakc in the env. * Use PY35 target in stub.py Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-17 15:13:00 -05:00
Nicolas Patry	fff856cff7	New PR to fix #270 (not #157). (#516) * New PR to fix #270 (not #157). Reduce drastically the number of required compilation flags. I think it's good enough for merge right now. We disable progress altogether when the `progressbar` flag is disabled which is perfectly fine compared to not being able to build. Future PR could include. - Better encapsulation of `progress` in training call sites (less direct calls to `indicatif` and common code for `setup_progress`, `finalize` and so on. - We can have a raw `print` Progress bar when compilation flag is disabled ? - Having better control of progressbars in bindings would require use to change a bunch of code around which might be overkill in the short term. Either we start by defining a trait for our ProgressBar, and the bindings can implement the traits with custom `tqdm` and `cli-progress` (It's not even 100% sure it's doable) - The easiest way would be to enable some sort of iterator in Rust so that calling of progressbars can happen in client code which would be the most lenient for all plateforms. The hard part is that leveraging parallelism in that setting would be hard probably. * Remove external visibility of progressbar. * Remove dead import.	2020-11-11 10:51:27 +01:00
Nicolas Patry	b122737ec6	Moving to manylinux2010 and remove nightly on Windows. (#455) * Moving to manylinux2010 and remove nightly on Windows. * Add build for manylinux2014 for powerpc and aarch64 + Python v3.9 * Also add support for IBM mainframe * try with env variables * Move extra builds to their own workflow Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-09 23:23:07 -05:00
Anthony MOI	75b41dab0f	Python - Update CHANGELOG and bump version for 0.9.4	2020-11-09 16:36:04 -05:00
Anthony MOI	d3d9f2c76b	words -> word_ids & sequences -> sequence_ids	2020-11-09 16:02:07 -05:00
Anthony MOI	57d162b269	Add an Encoding.sequences to allow masking	2020-11-06 10:41:56 -05:00
Anthony MOI	385d25720a	Simplify the API for Encoding.token_to_XXX	2020-11-06 10:41:56 -05:00
Anthony MOI	51dbf0b6df	Python - Add tests for Encoding	2020-11-06 10:41:56 -05:00
Anthony MOI	dce218ca28	Python - Encoding mappings handle sequence_id	2020-11-06 10:41:56 -05:00
taufique74	5cccaefcee	typo: from_files() renamed to from_file()	2020-11-04 08:15:32 -05:00
Mohamed Al Salti	20c7045ba1	Update sentencepiece_unigram.py Update the URL to `sentencepiece_model_pb2.py` in the error message.	2020-11-04 08:08:11 -05:00
Anthony MOI	d788a950ac	Doc - Fixes some CI fails	2020-11-02 17:07:27 -05:00
Anthony MOI	324aa2930a	Doc - Improve python and node tests	2020-11-02 17:07:27 -05:00
Anthony MOI	b6ffd9cba0	Doc - Cleanup old tests & node lints	2020-11-02 17:07:27 -05:00
Anthony MOI	9521603e08	Doc - Update Decoder part of the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	8b65c1f4bc	Doc - Update Bert example on the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	620769fd4b	Doc - Update PreTokenizer part of the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	13a80050f0	Doc - Update Normalizer part of the Pipeline page	2020-11-02 17:07:27 -05:00
Anthony MOI	4cf0a0b72c	Doc - Quicktour uses python tested code	2020-11-02 17:07:27 -05:00
Anthony MOI	d2fc0e4836	Doc - Update API Reference for Encoding	2020-11-02 17:07:27 -05:00
Anthony MOI	a86d49634c	Doc - API Reference for most Tokenizer methods/attributes	2020-11-02 17:07:27 -05:00
Anthony MOI	8c0370657e	Doc - Update API Reference on more Tokenizer methods	2020-11-02 17:07:27 -05:00
Anthony MOI	ddabe130cd	Doc - Updated API Reference for AddedToken	2020-11-02 17:07:27 -05:00
Anthony MOI	79f02bb7f0	Doc - Updated API Reference for encode/encode_batch	2020-11-02 17:07:27 -05:00
Anthony MOI	3ee54766e3	Doc - Backbone for API Reference	2020-11-02 17:07:27 -05:00
Anthony MOI	000c19a7a5	Doc - Improve snippets testing	2020-11-02 17:07:27 -05:00

561 Commits