tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-22 16:25:30 +00:00

Author	SHA1	Message	Date
Tal Perry	8916b6bb27	Add a visualization utility to render tokens and annotations in a notebook (#508) * Draft functionality of visualization * Added comments to make code more intelligble * polish the styles * Ensure colors are stable and comment the css * Code clean up * Made visualizer importable and added some docs * Fix styling * implement comments from PR * Fixed the regex for UNK tokens and examples in notebook * Converted docs to google format * Added a notebook showing multiple languages and tokenizers * Added visual indication of chars that are tokenized with >1 token * Reorganize things a bit and fix import * Update docs Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-12-04 10:25:56 -05:00
Anthony MOI	5549fc4837	Python - Update CHANGELOG	2020-11-28 12:42:37 -05:00
Anthony MOI	49bd055519	Node - Update bindings with train_from_files	2020-11-28 12:29:35 -05:00
Anthony MOI	3a8627ce4d	Improve docs and fix tests around training	2020-11-28 12:29:35 -05:00
Anthony MOI	06f6ba3fce	Use train_from_files in benchmarks	2020-11-28 12:29:35 -05:00
Anthony MOI	999067454d	Make sure we first try to extract a string	2020-11-28 12:29:35 -05:00
Anthony MOI	ed9baeabb7	Add example for training with datasets	2020-11-28 12:29:35 -05:00
Anthony MOI	c36ac0bfdf	Improve progress tracking while training	2020-11-28 12:29:35 -05:00
Anthony MOI	75deaecdd0	Also accept iterators of batches in train_from_iterator	2020-11-28 12:29:35 -05:00
Anthony MOI	e0a70f1fb2	Add ability to train from Iterator	2020-11-28 12:29:35 -05:00
Anthony MOI	6e364cb685	Python - Update CHANGELOG and stub files	2020-11-27 17:35:34 -05:00
Anthony MOI	a351d1c604	Python - Trainers can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	3eb7ef6d0a	Python - PreTokenizers can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	5c35fafc44	Python - Decoders can get/set their attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	091287dcf5	Python - Use macro for getter/setter in models	2020-11-27 17:35:34 -05:00
Anthony MOI	2feccdbbfa	Python - PyStrip can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	7512d5e4ce	Python - PyBertNormalizer can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	78beae8b7d	Python - PyWordLevel can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	760537aad3	Python - PyWordPiece can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	c22cfc31f9	Python - PyNormalizer & PyPreTokenizer use a RwLock	2020-11-27 17:35:34 -05:00
Anthony MOI	76d3b2128b	Python - PyBPE can get/set its attributes	2020-11-27 17:35:34 -05:00
Anthony MOI	7f3cfebf45	Python - PyModel uses a RwLock to allow modifications	2020-11-27 17:35:34 -05:00
Patrick von Platen	dd399d2ad0	Split Pre-Tokenizer (#542) * start playing around * make a first version * refactor * apply make format * add python bindings * add some python binding tests * correct pre-tokenizers * update auto-generated bindings * lint python bindings * add code node * add split to docs * refactor python binding a bit * cargo fmt * clippy and fmt in node * quick updates and fixes * Oops * Update node typings * Update changelog Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-27 17:07:03 -05:00
Anthony MOI	58e1d8de67	Python - Improve documentation for trainers	2020-11-23 11:52:51 -05:00
Anthony MOI	64441b54b1	Python - Improve documentation for post-processors	2020-11-23 11:52:51 -05:00
Anthony MOI	933a2a9c99	Python - Improve pre-tokenizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	5842b3db73	Python - Improve normalizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	c01c301743	Python - Improve documentation for decoders and remove useless kwargs	2020-11-23 11:52:51 -05:00
Anthony MOI	a50d4b7d25	Python - Improve documentation for models	2020-11-23 11:52:51 -05:00
Nick	dc60d4fc0c	Fix BaseTokenizer enable_truncation docstring	2020-11-23 11:28:26 -05:00
Anthony MOI	2fbd6779f6	Make sure TrainerWrapper can only train the right Model	2020-11-20 13:30:44 -05:00
Anthony MOI	13e07da2c8	Node - Add WordLevelTrainer	2020-11-20 13:30:44 -05:00
Anthony MOI	7fc37a03e8	Node - Trainers train the Model in-place	2020-11-20 13:30:44 -05:00
Anthony MOI	387b8a1033	Generate pyi, fix tests and clippy warnings	2020-11-20 13:30:44 -05:00
Anthony MOI	5059be1a8d	Test BPE keeping its options after training	2020-11-20 13:30:44 -05:00
Anthony MOI	284a1dbee7	PyModel uses a RwLock to allow modifications	2020-11-20 13:30:44 -05:00
Anthony MOI	54c7210b2f	Train Model in place This let us keep everything that was set on the model except from the vocabulary when trained. For example, this let us keep the configured `unk_token` of BPE when its trained.	2020-11-20 13:30:44 -05:00
Anthony MOI	224862fe0c	Python - Make the trainer optional on Tokenizer.train	2020-11-20 13:30:44 -05:00
Anthony MOI	c230183cf6	A Model can return its associated Trainer	2020-11-20 13:30:44 -05:00
Anthony MOI	059d43b265	Add WordLevel trainer	2020-11-20 13:30:44 -05:00
Anthony MOI	a745321aca	Rust - Trainer::process_tokens has a default impl	2020-11-20 13:30:44 -05:00
Anthony MOI	2a37ba3c25	Doc - Update deploy_doc to stop rebuilding existing version	2020-11-20 10:34:26 -05:00
Anthony MOI	cb471d1380	Restore latest stable in rust-toolchain	2020-11-19 16:17:34 -05:00
Anthony MOI	a1012536b6	Add a github workflow for conda release	2020-11-19 16:17:34 -05:00
LysandreJik	ef83ad3641	Conda meta & unix build script	2020-11-19 16:17:34 -05:00
Anthony MOI	58b618f98e	Python - Update __init__.pyi	2020-11-17 15:28:41 -05:00
Nicolas Patry	352c92ad33	Automatically stubbing the `pyi` files while keeping inspecting ability (#509) * First pass on automatic stubbing our python files. * And now modifying all rust docs to be visible in Pyi files. * Better assert fail message. * Fixing github workflow. * Removing types not exported anymore. * Fixing `Tokenizer` signature. * Disabling auto __init__.py. * Re-enabling some types. * Don't overwrite non automated __init__.py * Automated most __init__.py * Restubbing after rebase. * Fixing env for tests. * Install blakc in the env. * Use PY35 target in stub.py Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-17 15:13:00 -05:00
Nicolas Patry	fff856cff7	New PR to fix #270 (not #157). (#516) * New PR to fix #270 (not #157). Reduce drastically the number of required compilation flags. I think it's good enough for merge right now. We disable progress altogether when the `progressbar` flag is disabled which is perfectly fine compared to not being able to build. Future PR could include. - Better encapsulation of `progress` in training call sites (less direct calls to `indicatif` and common code for `setup_progress`, `finalize` and so on. - We can have a raw `print` Progress bar when compilation flag is disabled ? - Having better control of progressbars in bindings would require use to change a bunch of code around which might be overkill in the short term. Either we start by defining a trait for our ProgressBar, and the bindings can implement the traits with custom `tqdm` and `cli-progress` (It's not even 100% sure it's doable) - The easiest way would be to enable some sort of iterator in Rust so that calling of progressbars can happen in client code which would be the most lenient for all plateforms. The hard part is that leveraging parallelism in that setting would be hard probably. * Remove external visibility of progressbar. * Remove dead import.	2020-11-11 10:51:27 +01:00
Nicolas Patry	b122737ec6	Moving to manylinux2010 and remove nightly on Windows. (#455) * Moving to manylinux2010 and remove nightly on Windows. * Add build for manylinux2014 for powerpc and aarch64 + Python v3.9 * Also add support for IBM mainframe * try with env variables * Move extra builds to their own workflow Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-09 23:23:07 -05:00
Anthony MOI	b0d8108dcb	Doc - Update for 0.9.4	2020-11-09 16:36:04 -05:00

1409 Commits