tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-07 13:18:31 +00:00

Author	SHA1	Message	Date
Nicolas Patry	152880ab3e	Adding truncation_side within `TruncationParams`. (#860) * Add truncation to enable_truncation * Fix typo * Adding truncation_side within `TruncationParams`. * Node serialization of this direction param. * Update the test. * Fixing warnings/lint. * Adding stuff (can't local debug :( ) * Slow loop... ;( * Stub.py. Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>	2021-12-28 12:37:06 +01:00
Luc Georges	04368b1998	Truncate Right (#841) * feat(tokenizers): add truncate test case * !feat(tokenizer): truncate right * refacto(tokenizers): clippy * feat(bindings): update bindings for truncate() * fix(tokenizers): remove unsafe code * refacto(tokenizers): truncate direction * truncate direction enum * compute parts ranges beforehand * 2n space because encoding is dropped at the end of procedure * update bindings * add pip install in python bindings' make test * fix(node): clippy asks to use unwrap_or_else * fix(node): lint * refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)> * refacto(bindings): add match syntax * refacto(tokenizers): use mem::replace instead of mem::swap * refacto(tokenizers): assign value the normal way	2021-12-23 13:34:21 +01:00
Anthony MOI	b0ee27847f	Python - Prepare for release 0.11.0 (#799)	2021-09-08 03:15:47 -04:00
Anthony MOI	b8b584d4e5	Python - Pretty json saving defaults to true (#793) * Python - Pretty json saving defaults to true * Update changelog	2021-09-02 08:43:54 -04:00
Anthony Moi	e44fdee4a1	Python - Add bindings to Tokenizer.from_pretrained	2021-08-31 09:00:05 -04:00
Vlad Artamonov	e2bf8daa3a	Add SplitDelimiterBehavior to Punctuation constructor (#657) Resolves: #642	2021-08-13 09:19:23 -04:00
kingyiusuen	c1100dcbe3	Fix typo in documentation (#743) * Doc - Fix typo (And instance of -> An instance of) * Add missing text_signature for WordLevel.from_file Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>	2021-08-13 08:08:23 -04:00
Sylvain Gugger	6616e699f7	Expand documentation of UnigramTrainer (#770) * Expand documentation of UnigramTrainer * Put doc at the source * Add signature * make style Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>	2021-08-12 10:12:26 -04:00
SaulLu	da4c7b10e4	Add a way to specify the unknown token in `SentencePieceUnigramTokenizer` python implem (#762) * add a way to specify the unknown token in `SentencePieceUnigramTokenizer` * add test that verify that an exception is raised for the missing unknown token * style * add test tokens	2021-08-12 09:42:44 -04:00
Anthony MOI	3a002c1aa8	Python - prepare for release 0.10.3	2021-05-24 16:59:10 -04:00
Nicolas Patry	2e2e7558f7	Add CTC Decoder for Wave2Vec models (#693) * Rust - add a CTCDecoder as a seperate mod * Adding bindings to Node + Python. * Clippy update. * Stub. * Fixing roberta.json URLs. * Moving test files to hf.co. * Update cargo check and clippy to 1.52. * Inner ':' actually is used for domains in sphinx. Making `domain` work correctly was just too much work so I went the easy way and have global roles for the custom rust extension. * Update struct naming and docs * Update changelog Co-authored-by: Thomaub <github.thomaub@gmail.com> Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2021-05-20 09:30:09 -04:00
Anthony MOI	32b3b7a0f2	Python - Prepare for release 0.10.2	2021-04-05 16:47:55 -04:00
Anthony MOI	bc8bbf637a	Prepare for python v0.10.1 (#625)	2021-02-08 11:45:56 -05:00
Anthony MOI	d96442cbe8	Python - Prepare for release 0.10.1rc1 (#622)	2021-02-04 10:37:00 -05:00
Anthony MOI	96b9972842	Fix SentencePiece tokenizers conversion	2021-02-03 12:44:46 -05:00
Anthony MOI	719bea76b9	Python - Prepare for release 0.10.0	2021-01-12 16:34:04 -05:00
devfon	b9c6bea75e	Add `fuse_unk` option to SentencePieceBPETokenizer (#574) * Add fuse_unk option to SentencePieceBPETokenizer * Fix style Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2021-01-12 16:07:59 -05:00
Anthony MOI	cca5d43038	Python - Fix breaking change in Model.save	2021-01-11 16:09:19 -05:00
Anthony MOI	49d11b1f69	Python - Add components getter/setters to BaseTokenizer	2021-01-11 16:08:38 -05:00
Anthony MOI	d94fa220b6	Python - Add train_from_iterator to implementations	2021-01-07 09:02:20 -05:00
Anthony MOI	0c6cc39eee	Python - Update CHANGELOG and bump for release	2020-12-08 13:29:35 -05:00
Tal Perry	8916b6bb27	Add a visualization utility to render tokens and annotations in a notebook (#508) * Draft functionality of visualization * Added comments to make code more intelligble * polish the styles * Ensure colors are stable and comment the css * Code clean up * Made visualizer importable and added some docs * Fix styling * implement comments from PR * Fixed the regex for UNK tokens and examples in notebook * Converted docs to google format * Added a notebook showing multiple languages and tokenizers * Added visual indication of chars that are tokenized with >1 token * Reorganize things a bit and fix import * Update docs Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-12-04 10:25:56 -05:00
Anthony MOI	3a8627ce4d	Improve docs and fix tests around training	2020-11-28 12:29:35 -05:00
Anthony MOI	6e364cb685	Python - Update CHANGELOG and stub files	2020-11-27 17:35:34 -05:00
Patrick von Platen	dd399d2ad0	Split Pre-Tokenizer (#542) * start playing around * make a first version * refactor * apply make format * add python bindings * add some python binding tests * correct pre-tokenizers * update auto-generated bindings * lint python bindings * add code node * add split to docs * refactor python binding a bit * cargo fmt * clippy and fmt in node * quick updates and fixes * Oops * Update node typings * Update changelog Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-27 17:07:03 -05:00
Anthony MOI	58e1d8de67	Python - Improve documentation for trainers	2020-11-23 11:52:51 -05:00
Anthony MOI	64441b54b1	Python - Improve documentation for post-processors	2020-11-23 11:52:51 -05:00
Anthony MOI	933a2a9c99	Python - Improve pre-tokenizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	5842b3db73	Python - Improve normalizers docs	2020-11-23 11:52:51 -05:00
Anthony MOI	c01c301743	Python - Improve documentation for decoders and remove useless kwargs	2020-11-23 11:52:51 -05:00
Anthony MOI	a50d4b7d25	Python - Improve documentation for models	2020-11-23 11:52:51 -05:00
Nick	dc60d4fc0c	Fix BaseTokenizer enable_truncation docstring	2020-11-23 11:28:26 -05:00
Anthony MOI	387b8a1033	Generate pyi, fix tests and clippy warnings	2020-11-20 13:30:44 -05:00
Anthony MOI	224862fe0c	Python - Make the trainer optional on Tokenizer.train	2020-11-20 13:30:44 -05:00
Anthony MOI	059d43b265	Add WordLevel trainer	2020-11-20 13:30:44 -05:00
Anthony MOI	58b618f98e	Python - Update __init__.pyi	2020-11-17 15:28:41 -05:00
Nicolas Patry	352c92ad33	Automatically stubbing the `pyi` files while keeping inspecting ability (#509) * First pass on automatic stubbing our python files. * And now modifying all rust docs to be visible in Pyi files. * Better assert fail message. * Fixing github workflow. * Removing types not exported anymore. * Fixing `Tokenizer` signature. * Disabling auto __init__.py. * Re-enabling some types. * Don't overwrite non automated __init__.py * Automated most __init__.py * Restubbing after rebase. * Fixing env for tests. * Install blakc in the env. * Use PY35 target in stub.py Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>	2020-11-17 15:13:00 -05:00
Anthony MOI	75b41dab0f	Python - Update CHANGELOG and bump version for 0.9.4	2020-11-09 16:36:04 -05:00
Anthony MOI	57d162b269	Add an Encoding.sequences to allow masking	2020-11-06 10:41:56 -05:00
Anthony MOI	385d25720a	Simplify the API for Encoding.token_to_XXX	2020-11-06 10:41:56 -05:00
Anthony MOI	dce218ca28	Python - Encoding mappings handle sequence_id	2020-11-06 10:41:56 -05:00
Mohamed Al Salti	20c7045ba1	Update sentencepiece_unigram.py Update the URL to `sentencepiece_model_pb2.py` in the error message.	2020-11-04 08:08:11 -05:00
Anthony MOI	d2fc0e4836	Doc - Update API Reference for Encoding	2020-11-02 17:07:27 -05:00
Anthony MOI	a86d49634c	Doc - API Reference for most Tokenizer methods/attributes	2020-11-02 17:07:27 -05:00
Anthony MOI	79f02bb7f0	Doc - Updated API Reference for encode/encode_batch	2020-11-02 17:07:27 -05:00
taufique74	4929809af0	makes from_file() method static	2020-11-01 13:15:15 -05:00
Anthony MOI	2364d376f7	Python - Update CHANGELOG and bump to 0.9.3 for release	2020-10-26 16:40:24 -04:00
Anthony MOI	1a6f4b5204	Allow initial_alphabet on UnigramTrainer	2020-10-26 10:57:29 -04:00
Timur Ganiev	f7c61c267a	Fixed `BPE.read_files` -> `BPE.read_file` in SentencePieceBPETokenizer	2020-10-26 10:57:14 -04:00
Anthony MOI	a2289d49b4	Finish exposing the UnicodeScripts PreTokenizer	2020-10-21 11:01:54 -04:00

95 Commits