tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-07 21:28:19 +00:00

Author	SHA1	Message	Date
Anthony MOI	f263d7651f	Python - RustFmt	2020-02-18 15:07:34 -05:00
Anthony MOI	8e9fae6be4	Python - Add `check-style` to Makefile	2020-02-18 11:11:07 -05:00
Anthony MOI	81be207819	Python - Black auto formatting	2020-02-18 10:45:36 -05:00
Anthony MOI	4706151c32	Python - Add Makefile with Black formatting	2020-02-18 10:45:10 -05:00
Anthony MOI	1509f747af	Python - Uniformize implementations parameters	2020-02-18 10:27:10 -05:00
MOI Anthony	3512bd3400	Merge pull request #149 from colinclement/master Allow dropout option in ByteLevelBPETokenizer	2020-02-18 09:59:40 -05:00
Morgan Funtowicz	891dd4adb8	Fix invalid num_added_tokens method call in BaseTokenizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-17 15:32:34 +01:00
Funtowicz Morgan	bb8321ac0d	Add Strip normalizer (#140) * WIP strip. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Rust StripNormalizer Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Allow to specify strip direction Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Renamed StripNormalizer to Strip Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added Python binding. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Makes Strip python compatible with pythonic constructor. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Run RustFmt Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Clippy next ofc. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Move lstrip and rstrip on NormalizedString Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * implement strip() for normalizer + unittests. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Add some more unittests on edge cases. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * clippy and fmt. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Simplify strip and fix offsets * Python - Update strip bindings with default values Co-authored-by: MOI Anthony <xn1t0x@gmail.com>	2020-02-17 11:26:40 +01:00
Colin Clement	e591cfce7b	pass through dropout option in ByteLevelBPETokenizer	2020-02-15 01:58:55 +00:00
MOI Anthony	3cac26cdb2	Merge pull request #147 from huggingface/wordpiece-cleanup Wordpiece Decoder cleanup	2020-02-14 13:12:15 -05:00
Funtowicz Morgan	c4bac6aeeb	Expose num_added_tokens on Python side (#146) * Expose num_added_tokens on Python side without the need to pass an Encoding to added_tokens. This allows computing the max sentence length for single/pair inputs without actually needing an Encoding structure. As the number of added tokens is fixed and static during compilation it allows more flexible usage of the method. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Renamed num_added_tokens to num_special_tokens_to_add. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-14 10:55:20 +00:00
Anthony MOI	1907b74d1c	Python - Bindings for Wordpiece decoder's cleanup	2020-02-13 17:50:37 -05:00
Anthony MOI	5bd93ee822	Python - hotfix BertWordPieceTokenizer decoder	2020-02-13 16:31:00 -05:00
Anthony MOI	bbbd97c7e1	Python - Bump version for release	2020-02-11 08:15:11 -05:00
Anthony MOI	08ce105195	Python - Hotfix WordPieceTrainer constructor	2020-02-11 08:13:57 -05:00
Anthony MOI	c1ddfdac8c	Python - bump version for release	2020-02-10 23:23:27 -05:00
Anthony MOI	3c0164ef75	Python - Bump version for release	2020-02-10 16:07:32 -05:00
Anthony MOI	43a989775e	Python - Improve typings	2020-02-10 13:53:07 -05:00
Anthony MOI	dd9270a406	Python - Fix example.py for GPT-2 cc @mfuntowicz `from_pretrained` takes only one argument. Do you know if we can make this compatible otherwise?	2020-02-10 13:51:03 -05:00
Anthony MOI	8585b761d1	Python - More updates to the new API	2020-02-10 11:57:30 -05:00
Anthony MOI	505c428f72	Python - Update example.py with new API	2020-02-10 11:55:14 -05:00
Bjarte Johansen	6a4976ddd6	Implement __new__ for PreTokenizers __new__ allows PreTokenizers to be instantiated through the python constructor.	2020-02-10 10:43:53 +01:00
Bjarte Johansen	f32e0c09fc	Implement __new__ for PostProcessors Allows PostProcessors to be instantiated through python class constructor.	2020-02-10 10:43:53 +01:00
Bjarte Johansen	03508826cb	Implement __new__ on Decoders Allow decoders to be initialized from python using the class constructor.	2020-02-10 10:43:53 +01:00
Bjarte Johansen	4971e9608d	Implement __new__ on Trainers __new__ allows Trainers to be initialized in the normal python fashion.	2020-02-10 10:43:29 +01:00
Bjarte Johansen	0e5d81b400	Implement __new__ on Normalizers __new__ allows Normalizers to be initialized as normal python objects. This also means that Normalizers are given the correct class name.	2020-02-10 10:43:19 +01:00
Pierric Cistac	3adf199a0c	fix `pad` calls	2020-02-05 14:49:47 -05:00
Anthony MOI	9745786b89	Bump versions for release	2020-02-05 13:55:51 -05:00
Anthony MOI	89f6db28f0	update cargo.lock for indicatif	2020-02-05 13:38:12 -05:00
Anthony MOI	8decd020cb	Python - Provide mapping to original offsets As requested on #81	2020-02-05 13:33:19 -05:00
Anthony MOI	42c4691e4d	Python - Update Bert default special tokens Closes #106	2020-02-05 12:55:01 -05:00
MOI Anthony	a1284f6220	Merge pull request #128 from huitseeker/warts Maintenance : simplifications & update	2020-02-05 12:28:22 -05:00
Funtowicz Morgan	8200112e9b	Introduce WordLevel model for TransformerXL (#125) * Added lookup table model mapping string to id present in a vocab map. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * RustFmt Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Formatting. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix invalid void return on Rust side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Python binding for LookupTable model Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Enable loading from Python's side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Renamed LookupTable to WordLevel Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * RustFmt happy now. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * clippy happy now. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Addressing mismatching names. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Addressing mismatching names (one missing). Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-05 16:51:35 +00:00
François Garillot	42bc3cb21f	Simplify a few Option / Result pattern-matches	2020-02-05 07:11:47 -08:00
Funtowicz Morgan	6165910ca6	Char based delimiter splitting - TransfoXL (#114) * WIP delimiter splitter Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Bind on Python side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Add missing delimiter parameter in CharDelimiterSplit constructor. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Attempt to provide CharDelimiterSplit for node. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Apply Rust formatting. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * fix bindings node Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>	2020-02-04 16:23:00 +00:00
Anthony MOI	53637d4d88	Python - Also add missing special tokens for SentencePiece	2020-02-03 12:52:39 -05:00
Anthony MOI	9e0b971f20	Python - Add missing special tokens in implementations classes	2020-02-03 12:49:40 -05:00
MOI Anthony	a48b337d7b	Merge pull request #99 from kdexd/get-vocab-size Expose get_vocab_size in tokenizer python API.	2020-02-03 11:52:29 -05:00
Anthony MOI	b90104e705	Update Python bindings	2020-02-03 11:38:52 -05:00
Funtowicz Morgan	e365c1992b	Improve flexibility in some Python binding (#107) * Fix invalid method bindings on Python side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Introduce factory function to create normalizer instance from the name of a unicode normalizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Rename BPETokenizer to CharBPETokenizer for clarity Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Give more flexibility in the way CharBPETokenizer handles normalizers creation. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Change .pyi file to reflect Normalizer hierarchy Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Make ByteLevelBPE as flexible for normalization as CharBPE. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-03 10:41:33 +00:00
Funtowicz Morgan	6524f09e99	Roberta PostProcessor (#111) * Added RobertaProcessor on Rust side. Required to match the double separator token in the middle of pairs. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix typo in RobertaProcessing method declaration Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Correctly include RobertProcessor in the Python binding Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Roberta doesn't use token_type_ids so let's set everything to 0 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Attempt to make it work on Node side too. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * fix js bindings / `npm run lint` * Make RustFmt happy. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>	2020-02-03 10:39:48 +00:00
Karan Desai	b027c63c37	Expose get_vocab_size in tokenizer python API.	2020-02-03 00:41:05 -05:00
Pierric Cistac	05275a9391	python: fix inverted normalized/original string range	2020-01-31 11:09:55 -05:00
Pierric Cistac	880cd7199b	python: align `Cargo.lock` package version	2020-01-28 16:44:48 -05:00
Anthony MOI	0105021280	Bump version for Python	2020-01-22 16:07:03 -05:00
MOI Anthony	327de00d71	Merge pull request #95 from huggingface/vocab-serialization save BPE vocab in order of ID	2020-01-22 15:49:48 -05:00
epwalsh	3a9badd2e0	save vocab in order of ID	2020-01-21 13:32:13 -08:00
Morgan Funtowicz	0b782e4507	Removed invalid class-level variable declaration. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-21 15:10:47 -05:00
Anthony MOI	da7e629e4a	Bump Python version for release	2020-01-20 09:14:46 -05:00
Anthony MOI	395f605fd2	Use WhitespaceSplit for BPETokenizer	2020-01-17 18:33:29 -05:00

216 Commits