tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-07 13:18:31 +00:00

Author	SHA1	Message	Date
Bjarte Johansen	f32e0c09fc	Implement __new__ for PostProcessors Allows PostProcessors to be instantiated through the Python class constructor.	2020-02-10 10:43:53 +01:00
Bjarte Johansen	03508826cb	Implement __new__ on Decoders Allow decoders to be initialized from python using the class constructor.	2020-02-10 10:43:53 +01:00
Bjarte Johansen	4971e9608d	Implement __new__ on Trainers __new__ allows Trainers to be initialized in the normal python fashion.	2020-02-10 10:43:29 +01:00
Bjarte Johansen	0e5d81b400	Implement __new__ on Normalizers __new__ allows Normalizers to be initialized as normal python objects. This also means that Normalizers are given the correct class name.	2020-02-10 10:43:19 +01:00
Pierric Cistac	3adf199a0c	fix `pad` calls	2020-02-05 14:49:47 -05:00
Anthony MOI	9745786b89	Bump versions for release	2020-02-05 13:55:51 -05:00
Anthony MOI	89f6db28f0	update cargo.lock for indicatif	2020-02-05 13:38:12 -05:00
Anthony MOI	8decd020cb	Python - Provide mapping to original offsets As requested on #81	2020-02-05 13:33:19 -05:00
Anthony MOI	42c4691e4d	Python - Update Bert default special tokens Closes #106	2020-02-05 12:55:01 -05:00
MOI Anthony	a1284f6220	Merge pull request #128 from huitseeker/warts Maintenance : simplifications & update	2020-02-05 12:28:22 -05:00
Funtowicz Morgan	8200112e9b	Introduce WordLevel model for TransformerXL (#125) * Added lookup table model mapping string to id present in a vocab map. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * RustFmt Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Formatting. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix invalid void return on Rust side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Python binding for LookupTable model Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Enable loading from Python's side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Renamed LookupTable to WordLevel Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * RustFmt happy now. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * clippy happy now. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Addressing mismatching names. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Addressing mismatching names (one missing). Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-05 16:51:35 +00:00
François Garillot	42bc3cb21f	Simplify a few Option / Result pattern-matches	2020-02-05 07:11:47 -08:00
Funtowicz Morgan	6165910ca6	Char based delimiter splitting - TransfoXL (#114) * WIP delimiter splitter Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Bind on Python side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Add missing delimiter parameter in CharDelimiterSplit constructor. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Attempt to provide CharDelimiterSplit for node. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Apply Rust formatting. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * fix bindings node Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>	2020-02-04 16:23:00 +00:00
Anthony MOI	53637d4d88	Python - Also add missing special tokens for SentencePiece	2020-02-03 12:52:39 -05:00
Anthony MOI	9e0b971f20	Python - Add missing special tokens in implementations classes	2020-02-03 12:49:40 -05:00
MOI Anthony	a48b337d7b	Merge pull request #99 from kdexd/get-vocab-size Expose get_vocab_size in tokenizer python API.	2020-02-03 11:52:29 -05:00
Anthony MOI	b90104e705	Update Python bindings	2020-02-03 11:38:52 -05:00
Funtowicz Morgan	e365c1992b	Improve flexibility in some Python bindings (#107) * Fix invalid method bindings on Python side. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Introduce factory function to create normalizer instance from the name of a unicode normalizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Rename BPETokenizer to CharBPETokenizer for clarity Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Give more flexibility in the way CharBPETokenizer handles normalizers creation. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Change .pyi file to reflect Normalizer hierarchy Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Make ByteLevelBPE as flexible for normalization as CharBPE. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-03 10:41:33 +00:00
Funtowicz Morgan	6524f09e99	Roberta PostProcessor (#111) * Added RobertaProcessor on Rust side. Required to match the double separator token in the middle of pairs. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix typo in RobertaProcessing method declaration Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Correctly include RobertaProcessor in the Python binding Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Roberta doesn't use token_type_ids so let's set everything to 0 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Attempt to make it work on Node side too. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * fix js bindings / `npm run lint` * Make RustFmt happy. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>	2020-02-03 10:39:48 +00:00
Karan Desai	b027c63c37	Expose get_vocab_size in tokenizer python API.	2020-02-03 00:41:05 -05:00
Pierric Cistac	05275a9391	python: fix inverted normalized/original string range	2020-01-31 11:09:55 -05:00
Pierric Cistac	880cd7199b	python: align `Cargo.lock` package version	2020-01-28 16:44:48 -05:00
Anthony MOI	0105021280	Bump version for Python	2020-01-22 16:07:03 -05:00
MOI Anthony	327de00d71	Merge pull request #95 from huggingface/vocab-serialization save BPE vocab in order of ID	2020-01-22 15:49:48 -05:00
epwalsh	3a9badd2e0	save vocab in order of ID	2020-01-21 13:32:13 -08:00
Morgan Funtowicz	0b782e4507	Removed invalid class-level variable declaration. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-21 15:10:47 -05:00
Anthony MOI	da7e629e4a	Bump Python version for release	2020-01-20 09:14:46 -05:00
Anthony MOI	395f605fd2	Use WhitespaceSplit for BPETokenizer	2020-01-17 18:33:29 -05:00
Anthony MOI	9c408011ae	Python - Bindings for WhitespaceSplit	2020-01-17 18:15:14 -05:00
Ivan Echevarria	e82722a9c2	Fix typo in Python binding README Trailing paren causes an error	2020-01-16 17:10:48 -08:00
MOI Anthony	457e6c9932	Merge pull request #71 from huggingface/python_example_fix Use the same vocabs in python's example.py	2020-01-15 10:07:34 -05:00
Morgan Funtowicz	374f944e32	Use the same vocabs/merges for Python and Rust comparison. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-15 11:57:34 +01:00
Morgan Funtowicz	4839154145	Remove kwargs mapping on Tokenizer decode/decode_batch as there is only one possible arg. This is suggested by the current issue https://github.com/huggingface/tokenizers/issues/54#issuecomment-574104841. kwargs cannot be passed as positional arguments, they have to be named; replacing kwargs with the actual skip_special_tokens allows both (named and positional) syntax. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-15 11:16:01 +01:00
Morgan Funtowicz	894f887444	Updated train_bert_wordpiece.py as well. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-14 13:32:02 +01:00
Morgan Funtowicz	7caf9fd823	Updated train_bytelevel_bpe.py to use the high level Python API. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-14 12:00:50 +01:00
Anthony MOI	fc9e81d4ab	Fix split on special tokens & bump version	2020-01-12 02:35:45 -05:00
Anthony MOI	dd569020c1	Bump python version for release	2020-01-10 13:49:26 -05:00
Anthony MOI	89e0d90c8a	Python - Final fix of the typings	2020-01-10 13:30:35 -05:00
Pierric Cistac	56878a8e43	fix :	2020-01-10 13:30:35 -05:00
Pierric Cistac	958883af74	fix imports in root __init__.pyi	2020-01-10 13:30:35 -05:00
MOI Anthony	b491c0b8c4	Update Python Readme	2020-01-10 12:18:16 -05:00
Anthony MOI	b27737d97c	Python - Typings update	2020-01-10 10:06:24 -05:00
thomwolf	d8f3fba245	fix training and wordpiece	2020-01-10 10:47:50 +01:00
thomwolf	1a802cb484	fix typos	2020-01-10 10:47:36 +01:00
Anthony MOI	d46ea842c2	Python - IndexableString accepts tuples directly	2020-01-10 00:32:30 -05:00
Morgan Funtowicz	be10f542ce	Added SentencePiece and YouTokenToMe model extractors. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-01-08 22:55:00 +01:00
Anthony MOI	3af2a43cae	Hotfix Python bindings	2020-01-08 16:20:05 -05:00
Anthony MOI	ef21c9a7b0	Hotfix for new Builder cc @epwalsh	2020-01-08 16:19:51 -05:00
Anthony MOI	c7d2800131	Python - Add model saving to base tokenizer	2020-01-08 14:44:17 -05:00
Anthony MOI	bbe31f9237	Quick README update	2020-01-08 14:07:48 -05:00

194 Commits