- We are failing on ambiguous tokenizations (e.g. AAA -> A + AA vs AA + A).
This is likely linked to float precision, and may be hard or impossible to fix
(it should not hinder model performance).
- We now fuse unknown tokens (fuse_unk) by default, as spm_train does
(see the fusing sketch after this list).
- We are still failing on at least space deduplication. This should probably
be handled by a pre-tokenizer (see the composition sketch after this list):
- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pretokenizers iteratively
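For context, fuse_unk collapses a run of consecutive unknown tokens into a
single unk token, which is what spm_train does. A minimal standalone sketch of
the fusing step (illustrative code, not the actual tokenizers internals; the
ids are made up):

```rust
/// Collapse runs of consecutive unknown-token ids into a single unk id,
/// mirroring the fuse_unk behavior described above. `unk_id` is illustrative.
fn fuse_unk(ids: &[u32], unk_id: u32) -> Vec<u32> {
    let mut fused = Vec::with_capacity(ids.len());
    for &id in ids {
        // Drop an unk that directly follows another unk.
        if id == unk_id && fused.last() == Some(&unk_id) {
            continue;
        }
        fused.push(id);
    }
    fused
}

fn main() {
    // Two adjacent unknown pieces (id 0) become one; isolated unks are kept.
    assert_eq!(fuse_unk(&[5, 0, 0, 7, 0], 0), vec![5, 0, 7, 0]);
}
```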
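And a minimal sketch of how such pre-tokenizers can compose (illustrative
trait and string-to-string interface, not the actual tokenizers API, which
tracks splits and offsets; Punctuation is omitted for brevity):

```rust
// Illustrative trait, not the actual tokenizers PreTokenizer trait.
trait PreTokenizer {
    fn pre_tokenize(&self, s: String) -> String;
}

/// Deduplication: removes duplicate spaces within strings.
struct Deduplication;
impl PreTokenizer for Deduplication {
    fn pre_tokenize(&self, s: String) -> String {
        let mut out = String::with_capacity(s.len());
        let mut prev_space = false;
        for c in s.chars() {
            if c == ' ' && prev_space {
                continue; // skip the duplicate space
            }
            prev_space = c == ' ';
            out.push(c);
        }
        out
    }
}

/// Sequence: applies a list of pre-tokenizers iteratively.
struct Sequence(Vec<Box<dyn PreTokenizer>>);
impl PreTokenizer for Sequence {
    fn pre_tokenize(&self, s: String) -> String {
        self.0.iter().fold(s, |acc, p| p.pre_tokenize(acc))
    }
}

fn main() {
    let seq = Sequence(vec![Box::new(Deduplication)]);
    assert_eq!(seq.pre_tokenize("a  b   c".into()), "a b c");
}
```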
* Derive Clone on Tokenizer and AddedVocabulary.
* Replace Container with Arc wrapper for Decoders.
* Prefix Rust Decoder types with Py.
* Rename PyDecoder to CustomDecoder.
* Change panic in serializing custom decoder to exception.
* Re-enable training with cloneable Tokenizer.
* Remove unsound Container, use Arc wrappers instead (see the sketch below).
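The old Container held raw pointers into Python-owned objects and was unsound;
an Arc shares ownership safely, and cloning it only bumps a reference count,
so Clone can be derived all the way up to Tokenizer, which training needs. A
minimal sketch of the pattern (illustrative types, not the actual binding
code):

```rust
use std::sync::Arc;

// Illustrative stand-in for the Decoder trait in tokenizers.
trait Decoder: Send + Sync {
    fn decode(&self, tokens: Vec<String>) -> String;
}

// The Py-prefixed wrapper exposed to Python holds the decoder behind an Arc
// instead of the old Container, so the wrapper (and anything containing it)
// can simply derive Clone.
#[derive(Clone)]
struct PyDecoder {
    decoder: Arc<dyn Decoder>,
}

#[derive(Clone)]
struct Tokenizer {
    decoder: Option<PyDecoder>,
}

struct Join;
impl Decoder for Join {
    fn decode(&self, tokens: Vec<String>) -> String {
        tokens.join(" ")
    }
}

fn main() {
    let t = Tokenizer {
        decoder: Some(PyDecoder { decoder: Arc::new(Join) }),
    };
    // Cloning is cheap and safe: both tokenizers share one decoder.
    let t2 = t.clone();
    let tokens = vec!["hello".to_string(), "world".to_string()];
    assert_eq!(t2.decoder.unwrap().decoder.decode(tokens), "hello world");
}
```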
* Prefix the Python types in Rust with Py; rename PyPretokenizer
to CustomPretokenizer.
* Remove unsound Container wrappers, replace with Arc.
* Change panic on trying to (de-)serialize a custom pretokenizer to an
exception (see the sketch below).
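A minimal sketch of the panic-to-exception change (assuming serde; the type
name is illustrative). Returning a serde error instead of panicking lets the
Python bindings surface it as a catchable exception:

```rust
use serde::ser::{Error as _, Serialize, Serializer};

// Stand-in for a user-defined Python pretokenizer wrapped on the Rust side.
struct CustomPretokenizer;

impl Serialize for CustomPretokenizer {
    fn serialize<S: Serializer>(&self, _serializer: S) -> Result<S::Ok, S::Error> {
        // Before: panic!("Custom PreTokenizer cannot be serialized")
        // After: a serde error, which reaches Python as an exception.
        Err(S::Error::custom("Custom PreTokenizer cannot be serialized"))
    }
}
```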
* Implement the changes required by making Model generic in Tokenizer
(see the sketch below).
* Temporarily disable training in Python since Clone can't be
derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
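The constraint is mechanical: with Tokenizer generic over Model, the derived
Clone only exists when the Model itself is Clone, and training needs to clone
the tokenizer. An illustrative sketch (not the actual tokenizers types):

```rust
// Illustrative trait and types, not the actual tokenizers API.
trait Model {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

// The derive expands to roughly `impl<M: Model + Clone> Clone for ...`,
// so Tokenizer<M> is Clone only when M is.
#[derive(Clone)]
struct Tokenizer<M: Model> {
    model: M,
}

// Training clones the tokenizer, hence the `M: Clone` bound; until every
// component is Clone, this path stays disabled.
fn train<M: Model + Clone>(tokenizer: &Tokenizer<M>) -> Tokenizer<M> {
    tokenizer.clone()
}

#[derive(Clone)]
struct Whitespace;
impl Model for Whitespace {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(String::from).collect()
    }
}

fn main() {
    let t = Tokenizer { model: Whitespace };
    let trained = train(&t); // compiles only because Whitespace is Clone
    assert_eq!(trained.model.tokenize("a b"), vec!["a", "b"]);
}
```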
This allows testing versions that are not built in place; otherwise,
importing (or testing) from the package root fails without a develop
build.
Replace maturin with setuptools_rust, since maturin fails with the proper
project structure.
If a vocab file isn't provided, the supplied unk token (one different from
[UNK]) is ignored, and trying to encode an input string containing an unknown
token later throws:

Exception: WordPiece error: Missing [UNK] token from the vocabulary

(A hypothetical reduction of the behavior is sketched below.)
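A hypothetical reduction of the reported behavior (illustrative builder and
names, not the actual tokenizers code): when no vocab is given, the builder
falls back to a default model and drops the user-supplied unk token, so
encoding later fails with the [UNK] error above.

```rust
use std::collections::HashMap;

// Illustrative model and builder, mimicking the reported bug.
struct WordPiece {
    vocab: HashMap<String, u32>,
    unk_token: String,
}

struct WordPieceBuilder {
    vocab: Option<HashMap<String, u32>>,
    unk_token: String,
}

impl WordPieceBuilder {
    fn build(self) -> WordPiece {
        match self.vocab {
            Some(vocab) => WordPiece { vocab, unk_token: self.unk_token },
            // Bug: the default model is returned wholesale, discarding the
            // supplied unk token in favor of the "[UNK]" default.
            None => WordPiece { vocab: HashMap::new(), unk_token: "[UNK]".into() },
        }
    }
}

fn main() {
    let model = WordPieceBuilder { vocab: None, unk_token: "<unk>".into() }.build();
    // The custom token was ignored, and "[UNK]" is not in the (empty) vocab,
    // so encoding an unknown token later reports:
    //   WordPiece error: Missing [UNK] token from the vocabulary
    assert_eq!(model.unk_token, "[UNK]");
    assert!(!model.vocab.contains_key(&model.unk_token));
}
```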