tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-06 12:48:18 +00:00

Author	SHA1	Message	Date
Anthony MOI	337fe72b13	Python - Bindings for TemplateProcessing	2020-09-10 15:04:19 -04:00
Nicolas Patry	7b2caca764	Adding a new pre_tokenizer: Digits. Easier to split on digits: Digits(individual_digits=False) -> 'Call 123 please' becomes 'Call ', '123', 'please' Digits(individual_digits=True) -> 'Call 123 please' becomes 'Call ', '1', '2', '3', 'please'	2020-09-03 21:03:45 +02:00
Anthony MOI	b8f1eb48cb	Python - Bump version for 0.9.0.dev1 release	2020-09-02 22:31:01 -04:00
Nicolas Patry	558e76f18e	Expose the trainer to Python bindings.	2020-09-02 13:38:14 -04:00
Nicolas Patry	c0798acacf	Address @n1t0 comments.	2020-09-02 16:32:50 +02:00
Nicolas Patry	95e126cd82	Missed *.pyi file.	2020-09-02 16:32:50 +02:00
Nicolas Patry	dd91739ba0	Now spm_parity_check succeeds because we have the correct pre_tokenizer.	2020-09-02 16:32:50 +02:00
Nicolas Patry	e974cfb1c9	Formatting after rebase.	2020-09-02 16:32:50 +02:00
Nicolas Patry	439305eea0	Failing test for compatibility for SentencePieceUnigramTokenizer. - We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A). Could be linked to float precision and hard or impossible to fix (should not hinder model performance) - We are now fusing_unk by default as it's the case with spm_train - We are still failing on at least space deduplication. Probably should be handled by a pre-tokenizer.	2020-09-02 16:32:50 +02:00
Nicolas Patry	76b86f6901	Removing forgotten places.	2020-08-31 14:05:39 -04:00
Nicolas Patry	857948e5b8	Addressing comments: - Remove Deduplication in favor of WhitespaceSplit. - Updated comments	2020-08-31 14:05:39 -04:00
Nicolas Patry	6887c0f04d	Black pass.	2020-08-31 14:05:39 -04:00
Nicolas Patry	7ed7f0f26a	Adding 3 new PreTokenizers: - Deduplication: Removes duplicate spaces within strings - Punctuation: Splits punctuation characters as isolated tokens - Sequence: Applies a list of pretokenizers iteratively	2020-08-31 14:05:39 -04:00
Anthony MOI	c036cd4ced	Python - Bump version for 0.9.0.dev0 release	2020-08-21 18:52:29 -04:00
Anthony MOI	504d8c85d8	Remove Tokenizer::normalize This is actually a legacy function that doesn't really make sense now, and is getting really difficult to keep. So we remove it.	2020-08-19 12:42:12 -04:00
Anthony MOI	7833965dc4	Update Python bindings with new interface	2020-08-03 16:18:59 -04:00
Sebastian Pütz	0d7c232f95	Move Python source to subdirectory. This allows testing versions not built in-place. Otherwise importing (or testing) in the package root fails without develop builds. Replace maturin with setuptools_rust since maturin fails with proper project structure.	2020-07-25 23:40:47 +02:00

17 Commits
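
The Digits commit (7b2caca764) describes splitting text on digit boundaries, with an individual_digits flag that controls whether a run of digits stays together. The following is a minimal standalone sketch of that behavior using Python's re module. It is not the library's implementation, and the function name split_digits is hypothetical. Because the sketch only splits and never drops characters, the last piece keeps its leading space (' please'), whereas the commit message lists it as 'please'.

```python
import re


def split_digits(text, individual_digits=False):
    """Split text into digit and non-digit runs.

    Illustrative stand-in for the behavior described in the commit
    message; this is NOT the tokenizers library code.
    """
    # \d+ keeps a run of digits together; \d isolates each digit.
    digit_pattern = r"\d" if individual_digits else r"\d+"
    # \D+ captures every run of non-digit characters, whitespace included.
    return re.findall(rf"{digit_pattern}|\D+", text)


print(split_digits("Call 123 please"))
print(split_digits("Call 123 please", individual_digits=True))
```

With individual_digits=False the digits '123' come back as one piece; with individual_digits=True they come back as '1', '2', '3'. In both cases the surrounding text is returned unchanged around them.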