tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-05 20:28:22 +00:00

Author	SHA1	Message	Date
Anthony MOI	337fe72b13	Python - Bindings for TemplateProcessing	2020-09-10 15:04:19 -04:00
Nicolas Patry	df827d538f	Adding clippy as a linter within the Python binding. (#388) * Adding clippy as a linter within the Python binding. * Missing clippy (dropped commit ??)	2020-09-04 09:09:02 -04:00
Nicolas Patry	efa20202dc	Addressing @n1t0's comments.	2020-09-04 11:57:01 +02:00
Nicolas Patry	7b2caca764	Adding a new pre_tokenizer: Digits. Easier to split on digits: Digits(individual_digits=False) -> 'Call 123 please' becomes 'Call ', '123', 'please' Digits(individual_digits=True) -> 'Call 123 please' becomes 'Call ', '1', '2', '3', 'please'	2020-09-03 21:03:45 +02:00
Anthony MOI	b8f1eb48cb	Python - Bump version for 0.9.0.dev1 release	2020-09-02 22:31:01 -04:00
Nicolas Patry	816632c9fa	Removing `--release` compat test. - Leaving the one that checks that sampling follows the expected distribution. - Marking the python Unigram.train(..) test as slow - The python Unigram.train(..) test now uses `big.txt` file.	2020-09-02 13:38:14 -04:00
Nicolas Patry	d0366529b7	Use a smaller train file.	2020-09-02 13:38:14 -04:00
Nicolas Patry	7b5c2b92c6	Fixing test dependency.	2020-09-02 13:38:14 -04:00
Nicolas Patry	ee3860c029	Enabling training parity check for tokenizers.UnigramTrainer	2020-09-02 13:38:14 -04:00
Nicolas Patry	558e76f18e	Expose the trainer to Python bindings.	2020-09-02 13:38:14 -04:00
Nicolas Patry	52082b5476	New clippy comments?	2020-09-02 16:32:50 +02:00
Nicolas Patry	c0798acacf	Address @n1t0 comments.	2020-09-02 16:32:50 +02:00
Nicolas Patry	d624645cf3	Attempting to add UnigramTrainer to python bindings.	2020-09-02 16:32:50 +02:00
Nicolas Patry	95e126cd82	Missed *.pyi file.	2020-09-02 16:32:50 +02:00
Nicolas Patry	dd91739ba0	Now spm_parity_check succeeds because we have the correct pre_tokenizer.	2020-09-02 16:32:50 +02:00
Nicolas Patry	e974cfb1c9	Formatting after rebase.	2020-09-02 16:32:50 +02:00
Nicolas Patry	439305eea0	Failing test for compatibility for SentencePieceUnigramTokenizer. - We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A). Could be linked to float precision and hard or impossible to fix (should not hinder model performance) - We are now fusing_unk by default as it's the case with spm_train - We are still failing on at least space deduplication. Probably should be handled by a pre-tokenizer.	2020-09-02 16:32:50 +02:00
Anthony MOI	bd8dac202c	Add failing test for from_file	2020-09-01 09:53:50 -04:00
Nicolas Patry	76b86f6901	Removing forgotten places.	2020-08-31 14:05:39 -04:00
Nicolas Patry	857948e5b8	Addressing comments: - Remove Deduplication in favor of WhitespaceSplit. - Updated comments	2020-08-31 14:05:39 -04:00
Nicolas Patry	1994dcad6e	Re-enabling Custom Serialize	2020-08-31 14:05:39 -04:00
Nicolas Patry	6887c0f04d	Black pass.	2020-08-31 14:05:39 -04:00
Nicolas Patry	7ed7f0f26a	Adding 3 new PreTokenizers: - Deduplication: Removes duplicate spaces within strings - Punctuation: Splits punctuation characters as isolated tokens - Sequence: Applies a list of pretokenizers iteratively	2020-08-31 14:05:39 -04:00
Anthony MOI	c036cd4ced	Python - Bump version for 0.9.0.dev0 release	2020-08-21 18:52:29 -04:00
Anthony MOI	32a76b0331	Update CHANGELOGs	2020-08-21 18:52:15 -04:00
Anthony MOI	3d1322f108	Python - Improve and Test EncodeInput extraction	2020-08-21 18:39:49 -04:00
Anthony MOI	14adf18e5b	Python - Extract single pre-tokenized inputs from np.array	2020-08-21 18:39:49 -04:00
Anthony MOI	d919d68889	Python - InputSequence with references when possible	2020-08-21 18:39:49 -04:00
Anthony MOI	504d8c85d8	Remove Tokenizer::normalize This is actually a legacy function that doesn't really make sense now, and is getting really difficult to keep. So we remove it.	2020-08-19 12:42:12 -04:00
Anthony MOI	f92c9955e7	Python - Update bindings	2020-08-19 12:42:12 -04:00
Sebastian Pütz	10a39ba6b4	Add in-place train.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	ac8af63f70	Trainers don't need Arc.	2020-08-04 15:59:33 -04:00
Anthony MOI	363adedb4c	Fixes and cleanup, suggestions by @n1t0.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	f6adcf0e7c	Remove typetag, bump deps.	2020-08-04 15:59:33 -04:00
Sebastian Puetz	16f75d9efc	Ensure serialization works in all expected ways.	2020-08-04 15:59:33 -04:00
Sebastian Puetz	aaf8e932b1	Remove Send + Sync requirements from Model.	2020-08-04 15:59:33 -04:00
Sebastian Puetz	42b810488f	Hide generics	2020-08-04 15:59:33 -04:00
Sebastian Pütz	d62adf7195	Remove Container, changes to PyDecoder, cloneable Tokenizer. * derive Clone on Tokenizer and AddedVocabulary. * Replace Container with Arc wrapper for Decoders. * Prefix Rust Decoder types with Py. * Rename PyDecoder to CustomDecoder. * Change panic in serializing custom decoder to exception. * Re-enable training with cloneable Tokenizer. * Remove unsound Container, use Arc wrappers instead.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	11e86a16c5	Remove Container from PostProcessors, replace with Arc. * prefix the Python types in Rust with Py. * remove unsound Container wrappers, replace with Arc.	2020-08-04 15:59:33 -04:00
Sebastian Pütz	b411443128	Remove Container from PreTokenizers, replace with Arc. * prefix the Python types in Rust with Py, rename PyPretokenizer to CustomPretokenizer * remove unsound Container wrappers, replace with Arc * change panic on trying to (de-)serialize custom pretokenizer to exception	2020-08-04 15:59:33 -04:00
Sebastian Pütz	08b8c48127	Remove Container from Normalizers, replace with Arc. * prefix the Python types in Rust with Py * remove unsound Container wrappers, replace with Arc	2020-08-04 15:59:33 -04:00
Sebastian Pütz	83a52c8080	Replace Model and Trainer Containers. * Implement changes necessary from generic Model in Tokenizer. * Temporarily disable training in Python since Clone can't be derived for Model until all components have been replaced. * Prefix Python types in Rust with Py.	2020-08-04 15:59:33 -04:00
Anthony MOI	dad70e8e85	Implement suggestions by @sebpuetz Co-authored-by: Sebastian Pütz <sebastian.puetz@uni-tuebingen.de>	2020-08-03 16:18:59 -04:00
Anthony MOI	7833965dc4	Update Python bindings with new interface	2020-08-03 16:18:59 -04:00
Anthony MOI	904ff24382	New API for PreTokenizer and Model + refactor Tokenizer - WIP	2020-08-03 16:18:59 -04:00
Sebastian Pütz	27e326ab2b	Fix deadlocks with custom python components.	2020-08-03 16:17:17 -04:00
Sebastian Pütz	0d7c232f95	Move Python source to subdirectory. This allows testing versions not built in-place. Otherwise importing (or testing) in the package root fails without develop builds. Replace maturin with setuptools_rust since maturin fails with proper project structure.	2020-07-25 23:40:47 +02:00
Anthony MOI	c901f86d52	Python - Bump version for 0.8.1	2020-07-20 16:33:48 -04:00
Anthony MOI	157feed9a5	Python - Bump version for 0.8.1.rc2	2020-07-17 13:12:23 -04:00
Setu Shah	1f2cc6ee73	Include license in PyPI package	2020-07-16 14:20:32 -04:00

436 Commits