Commit Graph

457 Commits

Author SHA1 Message Date
Anthony MOI
003d2ac6fb Python - Update PyToken bindings 2020-09-23 15:50:01 -04:00
Anthony MOI
fce6998dcf Python - Add bindings for NormalizedString 2020-09-23 15:50:01 -04:00
Anthony MOI
e4b10e0fd9 Python - Add RefMutGuard to safely share &mut 2020-09-23 15:50:01 -04:00
Anthony MOI
a42e13a644 Setup black format in pyproject.toml 2020-09-23 11:58:35 -04:00
Nicolas Patry
9d3a93db5b Going back to not fusing unknown tokens (fuse_unk) by default for BPE, but adding a flag to
enable it.
2020-09-22 16:27:09 -04:00
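The effect of the fuse_unk flag referenced in the commit above can be sketched in a few lines (illustrative Python only, not the actual Rust BPE code; the `fuse_unk_tokens` helper and `<unk>` token name are hypothetical):

```python
def fuse_unk_tokens(tokens, unk_token="<unk>", fuse_unk=True):
    """Merge runs of consecutive unknown tokens into a single one
    (sketch of what the fuse_unk flag enables)."""
    if not fuse_unk:
        return list(tokens)
    fused = []
    for tok in tokens:
        # Skip an unk that directly follows another unk.
        if tok == unk_token and fused and fused[-1] == unk_token:
            continue
        fused.append(tok)
    return fused

print(fuse_unk_tokens(["hi", "<unk>", "<unk>", "there"]))
# -> ['hi', '<unk>', 'there']
```

With the flag off, both unknown tokens are kept as separate entries, which is the default the commit restores.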
Anthony MOI
940f8bd8fa Update PyO3 (#426) 2020-09-22 12:00:20 -04:00
Nicolas Patry
c536b4992b Move to dev3 build. 2020-09-22 08:21:38 +02:00
Nicolas Patry
07197e8e35 Move to spm_precompiled 0.1.2 for smaller binary string. 2020-09-22 08:21:38 +02:00
Nicolas Patry
033b98ce59 Updating convert scripts with Replace normalizer. 2020-09-22 08:21:38 +02:00
Nicolas Patry
c59b216baa Fixing convert/check scripts. 2020-09-22 08:21:38 +02:00
Nicolas Patry
c0b9229833 Fixed Vietnamese bug; now we have a Thai bug. 2020-09-22 08:21:38 +02:00
Nicolas Patry
b16406c900 Moving StripAccents within the normalizer for Albert + XLNet, but it now
crashes in Precompiled. Are the offsets wrong?
2020-09-22 08:21:38 +02:00
Nicolas Patry
275ee6d4c4 Making convert script machine agnostic. 2020-09-22 08:21:38 +02:00
Nicolas Patry
2fd1d9cf06 Adding a new convert script that will convert all Python Tokenizer code
into a proper Rust Tokenizer format and check it on a file.

- Also fuse unknown tokens (fuse_unk) by default in `tokenizers`'s BPE.
2020-09-22 08:21:38 +02:00
Nicolas Patry
aea22a4004 Adding node bindings.
- simplify normalizer.
- simplify python bindings.
2020-09-18 12:24:39 +02:00
Nicolas Patry
792d618006 Adding a new "Replace" normalizer that takes a string and replaces it
with another string (for now).
2020-09-18 12:24:39 +02:00
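The idea behind the "Replace" normalizer above can be sketched as follows (a minimal Python stand-in; the real Rust implementation also tracks character offsets through the replacement, which this sketch ignores):

```python
class ReplaceNormalizer:
    """Sketch of a "Replace" normalizer: substitutes every occurrence
    of one string with another (illustrative only)."""

    def __init__(self, pattern: str, content: str):
        self.pattern = pattern
        self.content = content

    def normalize_str(self, text: str) -> str:
        # Plain string substitution; offset tracking omitted.
        return text.replace(self.pattern, self.content)

norm = ReplaceNormalizer("``", '"')
print(norm.normalize_str("``quoted`` text"))
# -> "quoted" text
```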
Nicolas Patry
75464734df Adding a new normalizer that strips accents by removing combining (#416)
* Adding a new normalizer that strips accents by removing combining
  characters in Unicode strings.

* Adding Node bindings + a better normalizer impl.

* Doc comment -> Regular comment.
2020-09-17 09:49:41 +02:00
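The accent-stripping technique the commit describes, decompose to NFD and drop combining characters, can be shown with the standard library (a sketch of the approach, not the normalizer's actual code):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Strip accents by decomposing to NFD and removing all
    Unicode combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("déjà vu"))  # -> deja vu
```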
Nicolas Patry
330876ae02 Improvements on spm parity: (#401)
* Removing all pre_tokenizer logic from Unigram algorithm.

* Improving the parity check *a lot*.

- We can now detect a lot more errors
- Special cases have been added temporarily.

* Adding 2 new normalizers that mimic spm default's behavior.

* Adding `encoding_optimized` version of the `encode` algorithm.

- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
  allocation of the full results.

* Trie<char> -> Trie<u8> Another improvement on speed.

* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.

* Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`.

- It will be used for direct compatibility with `Spm` and will replace all
of their custom rules by directly using the normalizer spec embedded
within spm files, removing any need for custom rules on our side.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
  the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.

* Fixing a bug with Precompiled normalizer.

* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.

It seems even a very carefully hand-crafted trie does not prevent one from
shooting oneself in the foot. Sorry, future reader.

* Keep API stable for this PR (change of the API should come later #409).

- Removed sentencepiece_model_pb2 from binding and add instructions to
make `from_spm` work.

* Adding model check in `from_spm`.

* Addressing @n1t0's comments.

* Adding a check to make sure alignments stay correct.

Also added a bit more documentation on how Precompiled works.

* Extracting `Precompiled` into its own `spm_precompiled` crate.

* Using ranges in `do_nmt`.
2020-09-15 22:21:02 +02:00
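Two of the speed changes in the commit above, Trie&lt;char&gt; -> Trie&lt;u8&gt; and making `common_prefix_search` return an iterator instead of allocating the full result list, can be illustrated with a small byte-level trie (a sketch under those assumptions, not the real Rust or `Darts::DoubleArray` code):

```python
class ByteTrie:
    """Minimal byte-level trie: keys are byte strings, and
    common_prefix_search yields matches lazily instead of
    building the full result list up front (illustrative only)."""

    def __init__(self):
        self.root = {}

    def insert(self, word: bytes):
        node = self.root
        for b in word:
            node = node.setdefault(b, {})
        node[None] = word  # terminal marker storing the full key

    def common_prefix_search(self, text: bytes):
        """Lazily yield every inserted key that is a prefix of `text`."""
        node = self.root
        for b in text:
            if None in node:
                yield node[None]
            if b not in node:
                return
            node = node[b]
        if None in node:
            yield node[None]

trie = ByteTrie()
for w in (b"a", b"ab", b"abc", b"b"):
    trie.insert(w)
print(list(trie.common_prefix_search(b"abcd")))
# -> [b'a', b'ab', b'abc']
```

Because the search is a generator, a caller that only needs the first match can stop early without paying for the rest, which is the allocation the commit removes.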
Nicolas Patry
62c3d40f11 Upgrading dependencies (esaxx-rs to build). 2020-09-14 13:33:15 +02:00
Anthony MOI
fee1d4e8a3 TemplateProcessing - Add @narsil suggestions
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2020-09-10 15:04:19 -04:00
Anthony MOI
b7df6539e6 TemplateProcessing: Update CHANGELOGs 2020-09-10 15:04:19 -04:00
Anthony MOI
337fe72b13 Python - Bindings for TemplateProcessing 2020-09-10 15:04:19 -04:00
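The core idea of TemplateProcessing, filling a template such as "[CLS] $A [SEP]" with an encoded sequence and special-token ids, can be sketched as follows (the `apply_template` helper and the example ids are hypothetical; the real binding handles pairs, type ids, and offsets as well):

```python
def apply_template(template: str, seq_ids, special_tokens):
    """Sketch of template-based post-processing: each whitespace-separated
    piece is either the $A placeholder (replaced by the encoded sequence)
    or a special token looked up by name."""
    out = []
    for piece in template.split():
        if piece == "$A":
            out.extend(seq_ids)
        else:
            out.append(special_tokens[piece])
    return out

ids = apply_template("[CLS] $A [SEP]", [17, 42], {"[CLS]": 101, "[SEP]": 102})
print(ids)  # -> [101, 17, 42, 102]
```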
Nicolas Patry
df827d538f Adding clippy as a linter within the Python binding. (#388)
* Adding clippy as a linter within the Python binding.

* Missing clippy (dropped commit ??)
2020-09-04 09:09:02 -04:00
Nicolas Patry
efa20202dc Addressing @n1t0's comments. 2020-09-04 11:57:01 +02:00
Nicolas Patry
7b2caca764 Adding a new pre_tokenizer: Digits.
Makes it easier to split on digits:

Digits(individual_digits=False) -> 'Call 123 please' becomes 'Call ', '123', 'please'
Digits(individual_digits=True) -> 'Call 123 please' becomes 'Call ', '1', '2', '3', 'please'
2020-09-03 21:03:45 +02:00
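The two Digits modes described above can be approximated with a regex split (an illustrative sketch, not the actual pre-tokenizer, which also handles offsets and whitespace details differently):

```python
import re

def digits_pretokenize(text: str, individual_digits: bool = False):
    """Split `text` around digits, either as whole runs ('123') or
    one digit at a time ('1', '2', '3'). Sketch only."""
    pattern = r"(\d)" if individual_digits else r"(\d+)"
    # re.split with a capturing group keeps the digit matches;
    # empty strings between adjacent digit groups are dropped.
    return [p for p in re.split(pattern, text) if p]

print(digits_pretokenize("Call 123 please"))
# -> ['Call ', '123', ' please']
print(digits_pretokenize("Call 123 please", individual_digits=True))
# -> ['Call ', '1', '2', '3', ' please']
```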
Anthony MOI
b8f1eb48cb Python - Bump version for 0.9.0.dev1 release 2020-09-02 22:31:01 -04:00
Nicolas Patry
816632c9fa Removing --release compat test.
- Leaving the one that checks that sampling follows the expected
distribution.
- Marking the Python Unigram.train(..) test as slow.
- The Python Unigram.train(..) test now uses the `big.txt` file.
2020-09-02 13:38:14 -04:00
Nicolas Patry
d0366529b7 Use a smaller train file. 2020-09-02 13:38:14 -04:00
Nicolas Patry
7b5c2b92c6 Fixing test dependency. 2020-09-02 13:38:14 -04:00
Nicolas Patry
ee3860c029 Enabling training parity check for tokenizers.UnigramTrainer 2020-09-02 13:38:14 -04:00
Nicolas Patry
558e76f18e Expose the trainer to Python bindings. 2020-09-02 13:38:14 -04:00
Nicolas Patry
52082b5476 New clippy comments? 2020-09-02 16:32:50 +02:00
Nicolas Patry
c0798acacf Address @n1t0 comments. 2020-09-02 16:32:50 +02:00
Nicolas Patry
d624645cf3 Attempting to add UnigramTrainer to python bindings. 2020-09-02 16:32:50 +02:00
Nicolas Patry
95e126cd82 Missed *.pyi file. 2020-09-02 16:32:50 +02:00
Nicolas Patry
dd91739ba0 Now spm_parity_check succeeds because we have the correct pre_tokenizer. 2020-09-02 16:32:50 +02:00
Nicolas Patry
e974cfb1c9 Formatting after rebase. 2020-09-02 16:32:50 +02:00
Nicolas Patry
439305eea0 Failing test for compatibility of SentencePieceUnigramTokenizer.
- We are failing on ambiguous tokenizations (AAA -> A + AA vs AA + A).
  Could be linked to float precision; hard or impossible to fix
  (should not hinder model performance).

- We now fuse_unk by default, as is the case with spm_train.

- We are still failing on at least space deduplication. That should
  probably be handled by a pre-tokenizer.
2020-09-02 16:32:50 +02:00
Anthony MOI
bd8dac202c Add failing test for from_file 2020-09-01 09:53:50 -04:00
Nicolas Patry
76b86f6901 Removing forgotten places. 2020-08-31 14:05:39 -04:00
Nicolas Patry
857948e5b8 Addressing comments:
- Removed Deduplication in favor of WhitespaceSplit.
- Updated comments
2020-08-31 14:05:39 -04:00
Nicolas Patry
1994dcad6e Re-enabling Custom Serialize 2020-08-31 14:05:39 -04:00
Nicolas Patry
6887c0f04d Black pass. 2020-08-31 14:05:39 -04:00
Nicolas Patry
7ed7f0f26a Adding 3 new PreTokenizers:
- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters into isolated tokens
- Sequence: Applies a list of pre-tokenizers iteratively
2020-08-31 14:05:39 -04:00
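The three pre-tokenizers listed in that commit can be sketched like this (illustrative Python helpers, not the Rust implementations; the real ones operate on NormalizedString pieces with offsets, and Deduplication was later replaced by WhitespaceSplit):

```python
import re

def deduplicate_spaces(pieces):
    """Deduplication: collapse runs of spaces into a single space."""
    return [re.sub(r" {2,}", " ", p) for p in pieces]

def split_punctuation(pieces):
    """Punctuation: isolate punctuation characters as their own tokens."""
    out = []
    for piece in pieces:
        out.extend(p for p in re.split(r"([^\w\s])", piece) if p)
    return out

def sequence(pretokenizers, pieces):
    """Sequence: apply a list of pre-tokenizers one after the other."""
    for pre in pretokenizers:
        pieces = pre(pieces)
    return pieces

print(sequence([deduplicate_spaces, split_punctuation], ["Hello,  world!"]))
# -> ['Hello', ',', ' world', '!']
```

The Sequence combinator is just function composition over the piece list, which is what makes pre-tokenizers easy to chain.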
Anthony MOI
c036cd4ced Python - Bump version for 0.9.0.dev0 release 2020-08-21 18:52:29 -04:00
Anthony MOI
32a76b0331 Update CHANGELOGs 2020-08-21 18:52:15 -04:00
Anthony MOI
3d1322f108 Python - Improve and Test EncodeInput extraction 2020-08-21 18:39:49 -04:00
Anthony MOI
14adf18e5b Python - Extract single pre-tokenized inputs from np.array 2020-08-21 18:39:49 -04:00
Anthony MOI
d919d68889 Python - InputSequence with references when possible 2020-08-21 18:39:49 -04:00
Anthony MOI
504d8c85d8 Remove Tokenizer::normalize
This is a legacy function that doesn't really make sense anymore and is getting difficult to maintain, so we remove it.
2020-08-19 12:42:12 -04:00