23 Commits

Author SHA1 Message Date
f1faec1756 Fix typos in strings and comments (#1770) 2025-05-27 08:17:36 +02:00
91393ef75e Fixing doc. (#1499)
* Fixing doc.

* SentencePieceUnigram and Convert.py still used sentencepiece

* stub

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-04-17 09:32:40 +02:00
29fef1e7aa [remove black] And use ruff (#1436)
* nits

* Fixing deps.

* Ruff update.

* Import order matters.

* Fix.

* Revert ruff fix.

* Visualizer.

* Putting back the imports.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-12 11:24:21 +01:00
4b0dc6b947 Fix SPM conversions (#686)
* Fix SPM conversions

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:55:55 -04:00
e999a7b5f9 Revert "Fix SPM conversions"
This reverts commit e1ffe39764.
2021-04-21 18:09:58 -04:00
e1ffe39764 Fix SPM conversions 2021-04-21 18:09:49 -04:00
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
598ce61229 Removed now-wrong code in convert.py, fixed strange black magic. 2020-09-24 08:57:02 +02:00
8f8156fd2c Addressing first pass of comments. 2020-09-24 08:57:02 +02:00
9d3a93db5b Going back to not using fuse_unk by default for BPE, but adding a flag to
enable it.
2020-09-22 16:27:09 -04:00
033b98ce59 Updating convert scripts with Replace normalizer. 2020-09-22 08:21:38 +02:00
c59b216baa Fixing convert/check scripts. 2020-09-22 08:21:38 +02:00
b16406c900 Moving StripAccents within the normalizer for Albert + XLNet, but it now crashes
in Precompiled. Are the offsets wrong?
2020-09-22 08:21:38 +02:00
275ee6d4c4 Making convert script machine agnostic. 2020-09-22 08:21:38 +02:00
2fd1d9cf06 Adding a new convert script that will convert all Python Tokenizer code
into a proper Rust Tokenizer format and check it on a file.

- Also enables fuse_unk by default in `tokenizers`'s BPE.
2020-09-22 08:21:38 +02:00
330876ae02 Improvements on spm parity: (#401)
* Removing all pre_tokenizer logic from Unigram algorithm.

* Improving the parity check *a lot*.

- We can now detect a lot more errors
- Special cases have been added temporarily.

* Adding 2 new normalizers that mimic spm's default behavior.

* Adding an `encoding_optimized` version of the `encode` algorithm.

- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
  allocation of the full results.

* Trie<char> -> Trie<u8>: another improvement on speed.

* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.

* Adding a new `Precompiled` Normalizer that replaces `SpmNmtNfkc`.

- It will be used for direct compatibility with `Spm` and will replace all
their custom rules by directly using the normalizer spec embedded
within spm files, removing any need for custom rules on our side.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
  the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.

* Fixing a bug with Precompiled normalizer.

* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.

It seems a very handily crafted trie does not prevent one from shooting
oneself in the foot. Sorry, future reader.

* Keep the API stable for this PR (changes to the API should come later, see #409).

- Removed sentencepiece_model_pb2 from the binding and added instructions to
make `from_spm` work.

* Adding model check in `from_spm`.

* Addressing @n1t0's comments.

* Adding a check to make sure alignments stay correct.

Also added a bit more documentation on how Precompiled works.

* Extracting `Precompiled` into its own `spm_precompiled` crate.

* Using ranges in `do_nmt`.
2020-09-15 22:21:02 +02:00
ee3860c029 Enabling training parity check for tokenizers.UnigramTrainer 2020-09-02 13:38:14 -04:00
dd91739ba0 Now spm_parity_check succeeds because we have the correct pre_tokenizer. 2020-09-02 16:32:50 +02:00
e974cfb1c9 Formatting after rebase. 2020-09-02 16:32:50 +02:00
439305eea0 Failing test for compatibility with SentencePieceUnigramTokenizer.
- We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A).
  This could be linked to float precision and may be hard or impossible to fix
  (it should not hinder model performance)

- We now fuse_unk by default, as is the case with spm_train

- We are still failing on at least space deduplication. It should probably
  be handled by a pre-tokenizer.
2020-09-02 16:32:50 +02:00
6887c0f04d Black pass. 2020-08-31 14:05:39 -04:00
0865a9ad55 Python - improve compatibility with sentencepiece in the conversion script 2020-04-11 17:35:50 +02:00
be10f542ce Added SentencePiece and YouTokenToMe model extractors.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-08 22:55:00 +01:00