- We are failing on ambiguous tokenization (`AAA -> A + AA` vs `AA + A`).
  This may be linked to float precision and could be hard or impossible
  to fix (it should not hinder model performance); see the sketch below.
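
A minimal sketch of why such ties can flip (the scores are hypothetical
stand-ins for per-token log-probabilities): IEEE-754 addition is not
associative, so two segmentations whose mathematical scores are equal can
compare unequal depending on summation order.

```python
# Hypothetical per-token log-probability stand-ins.
scores = [0.1, 0.2, 0.3]

# Accumulate the same terms in two different orders, as two Viterbi
# paths (e.g. A + AA vs AA + A) might do internally.
path_one = (scores[0] + scores[1]) + scores[2]  # 0.6000000000000001
path_two = (scores[2] + scores[1]) + scores[0]  # 0.6

# Mathematically equal, but unequal in floating point, so an argmax
# over the two paths can break the "tie" either way.
print(path_one == path_two)  # False
```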
- We now enable `fuse_unk` by default, matching `spm_train`'s behavior
  (see the sketch below).
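
A sketch of what this looks like from the Python bindings, assuming a BPE
model with `<unk>` as the unknown token; the exact place where the default
is applied is not shown here.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# With fuse_unk=True, a run of unknown characters is emitted as a single
# <unk> token rather than one <unk> per character, as spm_train does.
tokenizer = Tokenizer(BPE(unk_token="<unk>", fuse_unk=True))
```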
- We are still failing on at least space deduplication, which should
  probably be handled by a pre-tokenizer (see the sketch below).
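
One possible shape for that fix, sketched here with a `Replace` normalizer
that collapses runs of spaces; a pre-tokenizer could host the same regex,
and the exact stage and pattern are assumptions.

```python
from tokenizers import Regex, normalizers

# Collapse any run of two or more spaces into a single space before
# the model sees the text.
dedup_spaces = normalizers.Replace(Regex(" {2,}"), " ")

print(dedup_spaces.normalize_str("hello    world"))  # "hello world"
```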