23 Commits

Author SHA1 Message Date
f1faec1756 Fix typos in strings and comments (#1770) 2025-05-27 08:17:36 +02:00
91393ef75e Fixing doc. (#1499)
* Fixing doc.

* SentencePieceUnigram and Convert.py still used sentencepiece

* stub

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-04-17 09:32:40 +02:00
29fef1e7aa [remove black] And use ruff (#1436)
* nits

* Fixing deps.

* Ruff update.

* Import order matters.

* Fix.

* Revert ruff fix.

* Visualizer.

* Putting back the imports.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-03-12 11:24:21 +01:00
4b0dc6b947 Fix SPM conversions (#686)
* Fix SPM conversions

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:55:55 -04:00
e999a7b5f9 Revert "Fix SPM conversions"
This reverts commit e1ffe39764.
2021-04-21 18:09:58 -04:00
e1ffe39764 Fix SPM conversions 2021-04-21 18:09:49 -04:00
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
598ce61229 Removed now-wrong code in convert.py, fixed strange black magic. 2020-09-24 08:57:02 +02:00
8f8156fd2c Addressing first pass of comments. 2020-09-24 08:57:02 +02:00
9d3a93db5b Going back to not using fuse_unk by default for BPE, but adding a flag to
enable it.
2020-09-22 16:27:09 -04:00
033b98ce59 Updating convert scripts with Replace normalizer. 2020-09-22 08:21:38 +02:00
c59b216baa Fixing convert/check scripts. 2020-09-22 08:21:38 +02:00
b16406c900 Moving StripAccents within the normalizer for Albert + XLNet, but it now crashes
in Precompiled. Are the offsets wrong?
2020-09-22 08:21:38 +02:00
275ee6d4c4 Making convert script machine agnostic. 2020-09-22 08:21:38 +02:00
2fd1d9cf06 Adding a new convert script that will convert all Python Tokenizer code
into a proper Rust Tokenizer format and check it on a file.

- Also enables fuse_unk by default in `tokenizers`'s BPE.
2020-09-22 08:21:38 +02:00
330876ae02 Improvements on spm parity: (#401)
* Removing all pre_tokenizer logic from Unigram algorithm.

* Improving the parity check *a lot*.

- We can now detect a lot more errors
- Special cases have been added temporarily.

* Adding 2 new normalizers that mimic spm's default behavior.

* Adding an `encoding_optimized` version of the `encode` algorithm.

- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
  allocation of the full results.

* Trie<char> -> Trie<u8>: another improvement on speed.

* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.

* Adding a new `Precompiled` Normalizer that replaces `SpmNmtNfkc`.

- It will be used for direct compatibility with `Spm` and will replace all
their custom rules by directly using the normalizer spec embedded
within spm files, removing any need for custom rules on our side.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
  the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.

* Fixing a bug with Precompiled normalizer.

* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.

It seems a very handily crafted trie does not prevent one from shooting
oneself in the foot. Sorry, future reader.

* Keep the API stable for this PR (changes to the API should come later, see #409).

- Removed sentencepiece_model_pb2 from the binding and added instructions to
make `from_spm` work.

* Adding model check in `from_spm`.

* Addressing @n1t0's comments.

* Adding a check to make sure alignments stay correct.

Also added a bit more documentation on how Precompiled works.

* Extracting `Precompiled` into its own `spm_precompiled` crate.

* Using ranges in `do_nmt`.
2020-09-15 22:21:02 +02:00
ee3860c029 Enabling training parity check for tokenizers.UnigramTrainer 2020-09-02 13:38:14 -04:00
dd91739ba0 Now spm_parity_check succeeds because we have the correct pre_tokenizer. 2020-09-02 16:32:50 +02:00
e974cfb1c9 Formatting after rebase. 2020-09-02 16:32:50 +02:00
439305eea0 Failing test for compatibility with SentencePieceUnigramTokenizer.
- We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A).
  This could be linked to float precision and may be hard or impossible to fix
  (it should not hinder model performance)

- We now fuse_unk by default, as is the case with spm_train

- We are still failing on at least space deduplication. It should probably
  be handled by a pre-tokenizer.
2020-09-02 16:32:50 +02:00
6887c0f04d Black pass. 2020-08-31 14:05:39 -04:00
0865a9ad55 Python - improve compatibility with sentencepiece in the conversion script 2020-04-11 17:35:50 +02:00
be10f542ce Added SentencePiece and YouTokenToMe model extractors.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-08 22:55:00 +01:00