Commit Graph

457 Commits

Author SHA1 Message Date
Anthony MOI
003d2ac6fb Python - Update PyToken bindings 2020-09-23 15:50:01 -04:00
Anthony MOI
fce6998dcf Python - Add bindings for NormalizedString 2020-09-23 15:50:01 -04:00
Anthony MOI
e4b10e0fd9 Python - Add RefMutGuard to safely share &mut 2020-09-23 15:50:01 -04:00
Anthony MOI
a42e13a644 Setup black format in pyproject.toml 2020-09-23 11:58:35 -04:00
Nicolas Patry
9d3a93db5b Going back to not fusing unknown tokens (fuse_unk) by default for BPE, but adding a flag to
enable it.
2020-09-22 16:27:09 -04:00
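The effect of the fuse_unk flag referenced in the commit above can be sketched in a few lines (illustrative Python only, not the actual Rust BPE code; the `fuse_unk_tokens` helper and `<unk>` token name are hypothetical):

```python
def fuse_unk_tokens(tokens, unk_token="<unk>", fuse_unk=True):
    """Merge runs of consecutive unknown tokens into a single one
    (sketch of what the fuse_unk flag enables)."""
    if not fuse_unk:
        return list(tokens)
    fused = []
    for tok in tokens:
        # Skip an unk that directly follows another unk.
        if tok == unk_token and fused and fused[-1] == unk_token:
            continue
        fused.append(tok)
    return fused

print(fuse_unk_tokens(["hi", "<unk>", "<unk>", "there"]))
# -> ['hi', '<unk>', 'there']
```

With the flag off, both unknown tokens are kept as separate entries, which is the default the commit restores.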
Anthony MOI
940f8bd8fa Update PyO3 (#426) 2020-09-22 12:00:20 -04:00
Nicolas Patry
c536b4992b Move to dev3 build. 2020-09-22 08:21:38 +02:00
Nicolas Patry
07197e8e35 Move to spm_precompiled 0.1.2 for smaller binary string. 2020-09-22 08:21:38 +02:00
Nicolas Patry
033b98ce59 Updating convert scripts with Replace normalizer. 2020-09-22 08:21:38 +02:00
Nicolas Patry
c59b216baa Fixing convert/check scripts. 2020-09-22 08:21:38 +02:00
Nicolas Patry
c0b9229833 Fixed Vietnamese bug; now we have a Thai bug. 2020-09-22 08:21:38 +02:00
Nicolas Patry
b16406c900 Moving StripAccents within the normalizer for Albert + XLNet, but it now
crashes in Precompiled. Are the offsets wrong?
2020-09-22 08:21:38 +02:00
Nicolas Patry
275ee6d4c4 Making convert script machine agnostic. 2020-09-22 08:21:38 +02:00
Nicolas Patry
2fd1d9cf06 Adding a new convert script that will convert all Python Tokenizer code
into a proper Rust Tokenizer format and check it on a file.

- Also fuse unknown tokens (fuse_unk) by default in `tokenizers`'s BPE.
2020-09-22 08:21:38 +02:00
Nicolas Patry
aea22a4004 Adding node bindings.
- simplify normalizer.
- simplify python bindings.
2020-09-18 12:24:39 +02:00
Nicolas Patry
792d618006 Adding a new "Replace" normalizer that takes a string and replaces it
with another string (for now).
2020-09-18 12:24:39 +02:00
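The idea behind the "Replace" normalizer above can be sketched as follows (a minimal Python stand-in; the real Rust implementation also tracks character offsets through the replacement, which this sketch ignores):

```python
class ReplaceNormalizer:
    """Sketch of a "Replace" normalizer: substitutes every occurrence
    of one string with another (illustrative only)."""

    def __init__(self, pattern: str, content: str):
        self.pattern = pattern
        self.content = content

    def normalize_str(self, text: str) -> str:
        # Plain string substitution; offset tracking omitted.
        return text.replace(self.pattern, self.content)

norm = ReplaceNormalizer("``", '"')
print(norm.normalize_str("``quoted`` text"))
# -> "quoted" text
```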
Nicolas Patry
75464734df Adding a new normalizer that strips accents by removing combining (#416)
* Adding a new normalizer that strips accents by removing combining
  characters in Unicode strings.

* Adding Node bindings + a better normalizer impl.

* Doc comment -> Regular comment.
2020-09-17 09:49:41 +02:00
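The accent-stripping technique the commit describes, decompose to NFD and drop combining characters, can be shown with the standard library (a sketch of the approach, not the normalizer's actual code):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Strip accents by decomposing to NFD and removing all
    Unicode combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("déjà vu"))  # -> deja vu
```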
Nicolas Patry
330876ae02 Improvements on spm parity: (#401)
* Removing all pre_tokenizer logic from Unigram algorithm.

* Improving the parity check *a lot*.

- We can now detect a lot more errors
- Special cases have been added temporarily.

* Adding 2 new normalizers that mimic spm default's behavior.

* Adding `encoding_optimized` version of the `encode` algorithm.

- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
  allocation of the full results.

* Trie<char> -> Trie<u8> Another improvement on speed.

* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.

* Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`.

- It will be used for direct compatibility with `Spm` and will replace all
of their custom rules by directly using the normalizer spec embedded
within spm files, removing any need for custom rules on our side.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
  the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.

* Fixing a bug with Precompiled normalizer.

* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.

It seems even a very carefully hand-crafted trie does not prevent one from
shooting oneself in the foot. Sorry, future reader.

* Keep API stable for this PR (change of the API should come later #409).

- Removed sentencepiece_model_pb2 from binding and add instructions to
make `from_spm` work.

* Adding model check in `from_spm`.

* Addressing @n1t0's comments.

* Adding a check to make sure alignments stay correct.

Also added a bit more documentation on how Precompiled works.

* Extracting `Precompiled` into its own `spm_precompiled` crate.

* Using ranges in `do_nmt`.
2020-09-15 22:21:02 +02:00
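Two of the speed changes in the commit above, Trie&lt;char&gt; -> Trie&lt;u8&gt; and making `common_prefix_search` return an iterator instead of allocating the full result list, can be illustrated with a small byte-level trie (a sketch under those assumptions, not the real Rust or `Darts::DoubleArray` code):

```python
class ByteTrie:
    """Minimal byte-level trie: keys are byte strings, and
    common_prefix_search yields matches lazily instead of
    building the full result list up front (illustrative only)."""

    def __init__(self):
        self.root = {}

    def insert(self, word: bytes):
        node = self.root
        for b in word:
            node = node.setdefault(b, {})
        node[None] = word  # terminal marker storing the full key

    def common_prefix_search(self, text: bytes):
        """Lazily yield every inserted key that is a prefix of `text`."""
        node = self.root
        for b in text:
            if None in node:
                yield node[None]
            if b not in node:
                return
            node = node[b]
        if None in node:
            yield node[None]

trie = ByteTrie()
for w in (b"a", b"ab", b"abc", b"b"):
    trie.insert(w)
print(list(trie.common_prefix_search(b"abcd")))
# -> [b'a', b'ab', b'abc']
```

Because the search is a generator, a caller that only needs the first match can stop early without paying for the rest, which is the allocation the commit removes.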
Nicolas Patry
62c3d40f11 Upgrading dependencies (esaxx-rs to build). 2020-09-14 13:33:15 +02:00
Anthony MOI
fee1d4e8a3 TemplateProcessing - Add @narsil suggestions
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2020-09-10 15:04:19 -04:00
Anthony MOI
b7df6539e6 TemplateProcessing: Update CHANGELOGs 2020-09-10 15:04:19 -04:00
Anthony MOI
337fe72b13 Python - Bindings for TemplateProcessing 2020-09-10 15:04:19 -04:00
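The core idea of TemplateProcessing, filling a template such as "[CLS] $A [SEP]" with an encoded sequence and special-token ids, can be sketched as follows (the `apply_template` helper and the example ids are hypothetical; the real binding handles pairs, type ids, and offsets as well):

```python
def apply_template(template: str, seq_ids, special_tokens):
    """Sketch of template-based post-processing: each whitespace-separated
    piece is either the $A placeholder (replaced by the encoded sequence)
    or a special token looked up by name."""
    out = []
    for piece in template.split():
        if piece == "$A":
            out.extend(seq_ids)
        else:
            out.append(special_tokens[piece])
    return out

ids = apply_template("[CLS] $A [SEP]", [17, 42], {"[CLS]": 101, "[SEP]": 102})
print(ids)  # -> [101, 17, 42, 102]
```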
Nicolas Patry
df827d538f Adding clippy as a linter within the Python binding. (#388)
* Adding clippy as a linter within the Python binding.

* Missing clippy (dropped commit ??)
2020-09-04 09:09:02 -04:00
Nicolas Patry
efa20202dc Addressing @n1t0's comments. 2020-09-04 11:57:01 +02:00
Nicolas Patry
7b2caca764 Adding a new pre_tokenizer: Digits.
Makes it easier to split on digits:

Digits(individual_digits=False) -> 'Call 123 please' becomes 'Call ', '123', 'please'
Digits(individual_digits=True) -> 'Call 123 please' becomes 'Call ', '1', '2', '3', 'please'
2020-09-03 21:03:45 +02:00
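The two Digits modes described above can be approximated with a regex split (an illustrative sketch, not the actual pre-tokenizer, which also handles offsets and whitespace details differently):

```python
import re

def digits_pretokenize(text: str, individual_digits: bool = False):
    """Split `text` around digits, either as whole runs ('123') or
    one digit at a time ('1', '2', '3'). Sketch only."""
    pattern = r"(\d)" if individual_digits else r"(\d+)"
    # re.split with a capturing group keeps the digit matches;
    # empty strings between adjacent digit groups are dropped.
    return [p for p in re.split(pattern, text) if p]

print(digits_pretokenize("Call 123 please"))
# -> ['Call ', '123', ' please']
print(digits_pretokenize("Call 123 please", individual_digits=True))
# -> ['Call ', '1', '2', '3', ' please']
```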
Anthony MOI
b8f1eb48cb Python - Bump version for 0.9.0.dev1 release 2020-09-02 22:31:01 -04:00
Nicolas Patry
816632c9fa Removing --release compat test.
- Leaving the one that checks that sampling follows the expected
distribution.
- Marking the Python Unigram.train(..) test as slow.
- The Python Unigram.train(..) test now uses the `big.txt` file.
2020-09-02 13:38:14 -04:00
Nicolas Patry
d0366529b7 Use a smaller train file. 2020-09-02 13:38:14 -04:00
Nicolas Patry
7b5c2b92c6 Fixing test dependency. 2020-09-02 13:38:14 -04:00
Nicolas Patry
ee3860c029 Enabling training parity check for tokenizers.UnigramTrainer 2020-09-02 13:38:14 -04:00
Nicolas Patry
558e76f18e Expose the trainer to Python bindings. 2020-09-02 13:38:14 -04:00
Nicolas Patry
52082b5476 New clippy comments? 2020-09-02 16:32:50 +02:00
Nicolas Patry
c0798acacf Address @n1t0 comments. 2020-09-02 16:32:50 +02:00
Nicolas Patry
d624645cf3 Attempting to add UnigramTrainer to python bindings. 2020-09-02 16:32:50 +02:00
Nicolas Patry
95e126cd82 Missed *.pyi file. 2020-09-02 16:32:50 +02:00
Nicolas Patry
dd91739ba0 Now spm_parity_check succeeds because we have the correct pre_tokenizer. 2020-09-02 16:32:50 +02:00
Nicolas Patry
e974cfb1c9 Formatting after rebase. 2020-09-02 16:32:50 +02:00
Nicolas Patry
439305eea0 Failing test for compatibility of SentencePieceUnigramTokenizer.
- We are failing on ambiguous tokenizations (AAA -> A + AA vs AA + A).
  Could be linked to float precision; hard or impossible to fix
  (should not hinder model performance).

- We now fuse_unk by default, as is the case with spm_train.

- We are still failing on at least space deduplication. That should
  probably be handled by a pre-tokenizer.
2020-09-02 16:32:50 +02:00
Anthony MOI
bd8dac202c Add failing test for from_file 2020-09-01 09:53:50 -04:00
Nicolas Patry
76b86f6901 Removing forgotten places. 2020-08-31 14:05:39 -04:00
Nicolas Patry
857948e5b8 Addressing comments:
- Removed Deduplication in favor of WhitespaceSplit.
- Updated comments
2020-08-31 14:05:39 -04:00
Nicolas Patry
1994dcad6e Re-enabling Custom Serialize 2020-08-31 14:05:39 -04:00
Nicolas Patry
6887c0f04d Black pass. 2020-08-31 14:05:39 -04:00
Nicolas Patry
7ed7f0f26a Adding 3 new PreTokenizers:
- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters into isolated tokens
- Sequence: Applies a list of pre-tokenizers iteratively
2020-08-31 14:05:39 -04:00
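The three pre-tokenizers listed in that commit can be sketched like this (illustrative Python helpers, not the Rust implementations; the real ones operate on NormalizedString pieces with offsets, and Deduplication was later replaced by WhitespaceSplit):

```python
import re

def deduplicate_spaces(pieces):
    """Deduplication: collapse runs of spaces into a single space."""
    return [re.sub(r" {2,}", " ", p) for p in pieces]

def split_punctuation(pieces):
    """Punctuation: isolate punctuation characters as their own tokens."""
    out = []
    for piece in pieces:
        out.extend(p for p in re.split(r"([^\w\s])", piece) if p)
    return out

def sequence(pretokenizers, pieces):
    """Sequence: apply a list of pre-tokenizers one after the other."""
    for pre in pretokenizers:
        pieces = pre(pieces)
    return pieces

print(sequence([deduplicate_spaces, split_punctuation], ["Hello,  world!"]))
# -> ['Hello', ',', ' world', '!']
```

The Sequence combinator is just function composition over the piece list, which is what makes pre-tokenizers easy to chain.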
Anthony MOI
c036cd4ced Python - Bump version for 0.9.0.dev0 release 2020-08-21 18:52:29 -04:00
Anthony MOI
32a76b0331 Update CHANGELOGs 2020-08-21 18:52:15 -04:00
Anthony MOI
3d1322f108 Python - Improve and Test EncodeInput extraction 2020-08-21 18:39:49 -04:00
Anthony MOI
14adf18e5b Python - Extract single pre-tokenized inputs from np.array 2020-08-21 18:39:49 -04:00
Anthony MOI
d919d68889 Python - InputSequence with references when possible 2020-08-21 18:39:49 -04:00
Anthony MOI
504d8c85d8 Remove Tokenizer::normalize
This is a legacy function that doesn't really make sense anymore and is getting difficult to maintain, so we remove it.
2020-08-19 12:42:12 -04:00