Commit Graph

17 Commits

Author SHA1 Message Date
Anthony MOI
337fe72b13 Python - Bindings for TemplateProcessing 2020-09-10 15:04:19 -04:00
Nicolas Patry
7b2caca764 Adding a new pre_tokenizer: Digits.
Easier to split on digits:

Digits(individual_digits=False) -> 'Call 123 please' becomes 'Call ',
'123', 'please'
Digits(individual_digits=True) -> 'Call 123 please' becomes 'Call ',
'1', '2', '3', 'please'
2020-09-03 21:03:45 +02:00
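
The two Digits modes described in the commit above can be approximated in plain Python. This is only a rough sketch built on the standard `re` module, not the library's implementation; `split_digits` is a hypothetical helper name. Unlike the commit message's example, this sketch leaves surrounding whitespace attached to the neighbouring pieces.

```python
import re


def split_digits(text, individual_digits=False):
    """Split `text` so that digits end up in their own pieces.

    Hypothetical helper for illustration only; it does not call the
    tokenizers library.
    """
    # One digit per piece, or one whole run of digits per piece.
    pattern = r"(\d)" if individual_digits else r"(\d+)"
    # The capturing group makes re.split keep the digits; empty pieces
    # that appear between adjacent matches are dropped.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_digits("Call 123 please"))
print(split_digits("Call 123 please", individual_digits=True))
```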
Anthony MOI
b8f1eb48cb Python - Bump version for 0.9.0.dev1 release 2020-09-02 22:31:01 -04:00
Nicolas Patry
558e76f18e Expose the trainer to Python bindings. 2020-09-02 13:38:14 -04:00
Nicolas Patry
c0798acacf Address @n1t0 comments. 2020-09-02 16:32:50 +02:00
Nicolas Patry
95e126cd82 Missed *.pyi file. 2020-09-02 16:32:50 +02:00
Nicolas Patry
dd91739ba0 Now spm_parity_check succeeds because we have the correct pre_tokenizer. 2020-09-02 16:32:50 +02:00
Nicolas Patry
e974cfb1c9 Formatting after rebase. 2020-09-02 16:32:50 +02:00
Nicolas Patry
439305eea0 Failing test for compatibility for SentencePieceUnigramTokenizer.
- We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A).
  Could be linked to float precision and hard or impossible to fix
(should not hinder model performance)

- We are now fusing_unk by default as it's the case with spm_train

- We are still failing on at least space deduplication. Probably should
  be handled by a pre-tokenizer.
2020-09-02 16:32:50 +02:00
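
The ambiguity mentioned in the commit above (AAA -> A + AA vs AA + A) can be illustrated with a toy unigram scorer. This is a sketch under stated assumptions: the vocabulary and its probabilities are made up, and the brute-force enumeration is not how SentencePiece or this library searches for the best segmentation. It only shows that the two splits receive the same score, so the result depends on tie-breaking.

```python
import math

# Hypothetical vocabulary with made-up log-probabilities.
VOCAB = {"A": math.log(0.6), "AA": math.log(0.4)}


def segmentations(text, vocab):
    """Enumerate every way to cut `text` into pieces found in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def score(segmentation, vocab=VOCAB):
    """Unigram score: the sum of the pieces' log-probabilities."""
    return sum(vocab[piece] for piece in segmentation)


candidates = segmentations("AAA", VOCAB)
best = max(score(seg) for seg in candidates)
# Both A + AA and AA + A reach the best score, so the winner is arbitrary.
tied = [seg for seg in candidates if score(seg) == best]
print(tied)
```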
Nicolas Patry
76b86f6901 Removing forgotten places. 2020-08-31 14:05:39 -04:00
Nicolas Patry
857948e5b8 Addressing comments:
- Remove Deduplication in favor of WhitespaceSplit.
- Updated comments
2020-08-31 14:05:39 -04:00
Nicolas Patry
6887c0f04d Black pass. 2020-08-31 14:05:39 -04:00
Nicolas Patry
7ed7f0f26a Adding 3 new PreTokenizers:
- Deduplication : Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pretokenizers iteratively
2020-08-31 14:05:39 -04:00
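
The three behaviours listed in the commit above can be sketched in plain Python to show how they compose. The function names below are hypothetical and the code does not use the library's classes. It assumes a pre-tokenizer is a function from a list of string pieces to a new list of pieces, and that "punctuation" means the ASCII set in `string.punctuation`.

```python
import re
import string


def deduplicate_spaces(pieces):
    """Collapse runs of spaces inside each piece into a single space."""
    return [re.sub(r" {2,}", " ", piece) for piece in pieces]


def split_punctuation(pieces):
    """Isolate each ASCII punctuation character as its own piece."""
    pattern = "([" + re.escape(string.punctuation) + "])"
    result = []
    for piece in pieces:
        # The capturing group keeps the punctuation; drop empty pieces.
        result.extend(part for part in re.split(pattern, piece) if part)
    return result


def sequence(steps):
    """Build a pre-tokenizer that applies `steps` one after another."""
    def run(text):
        pieces = [text]
        for step in steps:
            pieces = step(pieces)
        return pieces
    return run


pre_tokenize = sequence([deduplicate_spaces, split_punctuation])
print(pre_tokenize("Hello,   world!"))
```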
Anthony MOI
c036cd4ced Python - Bump version for 0.9.0.dev0 release 2020-08-21 18:52:29 -04:00
Anthony MOI
504d8c85d8 Remove Tokenizer::normalize
This is actually a legacy function that doesn't really make sense now, and is getting really difficult to keep. So we remove it.
2020-08-19 12:42:12 -04:00
Anthony MOI
7833965dc4 Update Python bindings with new interface 2020-08-03 16:18:59 -04:00
Sebastian Pütz
0d7c232f95 Move Python source to subdirectory.
This allows testing versions not built in-place. Otherwise
importing (or testing) in the package root fails without develop
builds.
Replace maturin with setuptools_rust since maturin fails with
proper project structure.
2020-07-25 23:40:47 +02:00