Anthony MOI
337fe72b13
Python - Bindings for TemplateProcessing
2020-09-10 15:04:19 -04:00
Nicolas Patry
df827d538f
Adding clippy as a linter within the Python binding. (#388)
...
* Adding clippy as a linter within the Python binding.
* Missing clippy (dropped commit ??)
2020-09-04 09:09:02 -04:00
Nicolas Patry
efa20202dc
Addressing @n1t0's comments.
2020-09-04 11:57:01 +02:00
Nicolas Patry
7b2caca764
Adding a new pre_tokenizer: Digits.
...
Easier to split on digits:
Digits(individual_digits=False) -> 'Call 123 please' becomes 'Call ',
'123', 'please'
Digits(individual_digits=True) -> 'Call 123 please' becomes 'Call ',
'1', '2', '3', 'please'
2020-09-03 21:03:45 +02:00
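The splitting behaviour described in the Digits entry above can be sketched in plain Python. This is a minimal illustration only, not the binding's implementation: the function name `split_digits` is hypothetical, and the sketch assumes that surrounding whitespace stays attached to the non-digit pieces (so the last piece comes out as ' please').

```python
import re


def split_digits(text, individual_digits=False):
    """Hypothetical helper: isolate digits from the rest of the text.

    With individual_digits=False each run of digits is one piece;
    with individual_digits=True every digit is its own piece.
    """
    pattern = r"\d" if individual_digits else r"\d+"
    pieces = []
    last = 0
    for match in re.finditer(pattern, text):
        # Keep whatever precedes the digits as its own piece (assumption:
        # whitespace is left untouched).
        if match.start() > last:
            pieces.append(text[last:match.start()])
        pieces.append(match.group())
        last = match.end()
    # Keep the trailing non-digit text, if any.
    if last < len(text):
        pieces.append(text[last:])
    return pieces


print(split_digits("Call 123 please"))
print(split_digits("Call 123 please", individual_digits=True))
```

The two calls differ only in whether the run `123` stays whole or is broken into `1`, `2`, `3`.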
Anthony MOI
b8f1eb48cb
Python - Bump version for 0.9.0.dev1 release
2020-09-02 22:31:01 -04:00
Nicolas Patry
816632c9fa
Removing --release compat test.
...
- Leaving the one that checks that sampling follows the expected
distribution.
- Marking the python Unigram.train(..) test as slow
- The python Unigram.train(..) test now uses `big.txt` file.
2020-09-02 13:38:14 -04:00
Nicolas Patry
d0366529b7
Use a smaller train file.
2020-09-02 13:38:14 -04:00
Nicolas Patry
7b5c2b92c6
Fixing test dependency.
2020-09-02 13:38:14 -04:00
Nicolas Patry
ee3860c029
Enabling training parity check for tokenizers.UnigramTrainer
2020-09-02 13:38:14 -04:00
Nicolas Patry
558e76f18e
Expose the trainer to Python bindings.
2020-09-02 13:38:14 -04:00
Nicolas Patry
52082b5476
New clippy comments?
2020-09-02 16:32:50 +02:00
Nicolas Patry
c0798acacf
Address @n1t0 comments.
2020-09-02 16:32:50 +02:00
Nicolas Patry
d624645cf3
Attempting to add UnigramTrainer to python bindings.
2020-09-02 16:32:50 +02:00
Nicolas Patry
95e126cd82
Missed *.pyi file.
2020-09-02 16:32:50 +02:00
Nicolas Patry
dd91739ba0
Now spm_parity_check succeeds because we have the correct pre_tokenizer.
2020-09-02 16:32:50 +02:00
Nicolas Patry
e974cfb1c9
Formatting after rebase.
2020-09-02 16:32:50 +02:00
Nicolas Patry
439305eea0
Failing test for compatibility for SentencePieceUnigramTokenizer.
...
- We are failing on ambiguous tokenization (AAA -> A + AA vs AA + A).
Could be linked to float precision and hard or impossible to fix
(should not hinder model performance)
- We are now fusing_unk by default, as is the case with spm_train
- We are still failing on at least space deduplication. This should
probably be handled by a pre-tokenizer.
2020-09-02 16:32:50 +02:00
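The ambiguity mentioned in the entry above (AAA -> A + AA vs AA + A) can be reproduced with a toy Viterbi segmentation. This is a sketch under stated assumptions: the vocabulary, the log-probabilities, and the `prefer_later` flag are all invented for illustration and are not taken from the library or from spm_train. It shows only that two segmentations can have exactly the same score, so the result depends on how ties are broken.

```python
import math


def viterbi_segment(text, log_probs, prefer_later=False):
    """Toy unigram segmentation: best-scoring split of text into known pieces.

    prefer_later is a hypothetical switch: when two candidates tie, it keeps
    the one found last instead of the one found first.
    """
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs or best[start][0] == -math.inf:
                continue
            score = best[start][0] + log_probs[piece]
            is_better = score > best[end][0]
            is_tie = prefer_later and score == best[end][0]
            if is_better or is_tie:
                best[end] = (score, start)
    if n > 0 and best[n][1] is None:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk back from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Invented scores: A + AA and AA + A both sum to -2.5.
toy_log_probs = {"A": -1.0, "AA": -1.5}
print(viterbi_segment("AAA", toy_log_probs))
print(viterbi_segment("AAA", toy_log_probs, prefer_later=True))
```

With real floating-point scores, a near-tie can flip in the same way between two implementations, which is consistent with the float-precision hypothesis in the entry, though the sketch does not establish that this is the actual cause.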
Anthony MOI
bd8dac202c
Add failing test for from_file
2020-09-01 09:53:50 -04:00
Nicolas Patry
76b86f6901
Removing forgotten places.
2020-08-31 14:05:39 -04:00
Nicolas Patry
857948e5b8
Addressing comments:
...
- Remove Deduplication in favor of WhitespaceSplit.
- Updated comments
2020-08-31 14:05:39 -04:00
Nicolas Patry
1994dcad6e
Re-enabling Custom Serialize
2020-08-31 14:05:39 -04:00
Nicolas Patry
6887c0f04d
Black pass.
2020-08-31 14:05:39 -04:00
Nicolas Patry
7ed7f0f26a
Adding 3 new PreTokenizers:
...
- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pretokenizers iteratively
2020-08-31 14:05:39 -04:00
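How the three pre-tokenizers listed in the entry above compose can be sketched in plain Python. The function names below are hypothetical stand-ins, not the binding's classes, and the sketch assumes ASCII punctuation only and that each step maps a list of pieces to a new list of pieces.

```python
import re
import string


def deduplicate_spaces(pieces):
    """Collapse runs of two or more spaces into a single space."""
    return [re.sub(r" {2,}", " ", piece) for piece in pieces]


def split_punctuation(pieces):
    """Isolate each ASCII punctuation character as its own piece."""
    pattern = "([" + re.escape(string.punctuation) + "])"
    out = []
    for piece in pieces:
        # The capturing group keeps the punctuation; empty strings are dropped.
        out.extend(part for part in re.split(pattern, piece) if part)
    return out


def sequence(steps):
    """Apply the given steps one after another, feeding each the previous output."""
    def run(text):
        pieces = [text]
        for step in steps:
            pieces = step(pieces)
        return pieces
    return run


pre_tokenize = sequence([deduplicate_spaces, split_punctuation])
print(pre_tokenize("Hello,   world!"))
```

Passing the steps as an ordered list makes the iteration order explicit: space deduplication runs on the whole string before punctuation is split off.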
Anthony MOI
c036cd4ced
Python - Bump version for 0.9.0.dev0 release
2020-08-21 18:52:29 -04:00
Anthony MOI
32a76b0331
Update CHANGELOGs
2020-08-21 18:52:15 -04:00
Anthony MOI
3d1322f108
Python - Improve and Test EncodeInput extraction
2020-08-21 18:39:49 -04:00
Anthony MOI
14adf18e5b
Python - Extract single pre-tokenized inputs from np.array
2020-08-21 18:39:49 -04:00
Anthony MOI
d919d68889
Python - InputSequence with references when possible
2020-08-21 18:39:49 -04:00
Anthony MOI
504d8c85d8
Remove Tokenizer::normalize
...
This is a legacy function that no longer makes sense and is becoming difficult to maintain, so we remove it.
2020-08-19 12:42:12 -04:00
Anthony MOI
f92c9955e7
Python - Update bindings
2020-08-19 12:42:12 -04:00
Sebastian Pütz
10a39ba6b4
Add in-place train.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
ac8af63f70
Trainers don't need Arc.
2020-08-04 15:59:33 -04:00
Anthony MOI
363adedb4c
Fixes and cleanup, suggestions by @n1t0.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
f6adcf0e7c
Remove typetag, bump deps.
2020-08-04 15:59:33 -04:00
Sebastian Puetz
16f75d9efc
Ensure serialization works in all expected ways.
2020-08-04 15:59:33 -04:00
Sebastian Puetz
aaf8e932b1
Remove Send + Sync requirements from Model.
2020-08-04 15:59:33 -04:00
Sebastian Puetz
42b810488f
Hide generics
2020-08-04 15:59:33 -04:00
Sebastian Pütz
d62adf7195
Remove Container, changes to PyDecoder, cloneable Tokenizer.
...
* derive Clone on Tokenizer and AddedVocabulary.
* Replace Container with Arc wrapper for Decoders.
* Prefix Rust Decoder types with Py.
* Rename PyDecoder to CustomDecoder.
* Change panic in serializing custom decoder to exception.
* Re-enable training with cloneable Tokenizer.
* Remove unsound Container, use Arc wrappers instead.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
11e86a16c5
Remove Container from PostProcessors, replace with Arc.
...
* prefix the Python types in Rust with Py.
* remove unsound Container wrappers, replace with Arc.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
b411443128
Remove Container from PreTokenizers, replace with Arc.
...
* prefix the Python types in Rust with Py, rename PyPretokenizer
to CustomPretokenizer
* remove unsound Container wrappers, replace with Arc
* change panic on trying to (de-)serialize custom pretokenizer to
exception
2020-08-04 15:59:33 -04:00
Sebastian Pütz
08b8c48127
Remove Container from Normalizers, replace with Arc.
...
* prefix the Python types in Rust with Py
* remove unsound Container wrappers, replace with Arc
2020-08-04 15:59:33 -04:00
Sebastian Pütz
83a52c8080
Replace Model and Trainer Containers.
...
* Implement changes necessary from generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be
derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
2020-08-04 15:59:33 -04:00
Anthony MOI
dad70e8e85
Implement suggestions by @sebpuetz
...
Co-authored-by: Sebastian Pütz <sebastian.puetz@uni-tuebingen.de>
2020-08-03 16:18:59 -04:00
Anthony MOI
7833965dc4
Update Python bindings with new interface
2020-08-03 16:18:59 -04:00
Anthony MOI
904ff24382
New API for PreTokenizer and Model + refactor Tokenizer - WIP
2020-08-03 16:18:59 -04:00
Sebastian Pütz
27e326ab2b
Fix deadlocks with custom python components.
2020-08-03 16:17:17 -04:00
Sebastian Pütz
0d7c232f95
Move Python source to subdirectory.
...
This allows testing versions not built in-place. Otherwise
importing (or testing) in the package root fails without develop
builds.
Replace maturin with setuptools_rust since maturin fails with
proper project structure.
2020-07-25 23:40:47 +02:00
Anthony MOI
c901f86d52
Python - Bump version for 0.8.1
2020-07-20 16:33:48 -04:00
Anthony MOI
157feed9a5
Python - Bump version for 0.8.1.rc2
2020-07-17 13:12:23 -04:00
Setu Shah
1f2cc6ee73
Include license in PyPI package
2020-07-16 14:20:32 -04:00