Commit Graph

416 Commits

Author SHA1 Message Date
Nicolas Patry
1994dcad6e Re-enabling Custom Serialize 2020-08-31 14:05:39 -04:00
Nicolas Patry
6887c0f04d Black pass. 2020-08-31 14:05:39 -04:00
Nicolas Patry
7ed7f0f26a Adding 3 new PreTokenizers:
- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pretokenizers iteratively
2020-08-31 14:05:39 -04:00
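
The three behaviors named in this commit message can be pictured with a minimal pure-Python sketch. It is not the library's implementation or API: the function names `deduplication`, `punctuation`, and `sequence` are hypothetical stand-ins chosen to mirror the commit message, and the sketch works on plain lists of strings and does not track offsets.

```python
import re
import string
from typing import Callable, List

# Hypothetical type for this sketch: a step maps a list of pieces to a new list.
PreTokenizerFn = Callable[[List[str]], List[str]]

# Character class matching any single ASCII punctuation character.
_PUNCT = re.compile("([" + re.escape(string.punctuation) + "])")


def deduplication(pieces: List[str]) -> List[str]:
    """Collapse runs of two or more spaces into one space in each piece."""
    return [re.sub(r" {2,}", " ", piece) for piece in pieces]


def punctuation(pieces: List[str]) -> List[str]:
    """Split each piece so every punctuation character becomes its own token."""
    out: List[str] = []
    for piece in pieces:
        # The capturing group keeps the punctuation; empty strings are dropped.
        out.extend(tok for tok in _PUNCT.split(piece) if tok)
    return out


def sequence(steps: List[PreTokenizerFn]) -> PreTokenizerFn:
    """Build a step that applies the given steps one after another."""
    def run(pieces: List[str]) -> List[str]:
        for step in steps:
            pieces = step(pieces)
        return pieces
    return run


pipeline = sequence([deduplication, punctuation])
result = pipeline(["Hello,   world!"])
print(result)
```

Composing the steps through `sequence` keeps each one small and independently testable, which is presumably the motivation for a Sequence component, though the commit itself does not say so.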
Anthony MOI
c036cd4ced Python - Bump version for 0.9.0.dev0 release 2020-08-21 18:52:29 -04:00
Anthony MOI
32a76b0331 Update CHANGELOGs 2020-08-21 18:52:15 -04:00
Anthony MOI
3d1322f108 Python - Improve and Test EncodeInput extraction 2020-08-21 18:39:49 -04:00
Anthony MOI
14adf18e5b Python - Extract single pre-tokenized inputs from np.array 2020-08-21 18:39:49 -04:00
Anthony MOI
d919d68889 Python - InputSequence with references when possible 2020-08-21 18:39:49 -04:00
Anthony MOI
504d8c85d8 Remove Tokenizer::normalize
This is a legacy function that no longer makes sense and is getting really difficult to maintain, so we remove it.
2020-08-19 12:42:12 -04:00
Anthony MOI
f92c9955e7 Python - Update bindings 2020-08-19 12:42:12 -04:00
Sebastian Pütz
10a39ba6b4 Add in-place train. 2020-08-04 15:59:33 -04:00
Sebastian Pütz
ac8af63f70 Trainers don't need Arc. 2020-08-04 15:59:33 -04:00
Anthony MOI
363adedb4c Fixes and cleanup, suggestions by @n1t0. 2020-08-04 15:59:33 -04:00
Sebastian Pütz
f6adcf0e7c Remove typetag, bump deps. 2020-08-04 15:59:33 -04:00
Sebastian Puetz
16f75d9efc Ensure serialization works in all expected ways. 2020-08-04 15:59:33 -04:00
Sebastian Puetz
aaf8e932b1 Remove Send + Sync requirements from Model. 2020-08-04 15:59:33 -04:00
Sebastian Puetz
42b810488f Hide generics 2020-08-04 15:59:33 -04:00
Sebastian Pütz
d62adf7195 Remove Container, changes to PyDecoder, cloneable Tokenizer.
* derive Clone on Tokenizer and AddedVocabulary.
* Replace Container with Arc wrapper for Decoders.
* Prefix Rust Decoder types with Py.
* Rename PyDecoder to CustomDecoder.
* Change panic in serializing custom decoder to exception.
* Re-enable training with cloneable Tokenizer.
* Remove unsound Container, use Arc wrappers instead.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
11e86a16c5 Remove Container from PostProcessors, replace with Arc.
* prefix the Python types in Rust with Py.
* remove unsound Container wrappers, replace with Arc.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
b411443128 Remove Container from PreTokenizers, replace with Arc.
* prefix the Python types in Rust with Py, rename PyPretokenizer
  to CustomPretokenizer
* remove unsound Container wrappers, replace with Arc
* change panic on trying to (de-)serialize custom pretokenizer to
  exception
2020-08-04 15:59:33 -04:00
Sebastian Pütz
08b8c48127 Remove Container from Normalizers, replace with Arc.
* prefix the Python types in Rust with Py
* remove unsound Container wrappers, replace with Arc
2020-08-04 15:59:33 -04:00
Sebastian Pütz
83a52c8080 Replace Model and Trainer Containers.
* Implement changes necessary from generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be
  derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
2020-08-04 15:59:33 -04:00
Anthony MOI
dad70e8e85 Implement suggestions by @sebpuetz
Co-authored-by: Sebastian Pütz <sebastian.puetz@uni-tuebingen.de>
2020-08-03 16:18:59 -04:00
Anthony MOI
7833965dc4 Update Python bindings with new interface 2020-08-03 16:18:59 -04:00
Anthony MOI
904ff24382 New API for PreTokenizer and Model + refactor Tokenizer - WIP 2020-08-03 16:18:59 -04:00
Sebastian Pütz
27e326ab2b Fix deadlocks with custom python components. 2020-08-03 16:17:17 -04:00
Sebastian Pütz
0d7c232f95 Move Python source to subdirectory.
This allows testing versions not built in-place. Otherwise
importing (or testing) in the package root fails without develop
builds.
Replace maturin with setuptools_rust since maturin fails with
proper project structure.
2020-07-25 23:40:47 +02:00
Anthony MOI
c901f86d52 Python - Bump version for 0.8.1 2020-07-20 16:33:48 -04:00
Anthony MOI
157feed9a5 Python - Bump version for 0.8.1.rc2 2020-07-17 13:12:23 -04:00
Setu Shah
1f2cc6ee73 Include license in PyPI package 2020-07-16 14:20:32 -04:00
Anthony MOI
5be375eaea Update CHANGELOGs and bump version for python release 2020-07-06 15:21:47 -04:00
Anthony MOI
e874641cf9 Merge pull request #333 from huggingface/fix-added-tokens
Python - Fix Added token deserialization
2020-07-06 14:52:37 -04:00
Anthony MOI
2194970679 Merge pull request #330 from huggingface/bert-normalization
Improve BertNormalizer behavior
2020-07-06 14:52:23 -04:00
Anthony MOI
d33af1a3be Python - Fix Added token deserialization 2020-07-06 14:46:12 -04:00
Anthony MOI
7a95ffc4fa BertNormalizer has the same behavior as the original implementation 2020-07-06 13:55:18 -04:00
Anthony MOI
8bf482cecc Improve parallelism tracking and warning 2020-07-06 13:05:14 -04:00
आलोक
6fe284dd8d Use supplied UNK token even when vocab absent
If no vocab file is provided, the supplied unk token (when different from [UNK]) is ignored, and encoding an input string that contains an unknown token later fails with:
Exception: WordPiece error: Missing [UNK] token from the vocabulary
2020-07-05 19:01:04 +05:30
Anthony MOI
6349ca51b3 Python - Bump version for 0.8.0 release 2020-06-26 16:12:26 -04:00
Anthony MOI
8ae1982149 Finally it will be rc4 for transformers 2020-06-26 15:36:08 -04:00
Anthony MOI
5a653869af Try local version for transformers 2020-06-26 15:19:00 -04:00
Anthony MOI
1a08b21329 Python - Bump version for 0.8.0.transformers release 2020-06-26 14:37:22 -04:00
Anthony MOI
bb668bc439 Try with target_family = unix 2020-06-23 16:52:21 -04:00
Anthony MOI
f8b1630aa6 Update CHANGELOGs 2020-06-23 13:32:21 -04:00
Anthony MOI
aa3b39f692 Python - Tests for parallelism with multiprocessing
Co-authored-by: Evan Pete Walsh <epwalsh10@gmail.com>
2020-06-23 11:25:39 -04:00
Anthony MOI
ae743f5dc1 Python - Automatically disable parallelism after fork 2020-06-22 20:31:52 -04:00
Anthony MOI
5d20322319 Rust - Fix optional parallelism with par_bridge 2020-06-22 20:31:52 -04:00
Anthony MOI
dce52621c6 Rust - Make parallelism optional 2020-06-22 20:31:52 -04:00
Anthony MOI
74d812d401 Python - Bump version to 0.8.0.rc3 for release 2020-06-22 12:54:31 -04:00
Anthony MOI
c02d4e2202 Python - Improve AddedToken interface 2020-06-19 17:53:46 -04:00
Anthony MOI
a14cd7b219 Python - Bump version to 0.8.0.rc2 for release 2020-06-19 10:48:53 -04:00