300 Commits

Author SHA1 Message Date
Anthony MOI
a9be177185 Update CHANGELOGs 2020-03-10 13:12:34 -04:00
Anthony MOI
28f022058c Keep default values as true 2020-03-10 12:58:53 -04:00
Anthony MOI
45f3eaaf72 Update bindings and typings 2020-03-10 12:28:24 -04:00
Anthony MOI
efbbfea558 Update ByteLevel PostProcessor 2020-03-10 12:05:04 -04:00
Anthony MOI
7e9003ccb7 Python - Update bindings 2020-03-09 18:37:03 -04:00
Anthony MOI
86d2e90ad2 Update CHANGELOGs 2020-03-06 17:44:44 -05:00
Anthony MOI
d778ed5e0a Python - Update README and implementation 2020-03-06 17:44:44 -05:00
Anthony MOI
52180a9179 Python - Add ByteLevel PostProcessor 2020-03-06 17:44:44 -05:00
Anthony MOI
b60eef5245 Python - Make style 2020-03-06 17:44:44 -05:00
Anthony MOI
d8e7a830b2 Update CHANGELOGs 2020-03-06 17:44:34 -05:00
Anthony MOI
b2e5f54b6f Python - Fix ByteLevelBPETokenizer implementation 2020-03-06 17:44:03 -05:00
Anthony MOI
f1460fadb9 Python - Update docs and implementations 2020-03-06 17:44:03 -05:00
Anthony MOI
2393506dc7 Python - Add ByteLevel Normalizer 2020-03-06 17:44:03 -05:00
Anthony MOI
47cef0e13a Python - Fix BPE and WordPiece builders usage 2020-03-06 12:20:39 -05:00
Anthony MOI
4b596e19dd Rust - Improve training progress for multiple files 2020-03-03 11:04:24 -05:00
Anthony MOI
8e791791d1 Python - prepare for release 2020-03-02 14:56:42 -05:00
Anthony MOI
4deeb9511f Update CHANGELOGs 2020-03-02 14:37:17 -05:00
Anthony MOI
f8f0702d98 Fix LongestFirst truncation strategy 2020-02-29 16:26:13 -05:00
Anthony MOI
657f8b6c15 Rust & Python - Update CHANGELOGs 2020-02-26 11:30:44 -05:00
Anthony MOI
3b10d640d5 Rust & Python - Update CHANGELOGs 2020-02-26 10:51:40 -05:00
Anthony MOI
2425fe877d Python - Update CHANGELOG 2020-02-26 09:31:17 -05:00
Anthony MOI
61b4c9c30a Python - Add missing tokens to BertWordPieceTokenizer 2020-02-26 09:21:54 -05:00
Anthony MOI
440e8e9bd9 Python - Bump version for release 2020-02-24 16:08:49 -05:00
Anthony MOI
be08d9574c Python - Add Changelog 2020-02-24 10:12:50 -05:00
Anthony MOI
999088ef94 Python - Bump version for release 2020-02-24 09:56:08 -05:00
Morgan Funtowicz
817b760ab9 Make name parameter Optional[str] on BaseTokenizer
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 14:57:43 +01:00
Morgan Funtowicz
d274a7691d Avoid breaking changes and let parameter name be Optional.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 14:56:59 +01:00
Morgan Funtowicz
0fc8be9d69 Formatting for python binding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 00:17:44 +01:00
Morgan Funtowicz
f88a6b40ac Make parameter name on Model.save() optional.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 00:01:32 +01:00
Anthony MOI
11dd6c8bae Python - Bump version for release 2020-02-18 18:49:11 -05:00
Anthony MOI
41929462c7 Python - Add classifiers 2020-02-18 18:48:21 -05:00
Anthony MOI
d8a73c89a7 Python - Add Encoding length 2020-02-18 18:24:13 -05:00
Anthony MOI
d48fdbe057 Python - Only add special tokens when in-vocabulary 2020-02-18 17:27:27 -05:00
Anthony MOI
5daf1eea86 Python - Replace last BPETokenizer occurrences 2020-02-18 16:25:59 -05:00
Anthony MOI
f263d7651f Python - RustFmt 2020-02-18 15:07:34 -05:00
Anthony MOI
8e9fae6be4 Python - Add check-style to Makefile 2020-02-18 11:11:07 -05:00
Anthony MOI
81be207819 Python - Black auto formatting 2020-02-18 10:45:36 -05:00
Anthony MOI
4706151c32 Python - Add Makefile with Black formatting 2020-02-18 10:45:10 -05:00
Anthony MOI
1509f747af Python - Uniformize implementations parameters 2020-02-18 10:27:10 -05:00
MOI Anthony
3512bd3400 Merge pull request #149 from colinclement/master
Allow dropout option in ByteLevelBPETokenizer
2020-02-18 09:59:40 -05:00
Morgan Funtowicz
891dd4adb8 Fix invalid num_added_tokens method call in BaseTokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-17 15:32:34 +01:00
Funtowicz Morgan
bb8321ac0d Add Strip normalizer (#140)
* WIP strip.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rust StripNormalizer

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow specifying the strip direction

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed StripNormalizer to Strip

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Python binding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Strip python compatible with pythonic constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Run RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Clippy next ofc.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move lstrip and rstrip on NormalizedString

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* implement strip() for normalizer + unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add some more unittests on edge cases.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy and fmt.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Simplify strip and fix offsets

* Python - Update strip bindings with default values

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-17 11:26:40 +01:00
Colin Clement
e591cfce7b pass through dropout option in ByteLevelBPETokenizer 2020-02-15 01:58:55 +00:00
MOI Anthony
3cac26cdb2 Merge pull request #147 from huggingface/wordpiece-cleanup
Wordpiece Decoder cleanup
2020-02-14 13:12:15 -05:00
Funtowicz Morgan
c4bac6aeeb Expose num_added_tokens on Python side (#146)
* Expose num_added_tokens on the Python side without needing to pass an Encoding to added_tokens.

This makes it possible to compute the max sentence length for single/pair inputs without needing an Encoding structure.
Since the number of added tokens is fixed and static during compilation, this allows more flexible usage of the method.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed num_added_tokens to num_special_tokens_to_add.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-14 10:55:20 +00:00
Anthony MOI
1907b74d1c Python - Bindings for Wordpiece decoder's cleanup 2020-02-13 17:50:37 -05:00
Anthony MOI
5bd93ee822 Python - hotfix BertWordPieceTokenizer decoder 2020-02-13 16:31:00 -05:00
Anthony MOI
bbbd97c7e1 Python - Bump version for release 2020-02-11 08:15:11 -05:00
Anthony MOI
08ce105195 Python - Hotfix WordPieceTrainer constructor 2020-02-11 08:13:57 -05:00
Anthony MOI
c1ddfdac8c Python - bump version for release 2020-02-10 23:23:27 -05:00