Commit Graph

257 Commits

Author SHA1 Message Date
Anthony MOI
d953d58cee Rust - Fix offsets when there are added tokens 2020-03-19 12:53:03 -04:00
Anthony MOI
d53de0e2da Python - Expose normalize on BaseTokenizer 2020-03-18 16:44:31 -04:00
Anthony MOI
ae0d330907 Update CHANGELOGs 2020-03-18 16:42:27 -04:00
Anthony MOI
60a4fb35f4 Python - Update bindings 2020-03-16 10:36:42 -04:00
Morgan Funtowicz
505bfbba82 Fix invalid error messages. 2020-03-12 15:38:29 +01:00
Morgan Funtowicz
5ed1f26c71 Throw a more meaningful error when provided python input is None. 2020-03-12 10:59:05 +01:00
Anthony MOI
257360acec Python - encode & encode batch with add_special_tokens 2020-03-10 16:21:10 -04:00
Anthony MOI
a9be177185 Update CHANGELOGs 2020-03-10 13:12:34 -04:00
Anthony MOI
28f022058c Keep default values as true 2020-03-10 12:58:53 -04:00
Anthony MOI
45f3eaaf72 Update bindings and typings 2020-03-10 12:28:24 -04:00
Anthony MOI
efbbfea558 Update ByteLevel PostProcessor 2020-03-10 12:05:04 -04:00
Anthony MOI
7e9003ccb7 Python - Update bindings 2020-03-09 18:37:03 -04:00
Anthony MOI
86d2e90ad2 Update CHANGELOGs 2020-03-06 17:44:44 -05:00
Anthony MOI
d778ed5e0a Python - Update README and implementation 2020-03-06 17:44:44 -05:00
Anthony MOI
52180a9179 Python - Add ByteLevel PostProcessor 2020-03-06 17:44:44 -05:00
Anthony MOI
b60eef5245 Python - Make style 2020-03-06 17:44:44 -05:00
Anthony MOI
d8e7a830b2 Update CHANGELOGs 2020-03-06 17:44:34 -05:00
Anthony MOI
b2e5f54b6f Python - Fix ByteLevelBPETokenizer implementation 2020-03-06 17:44:03 -05:00
Anthony MOI
f1460fadb9 Python - Update docs and implementations 2020-03-06 17:44:03 -05:00
Anthony MOI
2393506dc7 Python - Add ByteLevel Normalizer 2020-03-06 17:44:03 -05:00
Anthony MOI
47cef0e13a Python - Fix BPE and WordPiece builders usage 2020-03-06 12:20:39 -05:00
Anthony MOI
4b596e19dd Rust - Improve training progress for multiple files 2020-03-03 11:04:24 -05:00
Anthony MOI
8e791791d1 Python - prepare for release 2020-03-02 14:56:42 -05:00
Anthony MOI
4deeb9511f Update CHANGELOGs 2020-03-02 14:37:17 -05:00
Anthony MOI
f8f0702d98 Fix LongestFirst truncation strategy 2020-02-29 16:26:13 -05:00
Anthony MOI
657f8b6c15 Rust & Python - Update CHANGELOGs 2020-02-26 11:30:44 -05:00
Anthony MOI
3b10d640d5 Rust & Python - Update CHANGELOGs 2020-02-26 10:51:40 -05:00
Anthony MOI
2425fe877d Python - Update CHANGELOG 2020-02-26 09:31:17 -05:00
Anthony MOI
61b4c9c30a Python - Add missing tokens to BertWordPieceTokenizer 2020-02-26 09:21:54 -05:00
Anthony MOI
440e8e9bd9 Python - Bump version for release 2020-02-24 16:08:49 -05:00
Anthony MOI
be08d9574c Python - Add Changelog 2020-02-24 10:12:50 -05:00
Anthony MOI
999088ef94 Python - Bump version for release 2020-02-24 09:56:08 -05:00
Morgan Funtowicz
817b760ab9 Make name parameter Optional[str] on BaseTokenizer
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 14:57:43 +01:00
Morgan Funtowicz
d274a7691d Avoid breaking changes and let parameter name be Optional.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 14:56:59 +01:00
Morgan Funtowicz
0fc8be9d69 Formatting for python binding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 00:17:44 +01:00
Morgan Funtowicz
f88a6b40ac Make parameter name on Model.save() optional.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 00:01:32 +01:00
Anthony MOI
11dd6c8bae Python - Bump version for release 2020-02-18 18:49:11 -05:00
Anthony MOI
41929462c7 Python - Add classifiers 2020-02-18 18:48:21 -05:00
Anthony MOI
d8a73c89a7 Python - Add Encoding length 2020-02-18 18:24:13 -05:00
Anthony MOI
d48fdbe057 Python - Only add special tokens when in-vocabulary 2020-02-18 17:27:27 -05:00
Anthony MOI
5daf1eea86 Python - Replace last BPETokenizer occurrences 2020-02-18 16:25:59 -05:00
Anthony MOI
f263d7651f Python - RustFmt 2020-02-18 15:07:34 -05:00
Anthony MOI
8e9fae6be4 Python - Add check-style to Makefile 2020-02-18 11:11:07 -05:00
Anthony MOI
81be207819 Python - Black auto formatting 2020-02-18 10:45:36 -05:00
Anthony MOI
4706151c32 Python - Add Makefile with Black formatting 2020-02-18 10:45:10 -05:00
Anthony MOI
1509f747af Python - Uniformize implementations parameters 2020-02-18 10:27:10 -05:00
MOI Anthony
3512bd3400 Merge pull request #149 from colinclement/master
Allow dropout option in ByteLevelBPETokenizer
2020-02-18 09:59:40 -05:00
Morgan Funtowicz
891dd4adb8 Fix invalid num_added_tokens method call in BaseTokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-17 15:32:34 +01:00
Funtowicz Morgan
bb8321ac0d Add Strip normalizer (#140)
* WIP strip.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rust StripNormalizer

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to specify strip direction

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed StripNormalizer to Strip

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Python binding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Strip python compatible with pythonic constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Run RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Clippy next ofc.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move lstrip and rstrip on NormalizedString

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* implement strip() for normalizer + unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add some more unittests on edge cases.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy and fmt.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Simplify strip and fix offsets

* Python - Update strip bindings with default values

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-17 11:26:40 +01:00
Colin Clement
e591cfce7b pass through dropout option in ByteLevelBPETokenizer 2020-02-15 01:58:55 +00:00