Commit Graph

555 Commits

Author SHA1 Message Date
Anthony MOI
f263d7651f Python - RustFmt 2020-02-18 15:07:34 -05:00
Anthony MOI
cdd8f49120 Python - Add CI with build and quality steps 2020-02-18 15:07:34 -05:00
Anthony MOI
8e9fae6be4 Python - Add check-style to Makefile 2020-02-18 11:11:07 -05:00
Anthony MOI
81be207819 Python - Black auto formatting 2020-02-18 10:45:36 -05:00
Anthony MOI
4706151c32 Python - Add Makefile with Black formatting 2020-02-18 10:45:10 -05:00
Anthony MOI
1509f747af Python - Uniformize implementations parameters 2020-02-18 10:27:10 -05:00
MOI Anthony
3512bd3400 Merge pull request #149 from colinclement/master
Allow dropout option in ByteLevelBPETokenizer
2020-02-18 09:59:40 -05:00
Morgan Funtowicz
891dd4adb8 Fix invalid num_added_tokens method call in BaseTokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-17 15:32:34 +01:00
Funtowicz Morgan
bb8321ac0d Add Strip normalizer (#140)
* WIP strip.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rust StripNormalizer

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to specify strip direction

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed StripNormalizer to Strip

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Python binding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Strip python compatible with pythonic constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Run RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Clippy next ofc.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move lstrip and rstrip on NormalizedString

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* implement strip() for normalizer + unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add some more unittests on edge cases.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy and fmt.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Simplify strip and fix offsets

* Python - Update strip bindings with default values

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-17 11:26:40 +01:00
Colin Clement
e591cfce7b pass through dropout option in ByteLevelBPETokenizer 2020-02-15 01:58:55 +00:00
MOI Anthony
3cac26cdb2 Merge pull request #147 from huggingface/wordpiece-cleanup
Wordpiece Decoder cleanup
2020-02-14 13:12:15 -05:00
Pierric Cistac
2aa8366a14 node: add cleanup typings on wordpiece decoder 2020-02-14 10:26:28 -05:00
Funtowicz Morgan
c4bac6aeeb Expose num_added_tokens on Python side (#146)
* Expose num_added_tokens on Python side without the need to pass an Encoding to added_tokens.

This allows computing the max sentence length for single/pair inputs without actually needing an Encoding structure.
As the number of added tokens is fixed and static during compilation, it allows more flexible usage of the method.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed num_added_tokens to num_special_tokens_to_add.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-14 10:55:20 +00:00
Anthony MOI
23de3d1cc4 Node - Bindings for Wordpiece decoder's cleanup 2020-02-13 17:52:15 -05:00
Anthony MOI
1907b74d1c Python - Bindings for Wordpiece decoder's cleanup 2020-02-13 17:50:37 -05:00
Anthony MOI
4271c173e5 Add tokenization cleanup to WordPiece Decoder 2020-02-13 17:49:43 -05:00
Anthony MOI
5bd93ee822 Python - hotfix BertWordPieceTokenizer decoder 2020-02-13 16:31:00 -05:00
MOI Anthony
a28d83a1b1 Merge pull request #138 from 0xflotus/patch-1
fixed corresponding typo
2020-02-13 11:40:56 -05:00
0xflotus
d5e3e6b3e3 fixed corresponding typo 2020-02-11 19:03:34 +01:00
Pierric Cistac
5e612669bb node: version 0.4.1 2020-02-11 09:54:17 -05:00
Anthony MOI
bbbd97c7e1 Python - Bump version for release 2020-02-11 08:15:11 -05:00
Anthony MOI
08ce105195 Python - Hotfix WordPieceTrainer constructor 2020-02-11 08:13:57 -05:00
Anthony MOI
c1ddfdac8c Python - bump version for release 2020-02-10 23:23:27 -05:00
Anthony MOI
7d4fb6faf7 Fix clippy & rustfmt warnings 2020-02-10 23:22:23 -05:00
MOI Anthony
583c741f15 Merge pull request #134 from Mansterteddy/master
Add is_bert_punc function.
2020-02-10 23:16:46 -05:00
Yuan Zhang (RELEVANCE)
7df1809c45 Add is_bert_punc function. 2020-02-11 11:04:24 +08:00
Anthony MOI
3c0164ef75 Python - Bump version for release 2020-02-10 16:07:32 -05:00
Anthony MOI
43a989775e Python - Improve typings 2020-02-10 13:53:07 -05:00
Anthony MOI
dd9270a406 Python - Fix example.py for GPT-2
cc @mfuntowicz `from_pretrained` takes only one argument. Do you know if
we can make this compatible otherwise?
2020-02-10 13:51:03 -05:00
Anthony MOI
8585b761d1 Python - More updates to the new API 2020-02-10 11:57:30 -05:00
Anthony MOI
505c428f72 Python - Update example.py with new API 2020-02-10 11:55:14 -05:00
MOI Anthony
07d42cfa22 Merge pull request #131 from ljos/python/pythonic_initialization
Python: Implement __new__ to make constructor available
2020-02-10 11:48:20 -05:00
Bjarte Johansen
6a4976ddd6 Implement __new__ for PreTokenizers
__new__ allows PreTokenizers to be instantiated through the python
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
f32e0c09fc Implement __new__ for PostProcessors
Allows PostProcessors to be instantiated through python class constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
03508826cb Implement __new__ on Decoders
Allow decoders to be initialized from python using the class
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
4971e9608d Implement __new__ on Trainers
__new__ allows Trainers to be initialized in the normal python
fashion.
2020-02-10 10:43:29 +01:00
Bjarte Johansen
0e5d81b400 Implement __new__ on Normalizers
__new__ allows Normalizers to be initialized as normal python
objects. This also means that Normalizers are given the correct class
name.
2020-02-10 10:43:19 +01:00
MOI Anthony
cbecaab1de Merge pull request #132 from huggingface/contributors-hall-of-fame
Add contributors-hall-of-fame
2020-02-07 21:56:34 -05:00
Clement
948214eeba Move up the binding in the readme 2020-02-07 11:11:39 -05:00
Clement
1c37045c73 Add contributors-hall-of-fame
powered by https://github.com/sourcerer-io/hall-of-fame
2020-02-07 10:57:50 -05:00
Pierric Cistac
be67d51185 node: add more infos in package.json 2020-02-05 18:07:39 -05:00
Pierric Cistac
3df188dc27 node: version 0.4.0 2020-02-05 17:38:59 -05:00
Pierric Cistac
cb8585bc4e Merge pull request #126 from huggingface/node-bindings
node: expose tokenizer configuration / truncation / padding
2020-02-05 16:53:24 -05:00
Pierric Cistac
3adf199a0c fix pad calls 2020-02-05 14:49:47 -05:00
Pierric Cistac
41fee6de3d rust: derive Copy for PaddingDirection 2020-02-05 14:44:07 -05:00
Pierric Cistac
10e2d286ca node: fix bert special tokens 2020-02-05 14:40:03 -05:00
Pierric Cistac
02ab624050 node: expose truncation/padding getters on base tokenizer 2020-02-05 14:28:53 -05:00
Pierric Cistac
51cc581f32 node: setTruncation and setPadding return the complete config 2020-02-05 14:28:53 -05:00
Pierric Cistac
a54d5f05fa node: expose tokenizers config
fix tokenizers config types
2020-02-05 14:28:53 -05:00
Pierric Cistac
2bcd47440c node: add enums for padding and truncation strategies 2020-02-05 14:28:53 -05:00