Anthony MOI
4b596e19dd
Rust - Improve training progress for multiple files
2020-03-03 11:04:24 -05:00
Anthony MOI
8e791791d1
Python - prepare for release
2020-03-02 14:56:42 -05:00
Anthony MOI
4deeb9511f
Update CHANGELOGs
2020-03-02 14:37:17 -05:00
Anthony MOI
f8f0702d98
Fix LongestFirst truncation strategy
2020-02-29 16:26:13 -05:00
Anthony MOI
657f8b6c15
Rust & Python - Update CHANGELOGs
2020-02-26 11:30:44 -05:00
Anthony MOI
3b10d640d5
Rust & Python - Update CHANGELOGs
2020-02-26 10:51:40 -05:00
Anthony MOI
2425fe877d
Python - Update CHANGELOG
2020-02-26 09:31:17 -05:00
Anthony MOI
61b4c9c30a
Python - Add missing tokens to BertWordPieceTokenizer
2020-02-26 09:21:54 -05:00
Anthony MOI
440e8e9bd9
Python - Bump version for release
2020-02-24 16:08:49 -05:00
Anthony MOI
be08d9574c
Python - Add Changelog
2020-02-24 10:12:50 -05:00
Anthony MOI
999088ef94
Python - Bump version for release
2020-02-24 09:56:08 -05:00
Morgan Funtowicz
817b760ab9
Make name parameter Optional[str] on BaseTokenizer
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 14:57:43 +01:00
Morgan Funtowicz
d274a7691d
Avoid breaking changes and let parameter name be Optional.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 14:56:59 +01:00
Morgan Funtowicz
0fc8be9d69
Formatting for python binding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 00:17:44 +01:00
Morgan Funtowicz
f88a6b40ac
Make parameter name on Model.save() optional.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 00:01:32 +01:00
Anthony MOI
11dd6c8bae
Python - Bump version for release
2020-02-18 18:49:11 -05:00
Anthony MOI
41929462c7
Python - Add classifiers
2020-02-18 18:48:21 -05:00
Anthony MOI
d8a73c89a7
Python - Add Encoding length
2020-02-18 18:24:13 -05:00
Anthony MOI
d48fdbe057
Python - Only add special tokens when in-vocabulary
2020-02-18 17:27:27 -05:00
Anthony MOI
5daf1eea86
Python - Replace last BPETokenizer occurrences
2020-02-18 16:25:59 -05:00
Anthony MOI
f263d7651f
Python - RustFmt
2020-02-18 15:07:34 -05:00
Anthony MOI
8e9fae6be4
Python - Add check-style to Makefile
2020-02-18 11:11:07 -05:00
Anthony MOI
81be207819
Python - Black auto formatting
2020-02-18 10:45:36 -05:00
Anthony MOI
4706151c32
Python - Add Makefile with Black formatting
2020-02-18 10:45:10 -05:00
Anthony MOI
1509f747af
Python - Uniformize implementations parameters
2020-02-18 10:27:10 -05:00
MOI Anthony
3512bd3400
Merge pull request #149 from colinclement/master
Allow dropout option in ByteLevelBPETokenizer
2020-02-18 09:59:40 -05:00
Morgan Funtowicz
891dd4adb8
Fix invalid num_added_tokens method call in BaseTokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-17 15:32:34 +01:00
Funtowicz Morgan
bb8321ac0d
Add Strip normalizer (#140)
* WIP strip.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Rust StripNormalizer
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Allow to specify strip direction
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Renamed StripNormalizer to Strip
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Python binding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Makes Strip python compatible with pythonic constructor.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Run RustFmt
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Clippy next ofc.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move lstrip and rstrip on NormalizedString
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* implement strip() for normalizer + unittests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add some more unittests on edge cases.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* clippy and fmt.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Simplify strip and fix offsets
* Python - Update strip bindings with default values
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-17 11:26:40 +01:00
Colin Clement
e591cfce7b
pass through dropout option in ByteLevelBPETokenizer
2020-02-15 01:58:55 +00:00
MOI Anthony
3cac26cdb2
Merge pull request #147 from huggingface/wordpiece-cleanup
Wordpiece Decoder cleanup
2020-02-14 13:12:15 -05:00
Funtowicz Morgan
c4bac6aeeb
Expose num_added_tokens on Python side (#146)
* Expose num_added_tokens on Python side without the need to pass an Encoding to added_tokens.
This allows computing the max sentence length for single/pair inputs without needing an Encoding structure.
As the number of added tokens is fixed and static during compilation, it allows more flexible usage of the method.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Renamed num_added_tokens to num_special_tokens_to_add.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-14 10:55:20 +00:00
Anthony MOI
1907b74d1c
Python - Bindings for Wordpiece decoder's cleanup
2020-02-13 17:50:37 -05:00
Anthony MOI
5bd93ee822
Python - hotfix BertWordPieceTokenizer decoder
2020-02-13 16:31:00 -05:00
Anthony MOI
bbbd97c7e1
Python - Bump version for release
2020-02-11 08:15:11 -05:00
Anthony MOI
08ce105195
Python - Hotfix WordPieceTrainer constructor
2020-02-11 08:13:57 -05:00
Anthony MOI
c1ddfdac8c
Python - bump version for release
2020-02-10 23:23:27 -05:00
Anthony MOI
3c0164ef75
Python - Bump version for release
2020-02-10 16:07:32 -05:00
Anthony MOI
43a989775e
Python - Improve typings
2020-02-10 13:53:07 -05:00
Anthony MOI
dd9270a406
Python - Fix example.py for GPT-2
cc @mfuntowicz `from_pretrained` takes only one argument. Do you know if
we can make this compatible otherwise?
2020-02-10 13:51:03 -05:00
Anthony MOI
8585b761d1
Python - More updates to the new API
2020-02-10 11:57:30 -05:00
Anthony MOI
505c428f72
Python - Update example.py with new API
2020-02-10 11:55:14 -05:00
Bjarte Johansen
6a4976ddd6
Implement __new__ for PreTokenizers
__new__ allows PreTokenizers to be instantiated through the python
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
f32e0c09fc
Implement __new__ for PostProcessors
Allows PostProcessors to be instantiated through python class constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
03508826cb
Implement __new__ on Decoders
Allow decoders to be initialized from python using the class
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
4971e9608d
Implement __new__ on Trainers
__new__ allows Trainers to be initialized in the normal python
fashion.
2020-02-10 10:43:29 +01:00
Bjarte Johansen
0e5d81b400
Implement __new__ on Normalizers
__new__ allows Normalizers to be initialized as normal python
objects. This also means that Normalizers are given the correct class
name.
2020-02-10 10:43:19 +01:00
Pierric Cistac
3adf199a0c
fix pad calls
2020-02-05 14:49:47 -05:00
Anthony MOI
9745786b89
Bump versions for release
2020-02-05 13:55:51 -05:00
Anthony MOI
89f6db28f0
update cargo.lock for indicatif
2020-02-05 13:38:12 -05:00
Anthony MOI
8decd020cb
Python - Provide mapping to original offsets
As requested on #81
2020-02-05 13:33:19 -05:00