519 Commits

Author SHA1 Message Date
Bjarte Johansen
f32e0c09fc Implement __new__ for PostProcessors
Allows PostProcessors to be instantiated through the Python class constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
03508826cb Implement __new__ on Decoders
Allow decoders to be initialized from Python using the class
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
4971e9608d Implement __new__ on Trainers
__new__ allows Trainers to be initialized in the normal Python
fashion.
2020-02-10 10:43:29 +01:00
Bjarte Johansen
0e5d81b400 Implement __new__ on Normalizers
__new__ allows Normalizers to be initialized as normal Python
objects. This also means that Normalizers are given the correct class
name.
2020-02-10 10:43:19 +01:00
Pierric Cistac
be67d51185 node: add more info in package.json 2020-02-05 18:07:39 -05:00
Pierric Cistac
3df188dc27 node: version 0.4.0 2020-02-05 17:38:59 -05:00
Pierric Cistac
cb8585bc4e Merge pull request #126 from huggingface/node-bindings
node: expose tokenizer configuration / truncation / padding
2020-02-05 16:53:24 -05:00
Pierric Cistac
3adf199a0c fix pad calls 2020-02-05 14:49:47 -05:00
Pierric Cistac
41fee6de3d rust: derive Copy for PaddingDirection 2020-02-05 14:44:07 -05:00
Pierric Cistac
10e2d286ca node: fix bert special tokens 2020-02-05 14:40:03 -05:00
Pierric Cistac
02ab624050 node: expose truncation/padding getters on base tokenizer 2020-02-05 14:28:53 -05:00
Pierric Cistac
51cc581f32 node: setTruncation and setPadding return the complete config 2020-02-05 14:28:53 -05:00
Pierric Cistac
a54d5f05fa node: expose tokenizers config
fix tokenizers config types
2020-02-05 14:28:53 -05:00
Pierric Cistac
2bcd47440c node: add enums for padding and truncation strategies 2020-02-05 14:28:53 -05:00
Anthony MOI
3b2414c200 Fix indentation in README for consistency 2020-02-05 14:15:25 -05:00
Anthony MOI
32e6856c6c Ignore rust-toolchain when publishing 2020-02-05 14:12:28 -05:00
Anthony MOI
e2e9cff606 Add rust-toolchain 2020-02-05 14:10:46 -05:00
Anthony MOI
9745786b89 Bump versions for release 2020-02-05 13:55:51 -05:00
Anthony MOI
89f6db28f0 update cargo.lock for indicatif 2020-02-05 13:38:12 -05:00
Anthony MOI
8decd020cb Python - Provide mapping to original offsets
As requested on #81
2020-02-05 13:33:19 -05:00
Anthony MOI
42c4691e4d Python - Update Bert default special tokens
Closes #106
2020-02-05 12:55:01 -05:00
MOI Anthony
a1284f6220 Merge pull request #128 from huitseeker/warts
Maintenance: simplifications & update
2020-02-05 12:28:22 -05:00
Funtowicz Morgan
8200112e9b Introduce WordLevel model for TransformerXL (#125)
* Added lookup table model mapping string to id present in a vocab map.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid void return on Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Python binding for LookupTable model

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Enable loading from Python's side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed LookupTable to WordLevel

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names (one missing).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-05 16:51:35 +00:00
François Garillot
d4f71e50ad update indicatif 2020-02-05 07:11:47 -08:00
François Garillot
42bc3cb21f Simplify a few Option / Result pattern-matches 2020-02-05 07:11:47 -08:00
Pierric Cistac
9770be5661 node: fix encoding.getSpecialTokensMask type 2020-02-04 16:59:46 -05:00
Funtowicz Morgan
6165910ca6 Char based delimiter splitting - TransfoXL (#114)
* WIP delimiter splitter

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind on Python side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add missing delimiter parameter in CharDelimiterSplit constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to provide CharDelimiterSplit for node.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Apply Rust formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* fix bindings node

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
2020-02-04 16:23:00 +00:00
MOI Anthony
3adb220973 Merge pull request #124 from huggingface/fix-overflowing-padding
rust: fix padding on overflowings
2020-02-03 19:01:07 -05:00
Pierric Cistac
bd0f52f3d1 rust: fix padding on overflowings
shadowing of `pad_length` made it useless on overflowings
2020-02-03 18:32:49 -05:00
Pierric Cistac
7051480c33 node: expose more methods in base tokenizer 2020-02-03 17:51:53 -05:00
Pierric Cistac
220bd0d9df node: uniformize tests semantic 2020-02-03 17:51:53 -05:00
Pierric Cistac
acef252dac node: add special tokens in tokenizers implementations 2020-02-03 17:49:51 -05:00
Anthony MOI
53637d4d88 Python - Also add missing special tokens for SentencePiece 2020-02-03 12:52:39 -05:00
Anthony MOI
9e0b971f20 Python - Add missing special tokens in implementations classes 2020-02-03 12:49:40 -05:00
Pierric Cistac
4940f26b65 node: fix build error handling 2020-02-03 12:07:49 -05:00
MOI Anthony
a48b337d7b Merge pull request #99 from kdexd/get-vocab-size
Expose get_vocab_size in tokenizer python API.
2020-02-03 11:52:29 -05:00
MOI Anthony
0094393610 Merge pull request #77 from huggingface/improve-truncation
Improve truncation
2020-02-03 11:49:46 -05:00
Pierric Cistac
e55905126d Fix js overflowing tests 2020-02-03 11:41:09 -05:00
Anthony MOI
9fd64a7863 Update bert processing and padding 2020-02-03 11:38:52 -05:00
Anthony MOI
81457c0241 Node - Actually keep the previous name 2020-02-03 11:38:52 -05:00
Anthony MOI
b90104e705 Update Python bindings 2020-02-03 11:38:52 -05:00
Anthony MOI
ffda63cd33 Update node bindings 2020-02-03 11:38:52 -05:00
Anthony MOI
c2978457ae Handle merging two Encoding and their overflowings 2020-02-03 11:38:52 -05:00
Anthony MOI
4a5d2b1053 Handle padding of the overflowings 2020-02-03 11:38:52 -05:00
Anthony MOI
68f99bb822 Improve the truncation of an Encoding 2020-02-03 11:38:52 -05:00
Pierric Cistac
78e26905a7 Merge pull request #109 from huggingface/node-bindings
node: add tokenizer truncation / padding bindings
2020-02-03 11:38:05 -05:00
Pierric Cistac
75f56a0adc node: add some padding / truncation tests 2020-02-03 11:31:30 -05:00
Pierric Cistac
680eed15e7 node: add basic test on tokenizer methods 2020-02-03 11:31:30 -05:00
Pierric Cistac
461052c06f node: add disablePadding and disableTruncation in Tokenizer 2020-02-03 11:31:30 -05:00
Pierric Cistac
7e36239d74 node: add setPadding in Tokenizer 2020-02-03 11:31:30 -05:00