Commit Graph

216 Commits

Author SHA1 Message Date
Anthony MOI
f263d7651f Python - RustFmt 2020-02-18 15:07:34 -05:00
Anthony MOI
8e9fae6be4 Python - Add check-style to Makefile 2020-02-18 11:11:07 -05:00
Anthony MOI
81be207819 Python - Black auto formatting 2020-02-18 10:45:36 -05:00
Anthony MOI
4706151c32 Python - Add Makefile with Black formatting 2020-02-18 10:45:10 -05:00
Anthony MOI
1509f747af Python - Uniformize implementations parameters 2020-02-18 10:27:10 -05:00
MOI Anthony
3512bd3400 Merge pull request #149 from colinclement/master
Allow dropout option in ByteLevelBPETokenizer
2020-02-18 09:59:40 -05:00
Morgan Funtowicz
891dd4adb8 Fix invalid num_added_tokens method call in BaseTokenizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-17 15:32:34 +01:00
Funtowicz Morgan
bb8321ac0d Add Strip normalizer (#140)
* WIP strip.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rust StripNormalizer

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow specifying strip direction

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed StripNormalizer to Strip

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Python binding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Strip python compatible with pythonic constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Run RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Clippy next ofc.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move lstrip and rstrip on NormalizedString

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Implement strip() for normalizer + unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add some more unittests on edge cases.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy and fmt.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Simplify strip and fix offsets

* Python - Update strip bindings with default values

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-17 11:26:40 +01:00
Colin Clement
e591cfce7b pass through dropout option in ByteLevelBPETokenizer 2020-02-15 01:58:55 +00:00
MOI Anthony
3cac26cdb2 Merge pull request #147 from huggingface/wordpiece-cleanup
Wordpiece Decoder cleanup
2020-02-14 13:12:15 -05:00
Funtowicz Morgan
c4bac6aeeb Expose num_added_tokens on Python side (#146)
* Expose num_added_tokens on Python side without the need to pass an Encoding to added_tokens.

This makes it possible to compute the max sentence length for single/pair inputs without needing an Encoding structure.
Since the number of added tokens is fixed and static during compilation, this allows more flexible usage of the method.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed num_added_tokens to num_special_tokens_to_add.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-14 10:55:20 +00:00
Anthony MOI
1907b74d1c Python - Bindings for Wordpiece decoder's cleanup 2020-02-13 17:50:37 -05:00
Anthony MOI
5bd93ee822 Python - hotfix BertWordPieceTokenizer decoder 2020-02-13 16:31:00 -05:00
Anthony MOI
bbbd97c7e1 Python - Bump version for release 2020-02-11 08:15:11 -05:00
Anthony MOI
08ce105195 Python - Hotfix WordPieceTrainer constructor 2020-02-11 08:13:57 -05:00
Anthony MOI
c1ddfdac8c Python - bump version for release 2020-02-10 23:23:27 -05:00
Anthony MOI
3c0164ef75 Python - Bump version for release 2020-02-10 16:07:32 -05:00
Anthony MOI
43a989775e Python - Improve typings 2020-02-10 13:53:07 -05:00
Anthony MOI
dd9270a406 Python - Fix example.py for GPT-2
cc @mfuntowicz `from_pretrained` takes only one argument. Do you know if
we can make this compatible otherwise?
2020-02-10 13:51:03 -05:00
Anthony MOI
8585b761d1 Python - More updates to the new API 2020-02-10 11:57:30 -05:00
Anthony MOI
505c428f72 Python - Update example.py with new API 2020-02-10 11:55:14 -05:00
Bjarte Johansen
6a4976ddd6 Implement __new__ for PreTokenizers
__new__ allows PreTokenizers to be instantiated through the python
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
f32e0c09fc Implement __new__ for PostProcessors
Allows PostProcessors to be instantiated through python class constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
03508826cb Implement __new__ on Decoders
Allow decoders to be initialized from python using the class
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
4971e9608d Implement __new__ on Trainers
__new__ allows Trainers to be initialized in the normal python
fashion.
2020-02-10 10:43:29 +01:00
Bjarte Johansen
0e5d81b400 Implement __new__ on Normalizers
__new__ allows Normalizers to be initialized as normal python
objects. This also means that Normalizers are given the correct class
name.
2020-02-10 10:43:19 +01:00
Pierric Cistac
3adf199a0c fix pad calls 2020-02-05 14:49:47 -05:00
Anthony MOI
9745786b89 Bump versions for release 2020-02-05 13:55:51 -05:00
Anthony MOI
89f6db28f0 update cargo.lock for indicatif 2020-02-05 13:38:12 -05:00
Anthony MOI
8decd020cb Python - Provide mapping to original offsets
As requested on #81
2020-02-05 13:33:19 -05:00
Anthony MOI
42c4691e4d Python - Update Bert default special tokens
Closes #106
2020-02-05 12:55:01 -05:00
MOI Anthony
a1284f6220 Merge pull request #128 from huitseeker/warts
Maintenance : simplifications & update
2020-02-05 12:28:22 -05:00
Funtowicz Morgan
8200112e9b Introduce WordLevel model for TransformerXL (#125)
* Added lookup table model mapping string to id present in a vocab map.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid void return on Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Python binding for LookupTable model

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Enable loading from Python's side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed LookupTable to WordLevel

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names (one missing).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-05 16:51:35 +00:00
François Garillot
42bc3cb21f Simplify a few Option / Result pattern-matches 2020-02-05 07:11:47 -08:00
Funtowicz Morgan
6165910ca6 Char based delimiter splitting - TransfoXL (#114)
* WIP delimiter splitter

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind on Python side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add missing delimiter parameter in CharDelimiterSplit constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to provide CharDelimiterSplit for node.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Apply Rust formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* fix bindings node

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
2020-02-04 16:23:00 +00:00
Anthony MOI
53637d4d88 Python - Also add missing special tokens for SentencePiece 2020-02-03 12:52:39 -05:00
Anthony MOI
9e0b971f20 Python - Add missing special tokens in implementations classes 2020-02-03 12:49:40 -05:00
MOI Anthony
a48b337d7b Merge pull request #99 from kdexd/get-vocab-size
Expose get_vocab_size in tokenizer python API.
2020-02-03 11:52:29 -05:00
Anthony MOI
b90104e705 Update Python bindings 2020-02-03 11:38:52 -05:00
Funtowicz Morgan
e365c1992b Improve flexibility in some Python binding (#107)
* Fix invalid method bindings on Python side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Introduce factory function to create normalizer instance from the name of a Unicode normalizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rename BPETokenizer to CharBPETokenizer for clarity

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give more flexibility in the way CharBPETokenizer handles normalizers creation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Change .pyi file to reflect Normalizer hierarchy

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make ByteLevelBPE as flexible for normalization as CharBPE.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-03 10:41:33 +00:00
Funtowicz Morgan
6524f09e99 Roberta PostProcessor (#111)
* Added RobertaProcessor on Rust side.

Required to match the double separator token in the middle of pairs.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix typo in RobertaProcessing method declaration

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correctly include RobertaProcessor in the Python binding

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Roberta doesn't use token_type_ids so let's set everything to 0

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to make it work on Node side too.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* fix js bindings / `npm run lint`

* Make RustFmt happy.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
2020-02-03 10:39:48 +00:00
Karan Desai
b027c63c37 Expose get_vocab_size in tokenizer python API. 2020-02-03 00:41:05 -05:00
Pierric Cistac
05275a9391 python: fix inverted normalized/original string range 2020-01-31 11:09:55 -05:00
Pierric Cistac
880cd7199b python: align Cargo.lock package version 2020-01-28 16:44:48 -05:00
Anthony MOI
0105021280 Bump version for Python 2020-01-22 16:07:03 -05:00
MOI Anthony
327de00d71 Merge pull request #95 from huggingface/vocab-serialization
save BPE vocab in order of ID
2020-01-22 15:49:48 -05:00
epwalsh
3a9badd2e0 save vocab in order of ID 2020-01-21 13:32:13 -08:00
Morgan Funtowicz
0b782e4507 Removed invalid class-level variable declaration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-21 15:10:47 -05:00
Anthony MOI
da7e629e4a Bump Python version for release 2020-01-20 09:14:46 -05:00
Anthony MOI
395f605fd2 Use WhitespaceSplit for BPETokenizer 2020-01-17 18:33:29 -05:00