Commit Graph

194 Commits

Author SHA1 Message Date
Bjarte Johansen
f32e0c09fc Implement __new__ for PostProcessors
Allows PostProcessors to be instantiated through the Python class constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
03508826cb Implement __new__ on Decoders
Allow decoders to be initialized from python using the class
constructor.
2020-02-10 10:43:53 +01:00
Bjarte Johansen
4971e9608d Implement __new__ on Trainers
__new__ allows Trainers to be initialized in the normal python
fashion.
2020-02-10 10:43:29 +01:00
Bjarte Johansen
0e5d81b400 Implement __new__ on Normalizers
__new__ allows Normalizers to be initialized as normal python
objects. This also means that Normalizers are given the correct class
name.
2020-02-10 10:43:19 +01:00
Pierric Cistac
3adf199a0c fix pad calls 2020-02-05 14:49:47 -05:00
Anthony MOI
9745786b89 Bump versions for release 2020-02-05 13:55:51 -05:00
Anthony MOI
89f6db28f0 update cargo.lock for indicatif 2020-02-05 13:38:12 -05:00
Anthony MOI
8decd020cb Python - Provide mapping to original offsets
As requested on #81
2020-02-05 13:33:19 -05:00
Anthony MOI
42c4691e4d Python - Update Bert default special tokens
Closes #106
2020-02-05 12:55:01 -05:00
MOI Anthony
a1284f6220 Merge pull request #128 from huitseeker/warts
Maintenance : simplifications & update
2020-02-05 12:28:22 -05:00
Funtowicz Morgan
8200112e9b Introduce WordLevel model for TransformerXL (#125)
* Added lookup table model mapping string to id present in a vocab map.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid void return on Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Python binding for LookupTable model

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Enable loading from Python's side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed LookupTable to WordLevel

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names (one missing).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-05 16:51:35 +00:00
François Garillot
42bc3cb21f Simplify a few Option / Result pattern-matches 2020-02-05 07:11:47 -08:00
Funtowicz Morgan
6165910ca6 Char based delimiter splitting - TransfoXL (#114)
* WIP delimiter splitter

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind on Python side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add missing delimiter parameter in CharDelimiterSplit constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to provide CharDelimiterSplit for node.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Apply Rust formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* fix bindings node

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
2020-02-04 16:23:00 +00:00
Anthony MOI
53637d4d88 Python - Also add missing special tokens for SentencePiece 2020-02-03 12:52:39 -05:00
Anthony MOI
9e0b971f20 Python - Add missing special tokens in implementations classes 2020-02-03 12:49:40 -05:00
MOI Anthony
a48b337d7b Merge pull request #99 from kdexd/get-vocab-size
Expose get_vocab_size in tokenizer python API.
2020-02-03 11:52:29 -05:00
Anthony MOI
b90104e705 Update Python bindings 2020-02-03 11:38:52 -05:00
Funtowicz Morgan
e365c1992b Improve flexibility in some Python binding (#107)
* Fix invalid method bindings on Python side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Introduce factory function to create a normalizer instance from the name of a Unicode normalizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rename BPETokenizer to CharBPETokenizer for clarity

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give more flexibility in the way CharBPETokenizer handles normalizers creation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Change .pyi file to reflect the Normalizer hierarchy

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make ByteLevelBPE as flexible for normalization as CharBPE.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-03 10:41:33 +00:00
Funtowicz Morgan
6524f09e99 Roberta PostProcessor (#111)
* Added RobertaProcessor on Rust side.

Required to match the double separator token in the middle of pairs.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix typo in RobertaProcessing method declaration

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correctly include RobertaProcessor in the Python binding

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Roberta doesn't use token_type_ids so let's set everything to 0

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to make it work on the Node side too.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* fix js bindings / `npm run lint`

* Make RustFmt happy.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
2020-02-03 10:39:48 +00:00
Karan Desai
b027c63c37 Expose get_vocab_size in tokenizer python API. 2020-02-03 00:41:05 -05:00
Pierric Cistac
05275a9391 python: fix inverted normalized/original string range 2020-01-31 11:09:55 -05:00
Pierric Cistac
880cd7199b python: align Cargo.lock package version 2020-01-28 16:44:48 -05:00
Anthony MOI
0105021280 Bump version for Python 2020-01-22 16:07:03 -05:00
MOI Anthony
327de00d71 Merge pull request #95 from huggingface/vocab-serialization
save BPE vocab in order of ID
2020-01-22 15:49:48 -05:00
epwalsh
3a9badd2e0 save vocab in order of ID 2020-01-21 13:32:13 -08:00
Morgan Funtowicz
0b782e4507 Removed invalid class-level variable declaration.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-21 15:10:47 -05:00
Anthony MOI
da7e629e4a Bump Python version for release 2020-01-20 09:14:46 -05:00
Anthony MOI
395f605fd2 Use WhitespaceSplit for BPETokenizer 2020-01-17 18:33:29 -05:00
Anthony MOI
9c408011ae Python - Bindings for WhitespaceSplit 2020-01-17 18:15:14 -05:00
Ivan Echevarria
e82722a9c2 Fix typo in Python binding README
Trailing paren causes an error
2020-01-16 17:10:48 -08:00
MOI Anthony
457e6c9932 Merge pull request #71 from huggingface/python_example_fix
Use the same vocabs in python's example.py
2020-01-15 10:07:34 -05:00
Morgan Funtowicz
374f944e32 Use the same vocabs/merges for Python and Rust comparison.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-15 11:57:34 +01:00
Morgan Funtowicz
4839154145 Remove kwargs mapping on Tokenizer decode/decode_batch as there is only one possible arg.
This is suggested by the current issue https://github.com/huggingface/tokenizers/issues/54#issuecomment-574104841.

kwargs cannot be passed positionally; they have to be named. Replacing kwargs with the actual skip_special_tokens parameter
allows both the named and the positional syntax.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-15 11:16:01 +01:00
Morgan Funtowicz
894f887444 Updated train_bert_wordpiece.py as well.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-14 13:32:02 +01:00
Morgan Funtowicz
7caf9fd823 Updated train_bytelevel_bpe.py to use the high level Python API.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-14 12:00:50 +01:00
Anthony MOI
fc9e81d4ab Fix split on special tokens & bump version 2020-01-12 02:35:45 -05:00
Anthony MOI
dd569020c1 Bump python version for release 2020-01-10 13:49:26 -05:00
Anthony MOI
89e0d90c8a Python - Final fix of the typings 2020-01-10 13:30:35 -05:00
Pierric Cistac
56878a8e43 fix : 2020-01-10 13:30:35 -05:00
Pierric Cistac
958883af74 fix imports in root __init__.pyi 2020-01-10 13:30:35 -05:00
MOI Anthony
b491c0b8c4 Update Python Readme 2020-01-10 12:18:16 -05:00
Anthony MOI
b27737d97c Python - Typings update 2020-01-10 10:06:24 -05:00
thomwolf
d8f3fba245 fix training and wordpiece 2020-01-10 10:47:50 +01:00
thomwolf
1a802cb484 fix typos 2020-01-10 10:47:36 +01:00
Anthony MOI
d46ea842c2 Python - IndexableString accepts tuples directly 2020-01-10 00:32:30 -05:00
Morgan Funtowicz
be10f542ce Added SentencePiece and YouTokenToMe model extractors.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-08 22:55:00 +01:00
Anthony MOI
3af2a43cae Hotfix Python bindings 2020-01-08 16:20:05 -05:00
Anthony MOI
ef21c9a7b0 Hotfix for new Builder
cc @epwalsh
2020-01-08 16:19:51 -05:00
Anthony MOI
c7d2800131 Python - Add model saving to base tokenizer 2020-01-08 14:44:17 -05:00
Anthony MOI
bbe31f9237 Quick README update 2020-01-08 14:07:48 -05:00