46 Commits

Author SHA1 Message Date
Nicolas Patry
7ed7f0f26a Adding 3 new PreTokenizers:
- Deduplication : Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pretokenizers iteratively
2020-08-31 14:05:39 -04:00
Anthony MOI
f92c9955e7 Python - Update bindings 2020-08-19 12:42:12 -04:00
Sebastian Pütz
d62adf7195 Remove Container, changes to PyDecoder, cloneable Tokenizer.
* derive Clone on Tokenizer and AddedVocabulary.
* Replace Container with Arc wrapper for Decoders.
* Prefix Rust Decoder types with Py.
* Rename PyDecoder to CustomDecoder.
* Change panic in serializing custom decoder to exception.
* Re-enable training with cloneable Tokenizer.
* Remove unsound Container, use Arc wrappers instead.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
11e86a16c5 Remove Container from PostProcessors, replace with Arc.
* prefix the Python types in Rust with Py.
* remove unsound Container wrappers, replace with Arc.
2020-08-04 15:59:33 -04:00
Sebastian Pütz
b411443128 Remove Container from PreTokenizers, replace with Arc.
* prefix the Python types in Rust with Py, rename PyPretokenizer
  to CustomPretokenizer
* remove unsound Container wrappers, replace with Arc
* change panic on trying to (de-)serialize custom pretokenizer to
  exception
2020-08-04 15:59:33 -04:00
Sebastian Pütz
08b8c48127 Remove Container from Normalizers, replace with Arc.
* prefix the Python types in Rust with Py
* remove unsound Container wrappers, replace with Arc
2020-08-04 15:59:33 -04:00
Sebastian Pütz
83a52c8080 Replace Model and Trainer Containers.
* Implement changes necessary from generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be
  derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
2020-08-04 15:59:33 -04:00
Anthony MOI
8bf482cecc Improve parallelism tracking and warning 2020-07-06 13:05:14 -04:00
Anthony MOI
bb668bc439 Try with target_family = unix 2020-06-23 16:52:21 -04:00
Anthony MOI
ae743f5dc1 Python - Automatically disable parallelism after fork 2020-06-22 20:31:52 -04:00
Anthony MOI
c65d53892d Python - Add bindings for new AddedToken options 2020-03-24 20:58:45 -04:00
Anthony MOI
7e9003ccb7 Python - Update bindings 2020-03-09 18:37:03 -04:00
Anthony MOI
52180a9179 Python - Add ByteLevel PostProcessor 2020-03-06 17:44:44 -05:00
Anthony MOI
2393506dc7 Python - Add ByteLevel Normalizer 2020-03-06 17:44:03 -05:00
Funtowicz Morgan
bb8321ac0d Add Strip normalizer (#140)
* WIP strip.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rust StripNormalizer

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to specify strip direction

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed StripNormalizer to Strip

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Python binding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Strip python compatible with pythonic constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Run RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Clippy next ofc.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move lstrip and rstrip on NormalizedString

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* implement strip() for normalizer + unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add some more unittests on edge cases.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy and fmt.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Simplify strip and fix offsets

* Python - Update strip bindings with default values

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-17 11:26:40 +01:00
Funtowicz Morgan
8200112e9b Introduce WordLevel model for TransformerXL (#125)
* Added lookup table model mapping string to id present in a vocab map.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid void return on Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Python binding for LookupTable model

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Enable loading from Python's side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed LookupTable to WordLevel

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* RustFmt happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy happy now.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing mismatching names (one missing).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-05 16:51:35 +00:00
Funtowicz Morgan
6165910ca6 Char based delimiter splitting - TransfoXL (#114)
* WIP delimiter splitter

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind on Python side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add missing delimiter parameter in CharDelimiterSplit constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to provide CharDelimiterSplit for node.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Apply Rust formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* fix bindings node

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
2020-02-04 16:23:00 +00:00
Funtowicz Morgan
6524f09e99 Roberta PostProcessor (#111)
* Added RobertaProcessor on Rust side.

Required to match the double separator token in the middle of pairs.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix typo in RobertaProcessing method declaration

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correctly include RobertaProcessor in the Python binding

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Roberta doesn't use token_type_ids so let's set everything to 0

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to make it work on Node side too.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* fix js bindings / `npm run lint`

* Make RustFmt happy.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
2020-02-03 10:39:48 +00:00
Anthony MOI
9c408011ae Python - Bindings for WhitespaceSplit 2020-01-17 18:15:14 -05:00
Anthony MOI
88711d5717 Python - IndexableString in Encoding 2020-01-08 00:06:57 -05:00
Anthony MOI
243a45af40 Add BPEDecoder 2020-01-07 19:56:49 -05:00
Anthony MOI
5bc1e2ee05 Add Lowercase Normalizer 2020-01-07 19:40:19 -05:00
Anthony MOI
cbdd2cf423 Python - add Metaspace decoder 2020-01-07 18:40:18 -05:00
Anthony MOI
eaa23ac8e6 Add the Metaspace PreTokenizer 2020-01-07 12:59:59 -05:00
Anthony MOI
185b6f0b8b Add Sequence Normalizer 2020-01-06 21:03:05 -05:00
Anthony MOI
5c02bbbc4c Add basic unicode normalizers 2020-01-06 20:38:42 -05:00
Anthony MOI
0079a7a6b7 Python - Add NormalizedString + doc/typings 2020-01-06 17:55:22 -05:00
Anthony MOI
c51e340492 Python - Add WordPieceTrainer 2020-01-03 19:37:29 -05:00
Anthony MOI
225a886382 Python - Expose Whitespace PreTokenizer 2019-12-30 13:10:33 -05:00
Anthony MOI
3f79d9d5e0 Python - Add normalizers bindings & BertNormalizer 2019-12-29 00:36:09 -05:00
Anthony MOI
0a3d4a86a9 Python - Update bindings for BertPreTokenizer 2019-12-17 17:40:56 -05:00
Anthony MOI
93a74aa53a Python - Expose PostProcessors 2019-12-16 18:46:14 -05:00
Anthony MOI
1c7be358b7 Python - Better error conversions 2019-12-13 12:14:27 -05:00
Anthony MOI
8cedc5f1f6 Update Python bindings for Encoding 2019-12-10 12:38:36 -05:00
Anthony MOI
849272d44f Python - add missing modules exports 2019-12-09 12:50:53 -05:00
Anthony MOI
eaafb22511 Add bindings for Trainer in Python 2019-12-03 15:54:15 -05:00
Anthony MOI
b081e6ca04 Python - Also expose default classes 2019-11-24 00:35:05 -05:00
Anthony MOI
8fbe3c2662 Python - Add decoders 2019-11-22 21:08:57 -05:00
Anthony MOI
f6a9b57b5b Python - Add pre_tokenizers module 2019-11-22 20:56:50 -05:00
Anthony MOI
39a6d04c53 Improve Python bindings
This is an attempt at actually exposing the same structure that we use in the Rust lib. This will allow Python to instantiate Model/PreTokenizer/... with their own arguments, combining everything without relying on parsed kwargs.
2019-11-22 17:57:36 -05:00
Anthony MOI
c28a83cdc4 Update python bindings 2019-11-21 11:55:07 -05:00
Anthony MOI
3ec26b332c Add Tokenizer token_to_id/id_to_token 2019-11-20 17:28:28 -05:00
Anthony MOI
351d526e1e Basic python bindings 2019-11-19 19:31:37 -05:00
Anthony MOI
fd7ec39367 Update python bindings 2019-11-01 18:56:55 -04:00
Anthony MOI
8448d50e6f Quick improvement over python bindings 2019-11-01 16:08:10 -04:00
Anthony MOI
5d37cfde7f Python bindings backbone 2019-11-01 15:02:19 -04:00