Commit Graph

1844 Commits

Author SHA1 Message Date
Anthony MOI
6437c40235 Python - PoC Custom PreTokenizer 2019-11-24 00:52:13 -05:00
Anthony MOI
b081e6ca04 Python - Also expose default classes 2019-11-24 00:35:05 -05:00
Anthony MOI
bd1aa80d8a Python - Custom PreTokenizer backbone 2019-11-23 23:59:33 -05:00
Anthony MOI
891fc12de2 Python - Update example with new format 2019-11-22 21:09:17 -05:00
Anthony MOI
8fbe3c2662 Python - Add decoders 2019-11-22 21:08:57 -05:00
Anthony MOI
e44f52024c Python - Set a PreTokenizer in a model 2019-11-22 21:01:52 -05:00
Anthony MOI
9b71c8f8de Python - BPE construction 2019-11-22 20:57:54 -05:00
Anthony MOI
f6a9b57b5b Python - Add pre_tokenizers module 2019-11-22 20:56:50 -05:00
Anthony MOI
39a6d04c53 Improve Python bindings
This is an attempt at actually exposing the same structure that we use in the Rust lib. This will allow Python to instantiate Model/PreTokenizer/... with their own arguments, combining everything without relying on parsed kwargs.
2019-11-22 17:57:36 -05:00
Anthony MOI
663644e041 Fix ByteLevel Decoder
The join was done after replacing bytes and building subwords, which was preventing bytes across these subwords to be merged correctly. We need to join first.
2019-11-21 16:50:25 -05:00
Anthony MOI
634415c098 Add a parallel capable cache for BPE
This allows for some performance improvement in the best case scenarios (up to 40% during some tests)
2019-11-21 16:09:07 -05:00
Anthony MOI
070fd08583 Update python example 2019-11-21 11:57:57 -05:00
Anthony MOI
c28a83cdc4 Update python bindings 2019-11-21 11:55:07 -05:00
Anthony MOI
6853e6c904 Tokenizer decoding 2019-11-21 11:54:54 -05:00
Anthony MOI
2419c14e42 ByteLevel is also a Decoder 2019-11-21 11:52:55 -05:00
Anthony MOI
56e37475c3 Add Decoder to Tokenizer 2019-11-21 11:51:43 -05:00
Anthony MOI
3ec26b332c Add Tokenizer token_to_id/id_to_token 2019-11-20 17:28:28 -05:00
Anthony MOI
8b3d7d1aa0 Add vocab/merge arguments to example.py 2019-11-20 16:47:02 -05:00
Anthony MOI
98323d1f21 Update readme and fix example 2019-11-19 19:38:57 -05:00
Anthony MOI
351d526e1e Basic python bindings 2019-11-19 19:31:37 -05:00
Anthony MOI
39afc64e13 impl PreTokenizer for Whitespace 2019-11-19 19:31:37 -05:00
MOI Anthony
2d7c5f04f8 Fix readme indentation 2019-11-18 16:34:13 -05:00
Anthony MOI
1b32560067 Update readme with simple example 2019-11-18 16:31:35 -05:00
Anthony MOI
872aa86b71 Basic cli for testing 2019-11-18 15:47:35 -05:00
Anthony MOI
4e5106989f Ability to load a BPE model from files 2019-11-18 10:00:53 -05:00
Anthony MOI
0b450d62ff Add ByteLevel pre tokenizer 2019-11-17 00:40:22 -05:00
Anthony MOI
a55dccafb5 Add BPE training 2019-11-17 00:28:36 -05:00
Anthony MOI
1c7dcebca7 Add BPE tokenization 2019-11-17 00:27:30 -05:00
Anthony MOI
b2ba864248 Move whitespace pre tokenizer 2019-11-16 22:42:02 -05:00
Anthony MOI
1294f400dc Add folder structure 2019-11-16 22:40:51 -05:00
Anthony MOI
7b8b765269 Add Tokenizer interface 2019-11-16 22:36:44 -05:00
Anthony MOI
195423fe11 Rust install 2019-11-01 19:45:00 -04:00
Anthony MOI
9f15d2c165 Node readme 2019-11-01 19:44:44 -04:00
Anthony MOI
05cbb32eca Python readme 2019-11-01 19:42:36 -04:00
Anthony MOI
6d91bf4005 Update node bindings 2019-11-01 19:23:22 -04:00
Anthony MOI
fd7ec39367 Update python bindings 2019-11-01 18:56:55 -04:00
Anthony MOI
9fd10ca1c5 Simple whitespace tokenizer 2019-11-01 18:31:05 -04:00
Anthony MOI
57a1ce7e1d Node bindings backbone 2019-11-01 16:39:03 -04:00
Anthony MOI
8448d50e6f Quick improvement over python bindings 2019-11-01 16:08:10 -04:00
Anthony MOI
5d37cfde7f Python bindings backbone 2019-11-01 15:02:19 -04:00
Anthony MOI
5f57ee9f0e Global gitignore 2019-11-01 14:55:08 -04:00
Anthony MOI
7dbee7157f add gitignore 2019-11-01 13:56:08 -04:00
Anthony MOI
2b72a5737f Basic bin+lib setup 2019-11-01 13:54:17 -04:00
Anthony MOI
b9b519c84a Initial commit 2019-11-01 13:52:44 -04:00