Commit Graph

54 Commits

SHA1 Message Date
d1b6b14bd7 Attempt fix workflows 2019-11-29 19:28:49 -05:00
989e9b03ca Ignore some python files 2019-11-27 12:22:01 -05:00
428890d6e0 Basic python setuptools 2019-11-27 12:21:37 -05:00
e49abab747 Python - Add Decoder/PreTokenizer standalone capabilities 2019-11-26 17:52:19 -05:00
d565bbf309 Container - Add ability to execute 2019-11-26 17:51:26 -05:00
5c6834f363 Added GitHub Action workflow for Rust
This allows for automated build & test of the library.
2019-11-26 09:47:48 +00:00
f4369b312d Python - Add ability to create custom Decoder 2019-11-25 19:14:07 -05:00
d7ba6802df Update gitignore 2019-11-25 15:35:54 -05:00
512e85dfda Update python README 2019-11-24 00:55:13 -05:00
bafdc5e157 Code style 2019-11-24 00:52:48 -05:00
6437c40235 Python - PoC Custom PreTokenizer 2019-11-24 00:52:13 -05:00
b081e6ca04 Python - Also expose default classes 2019-11-24 00:35:05 -05:00
bd1aa80d8a Python - Custom PreTokenizer backbone 2019-11-23 23:59:33 -05:00
891fc12de2 Python - Update example with new format 2019-11-22 21:09:17 -05:00
8fbe3c2662 Python - Add decoders 2019-11-22 21:08:57 -05:00
e44f52024c Python - Set a PreTokenizer in a model 2019-11-22 21:01:52 -05:00
9b71c8f8de Python - BPE construction 2019-11-22 20:57:54 -05:00
f6a9b57b5b Python - Add pre_tokenizers module 2019-11-22 20:56:50 -05:00
39a6d04c53 Improve Python bindings
This is an attempt at actually exposing the same structure that we use in the Rust lib. This will allow Python to instantiate Model/PreTokenizer/... with their own arguments, combining everything without relying on parsed kwargs.
2019-11-22 17:57:36 -05:00
663644e041 Fix ByteLevel Decoder
The join was done after replacing bytes and building subwords, which was preventing bytes across these subwords to be merged correctly. We need to join first.
2019-11-21 16:50:25 -05:00
634415c098 Add a parallel capable cache for BPE
This allows for some performance improvement in the best case scenarios (up to 40% during some tests)
2019-11-21 16:09:07 -05:00
070fd08583 Update python example 2019-11-21 11:57:57 -05:00
c28a83cdc4 Update python bindings 2019-11-21 11:55:07 -05:00
6853e6c904 Tokenizer decoding 2019-11-21 11:54:54 -05:00
2419c14e42 ByteLevel is also a Decoder 2019-11-21 11:52:55 -05:00
56e37475c3 Add Decoder to Tokenizer 2019-11-21 11:51:43 -05:00
3ec26b332c Add Tokenizer token_to_id/id_to_token 2019-11-20 17:28:28 -05:00
8b3d7d1aa0 Add vocab/merge arguments to example.py 2019-11-20 16:47:02 -05:00
98323d1f21 Update readme and fix example 2019-11-19 19:38:57 -05:00
351d526e1e Basic python bindings 2019-11-19 19:31:37 -05:00
39afc64e13 impl PreTokenizer for Whitespace 2019-11-19 19:31:37 -05:00
2d7c5f04f8 Fix readme indentation 2019-11-18 16:34:13 -05:00
1b32560067 Update readme with simple example 2019-11-18 16:31:35 -05:00
872aa86b71 Basic cli for testing 2019-11-18 15:47:35 -05:00
4e5106989f Ability to load a BPE model from files 2019-11-18 10:00:53 -05:00
0b450d62ff Add ByteLevel pre tokenizer 2019-11-17 00:40:22 -05:00
a55dccafb5 Add BPE training 2019-11-17 00:28:36 -05:00
1c7dcebca7 Add BPE tokenization 2019-11-17 00:27:30 -05:00
b2ba864248 Move whitespace pre tokenizer 2019-11-16 22:42:02 -05:00
1294f400dc Add folder structure 2019-11-16 22:40:51 -05:00
7b8b765269 Add Tokenizer interface 2019-11-16 22:36:44 -05:00
195423fe11 Rust install 2019-11-01 19:45:00 -04:00
9f15d2c165 Node readme 2019-11-01 19:44:44 -04:00
05cbb32eca Python readme 2019-11-01 19:42:36 -04:00
6d91bf4005 Update node bindings 2019-11-01 19:23:22 -04:00
fd7ec39367 Update python bindings 2019-11-01 18:56:55 -04:00
9fd10ca1c5 Simple whitespace tokenizer 2019-11-01 18:31:05 -04:00
57a1ce7e1d Node bindings backbone 2019-11-01 16:39:03 -04:00
8448d50e6f Quick improvement over python bindings 2019-11-01 16:08:10 -04:00
5d37cfde7f Python bindings backbone 2019-11-01 15:02:19 -04:00
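
Note on 663644e041 (Fix ByteLevel Decoder): the commit message says the join has to happen before bytes are replaced, because a multi-byte character can be split across subwords. The sketch below illustrates that ordering issue only; it is not the repository's Rust code. It assumes a GPT-2-style byte-to-character table, and the names `bytes_to_unicode`, `decode_join_first`, and `decode_per_token` are hypothetical helpers introduced for illustration.

```python
def bytes_to_unicode():
    """Assumed GPT-2-style table: map each byte 0..255 to a printable character."""
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Bytes without a printable form are moved above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_join_first(tokens):
    """Join all tokens, then map characters back to bytes and decode once."""
    joined = "".join(tokens)
    raw = bytes(CHAR_TO_BYTE[ch] for ch in joined)
    return raw.decode("utf-8", errors="replace")


def decode_per_token(tokens):
    """Decode every token separately, then join (the problematic order)."""
    pieces = []
    for token in tokens:
        raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
        pieces.append(raw.decode("utf-8", errors="replace"))
    return "".join(pieces)


def to_byte_level(text):
    """Helper for the example: byte-level representation of a string."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


if __name__ == "__main__":
    encoded = to_byte_level("café")
    # Split inside the two-byte sequence for "é" to mimic a subword boundary.
    tokens = [encoded[:4], encoded[4:]]
    print(decode_join_first(tokens))
    print(decode_per_token(tokens))
```

With the token boundary falling inside the two bytes of "é", decoding after the join recovers the original string, while decoding each token on its own yields replacement characters for the split sequence.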