1870 Commits

SHA1 Message Date
cc9f9107fa Update cli with some example added tokens 2019-12-16 18:50:40 -05:00
036ee603f4 Python - Update example 2019-12-16 18:50:21 -05:00
e4ce050b73 Fix BertProcessing overflowing usize 2019-12-16 18:46:58 -05:00
93a74aa53a Python - Expose PostProcessors 2019-12-16 18:46:14 -05:00
1a90cc96e5 Python - Can add tokens 2019-12-16 18:45:26 -05:00
f92e73b8f3 Ability to decode with added tokens 2019-12-16 18:22:46 -05:00
4c7f6e1f04 Ability to encode with added tokens 2019-12-16 18:22:17 -05:00
45c2d25a9f Tokenizer can have added tokens 2019-12-16 18:21:51 -05:00
ee883c3fc7 Bump version for release 2019-12-13 18:18:07 -05:00
ed7e3999d2 Python - Fix some clippy warnings 2019-12-13 18:17:51 -05:00
1a604cdbee Revert wrong change 2019-12-13 18:13:16 -05:00
6b1028d550 Add clippy warnings + fix all of them 2019-12-13 17:53:46 -05:00
24139d7324 Improve some Python classes 2019-12-13 17:53:46 -05:00
4914e6285e add path to manifest 2019-12-13 17:53:32 -05:00
7f42417482 fix yaml 2019-12-13 17:53:32 -05:00
7e6fd92018 fix formatting 2019-12-13 17:53:32 -05:00
03406d0b54 add rustfmt and clippy to CI pipeline 2019-12-13 17:53:32 -05:00
dc48cc3680 fix a couple linting warnings 2019-12-13 17:53:32 -05:00
1c4593cad4 Python - Remove warning on unused Token 2019-12-13 15:28:48 -05:00
e93cc62a71 Python - Handle kwargs for bert modules 2019-12-13 15:28:29 -05:00
3355be89cd Python - Update examples and improve errors 2019-12-13 14:37:29 -05:00
7cf4b3a6cd Python - Rewrite PyDecoder and PyPreTokenizer 2019-12-13 12:20:25 -05:00
2a0ad97809 Python - Update API to allow failure 2019-12-13 12:20:05 -05:00
1c7be358b7 Python - Better error conversions 2019-12-13 12:14:27 -05:00
7711946882 Add some tests for Encoding 2019-12-12 19:03:42 -05:00
da45a1d6d0 Extract encoding 2019-12-12 18:04:42 -05:00
5bf8baec68 Prepare tokenizer module for multiple files 2019-12-12 18:04:42 -05:00
34ffe6dc1a Add Bert PostProcessor 2019-12-12 18:04:42 -05:00
f4cd78e98a Add truncation ability 2019-12-12 18:04:42 -05:00
13df36ca55 fix error display 2019-12-12 10:50:36 -05:00
3bdb849bb3 Fix cli + whitespace 2019-12-11 07:31:28 -05:00
4807894da6 BPE can fail 2019-12-11 07:30:51 -05:00
fbebbec585 Wordpiece can fail 2019-12-11 07:30:27 -05:00
a929a99e05 Steps of the pipeline can fail 2019-12-11 07:18:38 -05:00
7cb2fe2ea0 Bump version 2019-12-10 18:01:07 -05:00
b4b31d73cd Expose vocabulary size 2019-12-10 16:20:31 -05:00
6c294c60b0 Python - Add Encoding repr + improve example 2019-12-10 15:18:07 -05:00
99773d9ce4 Python - Add encoding getters 2019-12-10 15:17:41 -05:00
8cedc5f1f6 Update Python bindings for Encoding 2019-12-10 12:38:36 -05:00
132a0fc4b4 Improved Tokenizer interface 2019-12-10 11:41:54 -05:00
018f57f054 Python - Update example 2019-12-09 12:51:05 -05:00
849272d44f Python - add missing modules exports 2019-12-09 12:50:53 -05:00
3979096c52 Python - add BasicPreTokenizer 2019-12-09 12:50:09 -05:00
d60d24a378 Python - Add WordPiece model 2019-12-09 12:49:44 -05:00
5eba30835d Python - Add WordPiece decoder 2019-12-09 12:49:17 -05:00
0fb4b268f1 Add missing packages 2019-12-06 19:31:04 -05:00
ea9b75d6cd Add WordPiece decoder for Bert 2019-12-06 19:30:42 -05:00
3abdfaf852 Add WordPiece model for bert 2019-12-06 19:30:16 -05:00
030698530c Add BasicPreTokenizer for bert 2019-12-06 19:28:30 -05:00
c4bda752bd Fix wheels building on old versions 2019-12-03 17:41:43 -05:00
The CI env already had an installed version of setuptools. This old
version didn't support markdown for long description. So it built a
wrong wheel file, and twine complained.
cf. https://github.com/huggingface/tokenizers/runs/331944386