45c2d25a9f
Tokenizer can have added tokens
2019-12-16 18:21:51 -05:00
ee883c3fc7
Bump version for release
2019-12-13 18:18:07 -05:00
ed7e3999d2
Python - Fix some clippy warnings
2019-12-13 18:17:51 -05:00
1a604cdbee
Revert wrong change
2019-12-13 18:13:16 -05:00
6b1028d550
Add clippy warnings + fix all of them
2019-12-13 17:53:46 -05:00
24139d7324
Improve some Python classes
2019-12-13 17:53:46 -05:00
4914e6285e
add path to manifest
2019-12-13 17:53:32 -05:00
7f42417482
fix yaml
2019-12-13 17:53:32 -05:00
7e6fd92018
fix formatting
2019-12-13 17:53:32 -05:00
03406d0b54
add rustfmt and clippy to CI pipeline
2019-12-13 17:53:32 -05:00
dc48cc3680
fix a couple linting warnings
2019-12-13 17:53:32 -05:00
1c4593cad4
Python - Remove warning on unused Token
2019-12-13 15:28:48 -05:00
e93cc62a71
Python - Handle kwargs for bert modules
2019-12-13 15:28:29 -05:00
3355be89cd
Python - Update examples and improve errors
2019-12-13 14:37:29 -05:00
7cf4b3a6cd
Python - Rewrite PyDecoder and PyPreTokenizer
2019-12-13 12:20:25 -05:00
2a0ad97809
Python - Update API to allow failure
2019-12-13 12:20:05 -05:00
1c7be358b7
Python - Better error conversions
2019-12-13 12:14:27 -05:00
7711946882
Add some tests for Encoding
2019-12-12 19:03:42 -05:00
da45a1d6d0
Extract encoding
2019-12-12 18:04:42 -05:00
5bf8baec68
Prepare tokenizer module for multiple files
2019-12-12 18:04:42 -05:00
34ffe6dc1a
Add Bert PostProcessor
2019-12-12 18:04:42 -05:00
f4cd78e98a
Add truncation ability
2019-12-12 18:04:42 -05:00
13df36ca55
fix error display
2019-12-12 10:50:36 -05:00
3bdb849bb3
Fix cli + whitespace
2019-12-11 07:31:28 -05:00
4807894da6
BPE can fail
2019-12-11 07:30:51 -05:00
fbebbec585
Wordpiece can fail
2019-12-11 07:30:27 -05:00
a929a99e05
Steps of the pipeline can fail
2019-12-11 07:18:38 -05:00
7cb2fe2ea0
Bump version
2019-12-10 18:01:07 -05:00
b4b31d73cd
Expose vocabulary size
2019-12-10 16:20:31 -05:00
6c294c60b0
Python - Add Encoding repr + improve example
2019-12-10 15:18:07 -05:00
99773d9ce4
Python - Add encoding getters
2019-12-10 15:17:41 -05:00
8cedc5f1f6
Update Python bindings for Encoding
2019-12-10 12:38:36 -05:00
132a0fc4b4
Improved Tokenizer interface
2019-12-10 11:41:54 -05:00
018f57f054
Python - Update example
2019-12-09 12:51:05 -05:00
849272d44f
Python - add missing modules exports
2019-12-09 12:50:53 -05:00
3979096c52
Python - add BasicPreTokenizer
2019-12-09 12:50:09 -05:00
d60d24a378
Python - Add WordPiece model
2019-12-09 12:49:44 -05:00
5eba30835d
Python - Add WordPiece decoder
2019-12-09 12:49:17 -05:00
0fb4b268f1
Add missing packages
2019-12-06 19:31:04 -05:00
ea9b75d6cd
Add WordPiece decoder for Bert
2019-12-06 19:30:42 -05:00
3abdfaf852
Add WordPiece model for bert
2019-12-06 19:30:16 -05:00
030698530c
Add BasicPreTokenizer for bert
2019-12-06 19:28:30 -05:00
c4bda752bd
Fix wheels building on old versions
The CI env already had an old version of setuptools installed. This
version didn't support Markdown for the long description, so it built
an invalid wheel file, and twine complained.
cf. https://github.com/huggingface/tokenizers/runs/331944386
2019-12-03 17:41:43 -05:00
c46ec97855
Update README
2019-12-03 17:26:20 -05:00
75232c0f06
Fix setup.py
2019-12-03 16:20:20 -05:00
499f5507df
Bump versions for 0.0.3 release
2019-12-03 16:11:45 -05:00
ec2ed483a3
Improve python readme with training example
2019-12-03 16:11:03 -05:00
eaafb22511
Add bindings for Trainer in Python
2019-12-03 15:54:15 -05:00
310a2af76b
Add BPE empty constructor
2019-12-03 15:39:54 -05:00
0324beea57
BpeTrainer is a Trainer
2019-12-03 15:39:33 -05:00