Commit Graph

  • f4df7f5e2a Update Tokenizer with NormalizedString & Encoding Anthony MOI 2019-12-28 15:28:44 -05:00
  • 4afcb1ef96 PreTokenizers handle offsets Anthony MOI 2019-12-28 15:28:21 -05:00
  • 8c40c89836 Encoding uses NormalizedString Anthony MOI 2019-12-28 15:25:50 -05:00
  • 162829b7a9 Introduce NormalizedString Anthony MOI 2019-12-28 15:24:09 -05:00
  • 96ef467bbf Use forked unicode-normalization Anthony MOI 2019-12-28 15:22:52 -05:00
  • a4beecf944 WordPiece handles offsets Anthony MOI 2019-12-28 15:22:03 -05:00
  • 5682627223 BPE handles offsets Anthony MOI 2019-12-28 15:21:50 -05:00
  • 5d9848ad6c Models handles offsets Anthony MOI 2019-12-28 15:21:29 -05:00
  • 839239d3b4 Bump version Anthony MOI 2019-12-27 10:43:34 -05:00
  • bddf7ba737 Python - Fix building from wheels Anthony MOI 2019-12-27 10:39:19 -05:00
  • ffd28ba558 Bump for release Anthony MOI 2019-12-26 14:56:13 -05:00
  • 74cc6f6bde Python - Simplify padding interface Anthony MOI 2019-12-26 14:34:13 -05:00
  • d1e59e09bf Fix a bug when adding special tokens Anthony MOI 2019-12-26 14:32:50 -05:00
  • d93d4fc3cd Python - Simplify truncation interface Anthony MOI 2019-12-26 10:35:20 -05:00
  • a7734ffc9f Python - Update doc and readme for add_prefix_space Anthony MOI 2019-12-26 10:34:53 -05:00
  • 1879cb0bcb Python - change with_added_tokens as kwarg Anthony MOI 2019-12-25 22:22:35 -05:00
  • 905c1eb77e Python - update some packages Anthony MOI 2019-12-25 22:16:43 -05:00
  • 597031b973 Python - remove unused variable Anthony MOI 2019-12-25 22:16:11 -05:00
  • 9d289d357d Python - change add_prefix_space as kwarg Anthony MOI 2019-12-25 22:10:17 -05:00
  • 4bc5a7bbe7 Python - fix example Anthony MOI 2019-12-24 11:20:40 -05:00
  • cf0e8917cd Fix whitespace handling in ByteLevel Anthony MOI 2019-12-24 11:20:26 -05:00
  • 9f1421a04b remove Cargo.lock (#7) Evan Pete Walsh 2019-12-23 21:22:42 -08:00
  • c0ed873c4d simplify initialization of BpeTrainer epwalsh 2019-12-23 14:38:54 -08:00
  • fab1d4cabc Bump version for release Anthony MOI 2019-12-23 17:28:38 -05:00
  • e01d4f2052 Python - Remove misleading __repr__ Anthony MOI 2019-12-23 17:27:59 -05:00
  • 2159123d7c Fix truncate Anthony MOI 2019-12-23 17:27:43 -05:00
  • 8fb94be3d0 Merge pull request #6 from huggingface/BPE-tests MOI Anthony 2019-12-20 15:34:38 -05:00
  • 9a91016877 Merge branch 'master' into BPE-tests Evan Pete Walsh 2019-12-20 08:55:41 -08:00
  • 2266960ef7 Bump version and update Readme Anthony MOI 2019-12-20 10:26:40 -05:00
  • f2b9c30ad9 Handle vocab size with added tokens Anthony MOI 2019-12-19 20:19:56 -05:00
  • b7040e0412 Option to skip special tokens while decoding Anthony MOI 2019-12-19 20:03:02 -05:00
  • a8d68d516d Handle special tokens Anthony MOI 2019-12-19 19:48:16 -05:00
  • 7f032b62df Include the added tokens while converting tokens and ids Anthony MOI 2019-12-19 18:32:37 -05:00
  • 076ba297fb Cannot add new tokens that already exist in the vocab Anthony MOI 2019-12-19 18:32:03 -05:00
  • 6d51e7a393 add example / doc test for BPE trainer epwalsh 2019-12-19 15:28:58 -08:00
  • 69212e17e9 formatting epwalsh 2019-12-19 15:07:27 -08:00
  • a16daa78f1 add test for word merge epwalsh 2019-12-19 14:45:38 -08:00
  • 184b09e3ac add more tests epwalsh 2019-12-18 17:40:13 -08:00
  • 1dc0debe36 add initial test epwalsh 2019-12-18 16:45:11 -08:00
  • 9763282d59 Bump version for release Anthony MOI 2019-12-17 18:42:34 -05:00
  • 4d14b08afe ByteLevel handles prefix spaces Anthony MOI 2019-12-17 18:35:40 -05:00
  • 6766585965 Python - Do not expose non working features of Encoding Anthony MOI 2019-12-17 17:43:42 -05:00
  • 0a3d4a86a9 Python - Update bindings for BertPreTokenizer Anthony MOI 2019-12-17 17:40:56 -05:00
  • e54eee7657 BasicPreTokenizer => BertPreTokenizer Anthony MOI 2019-12-17 17:37:13 -05:00
  • 1b66d87fd3 BasicPreTokenizer handles do_basic_tokenize for Bert Anthony MOI 2019-12-17 17:35:26 -05:00
  • 3f95248d6d Python - Truncation & padding bindings Anthony MOI 2019-12-17 17:24:53 -05:00
  • 5729d3656a Tokenizer handles Truncation and Padding Anthony MOI 2019-12-17 15:15:58 -05:00
  • 4c51399b00 An Encoding can be padded Anthony MOI 2019-12-17 14:23:37 -05:00
  • 08eb163415 Bump version for release Anthony MOI 2019-12-16 19:38:33 -05:00
  • d80f752ec9 Python - Add some missing Encoding bindings Anthony MOI 2019-12-16 19:38:18 -05:00
  • cc9f9107fa Update cli with some example added tokens Anthony MOI 2019-12-16 18:50:40 -05:00
  • 036ee603f4 Python - Update example Anthony MOI 2019-12-16 18:50:21 -05:00
  • e4ce050b73 Fix BertProcessing overflowing usize Anthony MOI 2019-12-16 18:46:58 -05:00
  • 93a74aa53a Python - Expose PostProcessors Anthony MOI 2019-12-16 18:46:14 -05:00
  • 1a90cc96e5 Python - Can add tokens Anthony MOI 2019-12-16 18:45:26 -05:00
  • f92e73b8f3 Ability to decode with added tokens Anthony MOI 2019-12-16 18:22:46 -05:00
  • 4c7f6e1f04 Ability to encode with added tokens Anthony MOI 2019-12-16 18:22:17 -05:00
  • 45c2d25a9f Tokenizer can have added tokens Anthony MOI 2019-12-16 18:21:51 -05:00
  • ee883c3fc7 Bump version for release Anthony MOI 2019-12-13 18:18:07 -05:00
  • ed7e3999d2 Python - Fix some clippy warnings Anthony MOI 2019-12-13 18:17:51 -05:00
  • 1a604cdbee Revert wrong change Anthony MOI 2019-12-13 18:13:16 -05:00
  • 6b1028d550 Add clippy warnings + fix all of them Anthony MOI 2019-12-13 17:52:31 -05:00
  • 24139d7324 Improve some Python classes Anthony MOI 2019-12-13 16:35:25 -05:00
  • 4914e6285e add path to manifest epwalsh 2019-12-13 13:56:39 -08:00
  • 7f42417482 fix yaml epwalsh 2019-12-13 13:51:12 -08:00
  • 7e6fd92018 fix formatting epwalsh 2019-12-13 13:48:59 -08:00
  • 03406d0b54 add rustfmt and clippy to CI pipeline epwalsh 2019-12-13 13:44:53 -08:00
  • dc48cc3680 fix a couple linting warnings epwalsh 2019-12-13 13:09:25 -08:00
  • 1c4593cad4 Python - Remove warning on unused Token Anthony MOI 2019-12-13 15:28:48 -05:00
  • e93cc62a71 Python - Handle kwargs for bert modules Anthony MOI 2019-12-13 15:28:29 -05:00
  • 3355be89cd Python - Update examples and improve errors Anthony MOI 2019-12-13 14:37:29 -05:00
  • 7cf4b3a6cd Python - Rewrite PyDecoder and PyPreTokenizer Anthony MOI 2019-12-13 12:20:25 -05:00
  • 2a0ad97809 Python - Update API to allow failure Anthony MOI 2019-12-13 12:20:05 -05:00
  • 1c7be358b7 Python - Better error conversions Anthony MOI 2019-12-13 12:14:27 -05:00
  • 7711946882 Add some tests for Encoding Anthony MOI 2019-12-12 19:03:42 -05:00
  • da45a1d6d0 Extract encoding Anthony MOI 2019-12-12 18:03:58 -05:00
  • 5bf8baec68 Prepare tokenizer module for multiple files Anthony MOI 2019-12-12 17:24:54 -05:00
  • 34ffe6dc1a Add Bert PostProcessor Anthony MOI 2019-12-12 17:22:21 -05:00
  • f4cd78e98a Add truncation ability Anthony MOI 2019-12-12 17:19:31 -05:00
  • 13df36ca55 fix error display epwalsh 2019-12-12 07:02:46 -08:00
  • 3bdb849bb3 Fix cli + whitespace Anthony MOI 2019-12-11 07:31:28 -05:00
  • 4807894da6 BPE can fail Anthony MOI 2019-12-11 07:30:51 -05:00
  • fbebbec585 Wordpiece can fail Anthony MOI 2019-12-11 07:30:27 -05:00
  • a929a99e05 Steps of the pipeline can fail Anthony MOI 2019-12-11 07:18:38 -05:00
  • 7cb2fe2ea0 Bump version Anthony MOI 2019-12-10 18:01:07 -05:00
  • b4b31d73cd Expose vocabulary size Anthony MOI 2019-12-10 16:20:31 -05:00
  • 6c294c60b0 Python - Add Encoding repr + improve example Anthony MOI 2019-12-10 15:18:07 -05:00
  • 99773d9ce4 Python - Add encoding getters Anthony MOI 2019-12-10 15:17:41 -05:00
  • 8cedc5f1f6 Update Python bindings for Encoding Anthony MOI 2019-12-10 12:38:36 -05:00
  • 132a0fc4b4 Improved Tokenizer interface Anthony MOI 2019-12-10 11:41:54 -05:00
  • 018f57f054 Python - Update example Anthony MOI 2019-12-09 12:51:05 -05:00
  • 849272d44f Python - add missing modules exports Anthony MOI 2019-12-09 12:50:53 -05:00
  • 3979096c52 Python - add BasicPreTokenizer Anthony MOI 2019-12-09 12:50:09 -05:00
  • d60d24a378 Python - Add WordPiece model Anthony MOI 2019-12-09 12:49:44 -05:00
  • 5eba30835d Python - Add WordPiece decoder Anthony MOI 2019-12-09 12:49:17 -05:00
  • 0fb4b268f1 Add missing packages Anthony MOI 2019-12-06 19:31:04 -05:00
  • ea9b75d6cd Add WordPiece decoder for Bert Anthony MOI 2019-12-06 19:30:42 -05:00
  • 3abdfaf852 Add WordPiece model for bert Anthony MOI 2019-12-06 19:30:16 -05:00
  • 030698530c Add BasicPreTokenizer for bert Anthony MOI 2019-12-06 19:28:30 -05:00
  • c4bda752bd Fix wheels building on old versions Anthony MOI 2019-12-03 17:41:43 -05:00