Commit Graph

  • 49a67824ce small wording changes (#39) Evan Pete Walsh 2020-01-07 05:33:59 -08:00
  • b06681cb1e Bump version for release Anthony MOI 2020-01-06 21:05:01 -05:00
  • 185b6f0b8b Add Sequence Normalizer Anthony MOI 2020-01-06 21:03:05 -05:00
  • 5c02bbbc4c Add basic unicode normalizers Anthony MOI 2020-01-06 20:38:42 -05:00
  • 4b9ae66419 WordPiece decoder with customizable prefix Anthony MOI 2020-01-06 20:20:42 -05:00
  • 742974f0c9 Fix default for WordPieceTrainerBuilder Anthony MOI 2020-01-06 20:08:12 -05:00
  • 772d0680b6 Python - Update all typings Anthony MOI 2020-01-06 20:03:00 -05:00
  • 0079a7a6b7 Python - Add NormalizedString + doc/typings Anthony MOI 2020-01-06 17:55:22 -05:00
  • 6de04bbaea Python - Add typings/doc for Encoding Anthony MOI 2020-01-06 17:23:04 -05:00
  • 7e9e0aa81c Python - Add Tokenizer doc with stub file Anthony MOI 2020-01-06 16:40:27 -05:00
  • 9a99e2bcb1 Python - Add missing Bpe constructor kwargs Anthony MOI 2020-01-06 16:39:59 -05:00
  • b7d0acc562 Python - Improve decode/decode_batch API Anthony MOI 2020-01-06 16:39:36 -05:00
  • 1a083a6e6f Python - Improved stub file for models Anthony MOI 2020-01-06 15:55:00 -05:00
  • 0e41e0b327 Python - Include correct packages and stubs Anthony MOI 2020-01-06 15:24:17 -05:00
  • 70a17ce330 Ignore some more python files Anthony MOI 2020-01-06 15:23:21 -05:00
  • 8723f78e6f Python - build-sdist.sh +x mode Anthony MOI 2020-01-06 14:24:08 -05:00
  • 5b8cd00d21 Ignore Python package files Anthony MOI 2020-01-06 13:04:45 -05:00
  • d7b6385566 Python - Adding some stub files Anthony MOI 2020-01-06 13:04:30 -05:00
  • 7eebd06409 Python - Improve imports Anthony MOI 2020-01-06 12:03:01 -05:00
  • e1caacfce0 Rename package for crates.io Anthony MOI 2020-01-04 23:42:32 -05:00
  • 9428b9a21b Documentation updates Anthony MOI 2020-01-04 23:33:50 -05:00
  • 627c304721 Create LICENSE MOI Anthony 2020-01-04 23:31:02 -05:00
  • 805dc58949 Update training to include new lines Anthony MOI 2020-01-03 20:23:58 -05:00
  • a1891387ed Merge pull request #38 from huggingface/wordpiece-training MOI Anthony 2020-01-03 19:56:33 -05:00
  • fab4e96b51 Python - Add bert wordpiece training example Anthony MOI 2020-01-03 16:51:39 -05:00
  • 6e3efe8954 Fix WordPiece model saving Anthony MOI 2020-01-03 16:49:27 -05:00
  • c51e340492 Python - Add WordPieceTrainer Anthony MOI 2020-01-03 16:37:36 -05:00
  • e64b54b29e Python - Update BpeTrainer interface Anthony MOI 2020-01-03 16:33:11 -05:00
  • 1dda76659f Add WordPieceTrainer Anthony MOI 2020-01-03 16:27:36 -05:00
  • 1bfe9fd0a7 Add BpeTrainerBuilder Anthony MOI 2020-01-03 16:09:47 -05:00
  • 02a89bb07f Wordpiece handles prefix customization Anthony MOI 2020-01-03 15:30:55 -05:00
  • dc8266236d Can create a WordPiece from a BPE Anthony MOI 2020-01-03 15:24:48 -05:00
  • 5141297204 BPE also handles some prefix and suffix options Anthony MOI 2020-01-03 15:00:33 -05:00
  • 7edb00c4a0 BpeTrainer handles some prefix and suffix options Anthony MOI 2020-01-03 14:40:13 -05:00
  • 703f1f16b3 Merge pull request #34 from huggingface/improve-cache MOI Anthony 2020-01-03 19:35:54 -05:00
  • 7766434ce5 Merge pull request #36 from huggingface/benchmarks MOI Anthony 2020-01-03 19:26:07 -05:00
  • 5c473aaee9 avoid re-inserted existing words into cache epwalsh 2020-01-03 15:26:11 -08:00
  • eca1c8be4b Merge branch 'master' into improve-cache epwalsh 2020-01-03 15:18:04 -08:00
  • 5a8b7a972b refactor and rename benchmarks epwalsh 2020-01-03 15:16:44 -08:00
  • 246d87dc7d add no cache benchmarks epwalsh 2020-01-03 11:56:35 -08:00
  • 1ed2a4f59b make cache optional (#37) Evan Pete Walsh 2020-01-03 11:48:13 -08:00
  • 1f961aa310 try_clear_cache -> clear_cache epwalsh 2020-01-03 11:39:19 -08:00
  • 6b238a77b9 use single tokenizer across benchmarks epwalsh 2020-01-03 08:57:32 -08:00
  • 60540b04f5 avoid unnecessary write locks epwalsh 2020-01-03 08:25:55 -08:00
  • 477ce4e473 clean up benchmarks (#26) Evan Pete Walsh 2020-01-03 07:43:51 -08:00
  • c5359ddd47 Fix benchmarks Anthony MOI 2020-01-02 20:05:28 -05:00
  • 408490e6b4 Add missing kwargs support Anthony MOI 2020-01-02 19:32:56 -05:00
  • 22e499133b Python - Expose missing BPE options at creation Anthony MOI 2020-01-02 19:30:50 -05:00
  • 04cfeea2d5 Python - ByteLevel BPE training example file Anthony MOI 2020-01-02 18:39:31 -05:00
  • 0589deb6e2 Python - Expose BpeTrainer options Anthony MOI 2020-01-02 18:09:04 -05:00
  • d3c3f5a700 Python - Expose ByteLevel alphabet Anthony MOI 2020-01-02 18:06:06 -05:00
  • f0f9aefd07 ByteLevel exposes its alphabet Anthony MOI 2020-01-02 17:51:26 -05:00
  • 7b12b3cca5 BpeTrainer handles initial alphabet Anthony MOI 2020-01-02 15:01:22 -05:00
  • c8a5d2e32a NormalizedString - Fix removal around edges Anthony MOI 2020-01-02 14:17:00 -05:00
  • 66b6211705 NormalizedString - Fix added chars at beginning Anthony MOI 2020-01-02 14:16:14 -05:00
  • 894ea1f8f0 utilize ::new() in ::default() epwalsh 2020-01-02 10:56:41 -08:00
  • 8ae0f2efdb set capacity on BPE cache, change Mutex to RwLock, create BpeBuilder (#24) Evan Pete Walsh 2020-01-02 09:26:50 -08:00
  • e3cf6a7b00 refactor benchmarks (#25) Evan Pete Walsh 2020-01-01 17:07:36 -08:00
  • 138c48d92e add benchmark on many batches epwalsh 2020-01-01 16:20:19 -08:00
  • b09511f5cf add better single threaded GPT2 benchmark epwalsh 2020-01-01 15:48:53 -08:00
  • 722b61230d BPE handles UNK token Anthony MOI 2020-01-01 14:49:03 -05:00
  • 75713ce809 Merge pull request #23 from huggingface/cache MOI Anthony 2020-01-01 14:47:28 -05:00
  • 65471b4f2c Merge branch 'master' into cache epwalsh 2020-01-01 14:10:20 -05:00
  • 9a10acc981 don't create unnecessary vectors when accessing cache epwalsh 2020-01-01 14:06:31 -05:00
  • a5c5e5840f Oops - Fix trainer Anthony MOI 2020-01-01 13:36:42 -05:00
  • a7a5f9a67f BpeTrainer handles special tokens and limiting alphabet Anthony MOI 2020-01-01 12:54:41 -05:00
  • ebf22198f3 Add benchmark framework and benches for BPE (GPT2) (#4) Evan Pete Walsh 2020-01-01 07:35:57 -08:00
  • 84c7a8623a Remove all printed logs Anthony MOI 2020-01-01 01:45:24 -05:00
  • 47e4b00e05 BpeTrainer shows some progress Anthony MOI 2020-01-01 01:28:17 -05:00
  • f3aef0e4e6 Fix BPE saving (u32 => String) Anthony MOI 2019-12-31 23:15:10 -05:00
  • 90dfdc715d Expose Tokenizer parts Anthony MOI 2019-12-31 22:57:47 -05:00
  • 90df088054 Fix ByteLevel PreTokenizer Anthony MOI 2019-12-31 15:09:51 -05:00
  • f28ca58fd9 [Fix #17] BPE & WordPiece models saving Anthony MOI 2019-12-31 13:56:28 -05:00
  • 2125e4d422 Merge pull request #21 from huggingface/dropout MOI Anthony 2019-12-30 19:39:29 -05:00
  • b21a5496a7 no cache when dropout epwalsh 2019-12-30 15:58:16 -08:00
  • a642807fde fix clippy warnings epwalsh 2019-12-30 14:23:32 -08:00
  • fdb8ffca27 fix comment epwalsh 2019-12-30 14:18:08 -08:00
  • b28c3fd04c add doc epwalsh 2019-12-30 14:15:26 -08:00
  • 0be9e5a7f0 implement dropout for BPE epwalsh 2019-12-30 14:14:26 -08:00
  • 5194daa0ce Merge pull request #20 from huggingface/docs MOI Anthony 2019-12-30 14:17:14 -05:00
  • d163bbadae remove redundant headers, other small cleanups epwalsh 2019-12-30 10:46:56 -08:00
  • 225a886382 Python - Expose Whitespace PreTokenizer Anthony MOI 2019-12-30 13:10:33 -05:00
  • 4677a09626 Python - Expose pad and truncate on Encoding Anthony MOI 2019-12-30 12:56:07 -05:00
  • 8ddb2de64e Update unicode-normalization to published crate Anthony MOI 2019-12-30 12:18:00 -05:00
  • f5327f977e Merge pull request #19 from huggingface/handle-offsets MOI Anthony 2019-12-30 10:46:30 -05:00
  • 06d515d41b Python - Add ability to retrieve a range of string Anthony MOI 2019-12-29 01:37:03 -05:00
  • 049029dc42 Python - Restore methods on Encoding Anthony MOI 2019-12-29 01:26:42 -05:00
  • 708a63514a Add ability to retrieve ranges or NormalizedString Anthony MOI 2019-12-29 01:22:16 -05:00
  • 9c574ad1b7 Python - Fix some import warnings Anthony MOI 2019-12-29 00:43:04 -05:00
  • 3779bf3e19 Python - Update example Anthony MOI 2019-12-29 00:38:37 -05:00
  • 3dcf9f763c Python - Update pre tokenizers with offsets Anthony MOI 2019-12-29 00:37:58 -05:00
  • 3f79d9d5e0 Python - Add normalizers bindings & BertNormalizer Anthony MOI 2019-12-29 00:36:09 -05:00
  • 81be029881 Fix - Handle errors during normalization Anthony MOI 2019-12-29 00:24:01 -05:00
  • 79b96dccd0 Fix lowercase/uppercase normalization Anthony MOI 2019-12-29 00:19:49 -05:00
  • 22ffa716a1 BertPreTokenizer pre tokenize only (with offsets) Anthony MOI 2019-12-29 00:12:24 -05:00
  • cda9fae992 Add BertNormalizer with offsets tracking Anthony MOI 2019-12-29 00:10:45 -05:00
  • ad9cc52d83 ByteLevel PreTokenizer handles offsets Anthony MOI 2019-12-29 00:08:42 -05:00
  • 35a8dfdd55 Whitespace PreTokenizer handles offsets Anthony MOI 2019-12-28 15:50:42 -05:00
  • be00a1e45e Improve clarity for BertProcessing Anthony MOI 2019-12-28 15:45:51 -05:00
  • d7af007539 BertProcessor handles NormalizedString merging Anthony MOI 2019-12-28 15:30:57 -05:00