43 Commits

SHA1 Message Date
152880ab3e Adding truncation_side within TruncationParams. (#860)
* Add truncation to enable_truncation

* Fix typo

* Adding truncation_side within `TruncationParams`.

* Node serialization of this direction param.

* Update the test.

* Fixing warnings/lint.

* Adding stuff (can't local debug :( )

* Slow loop... ;(

* Stub.py.

Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>
2021-12-28 12:37:06 +01:00
04368b1998 Truncate Right (#841)
* feat(tokenizers): add truncate test case

* !feat(tokenizer): truncate right

* refacto(tokenizers): clippy

* feat(bindings): update bindings for truncate()

* fix(tokenizers): remove unsafe code

* refacto(tokenizers): truncate direction

* truncate direction enum
* compute parts ranges beforehand
* 2n space because encoding is dropped at the end of procedure
* update bindings
* add pip install in python bindings' make test

* fix(node): clippy asks to use unwrap_or_else

* fix(node): lint

* refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)>

* refacto(bindings): add match syntax

* refacto(tokenizers): use mem::replace instead of mem::swap

* refacto(tokenizers): assign value the normal way
2021-12-23 13:34:21 +01:00
884bfb7970 Prepare node release (#794)
* Node - Update changelog for release

* Update node release to add v14 & v15

Co-authored-by: Huan (李卓桓) <zixia@zixia.net>

* Node - Update version number

* Node - Update dependencies

* Node - Lint

Co-authored-by: Huan (李卓桓) <zixia@zixia.net>
2021-09-02 09:58:01 -04:00
d3d9f2c76b words -> word_ids & sequences -> sequence_ids 2020-11-09 16:02:07 -05:00
57d162b269 Add an Encoding.sequences to allow masking 2020-11-06 10:41:56 -05:00
385d25720a Simplify the API for Encoding.token_to_XXX 2020-11-06 10:41:56 -05:00
a79cc55e08 Node - Encoding mappings handle sequence_id 2020-11-06 10:41:56 -05:00
95cc8c47ad Changed rust api for merges, that is now Vec<(String, String)> 2020-09-24 08:57:02 +02:00
26cafe0d6c Fixing eslint. 2020-09-10 14:00:53 -04:00
a16d71abd0 Node - Update bindings 2020-08-19 12:42:12 -04:00
e9a2e63a67 Node - Fix new linting errors 2020-07-24 15:44:39 -04:00
a03eba2fe9 Node - Typings proposal 2020-05-27 13:12:47 -04:00
b5247f41f1 Node - Update base tokenizer 2020-05-12 18:08:26 -04:00
4aecd82d07 Node - Improve mappings on Encoding 2020-04-16 14:23:37 -04:00
38d53a7b84 Node - Expose more bindings 2020-04-13 16:48:32 -04:00
3ad1360210 Word indices are None for special tokens 2020-04-09 09:52:02 -04:00
e9667a7b83 Node - tokenizer.postProcess bindings 2020-03-26 15:42:45 -04:00
0408567f23 Node - Merge encodings 2020-03-26 15:42:45 -04:00
70552812fe Node - Bindings for tokenized encoding 2020-03-26 15:42:45 -04:00
ce3cf78ea5 Node - Bindings for Encoding mappings 2020-03-26 15:42:45 -04:00
7dd2400214 Node - Remove addSpecialTokens from BertWordPieceTokenizer 2020-03-26 15:10:08 -04:00
d25eb075c8 Node - Finalize AddedToken support 2020-03-25 12:36:03 -04:00
f53a885fdd Node - Expand AddedToken supported use 2020-03-25 11:12:29 -04:00
2aeae555e2 Node - Expose normalize on tokenizer 2020-03-18 17:10:26 -04:00
25ef729a5a Node - Update bindings 2020-03-18 15:13:29 -04:00
3abf615a51 Node - Update bindings 2020-03-10 18:22:36 -04:00
523e173ddf Merge pull request #188 from huggingface/fix-byte-level
Fix byte level BPE offsets
2020-03-10 14:37:47 -04:00
7764d3d770 Node - Fix bindings 2020-03-10 14:31:42 -04:00
45f3eaaf72 Update bindings and typings 2020-03-10 12:28:24 -04:00
efbbfea558 Update ByteLevel PostProcessor 2020-03-10 12:05:04 -04:00
aa62c951dc Node - Update bindings 2020-03-09 22:45:33 -04:00
4510ea5ce3 node: type errors 2020-03-06 18:01:11 -05:00
a44eb2b5cd node: update bytelevel bindings 2020-03-06 17:44:45 -05:00
dae345cc6d node: add continuingSubwordPrefix to wordpiece model 2020-03-06 16:30:36 -05:00
578eddcdf9 node: expose decode / decodeBatch in BaseTokenizer 2020-03-06 16:27:27 -05:00
ffcd5c63bf node: make BPE.fromfiles async 2020-03-06 16:27:27 -05:00
fe49512d37 node: make WordPiece.fromFiles async 2020-03-06 16:06:06 -05:00
917996841d node: "proxy" raw Encoding with getters 2020-02-26 18:15:16 -05:00
f836f2109b build ts 2020-01-14 15:20:34 -05:00
a95d0e6ba1 Node - Fix import 2020-01-10 16:11:44 -05:00
24c08b2530 fix sentencepiece tokenizer name 2020-01-10 16:03:47 -05:00
80f6d58177 big big big 2020-01-10 14:49:13 -05:00
6b0935d5de first implementations draft 2020-01-10 11:53:30 -05:00