* WIP strip.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Rust StripNormalizer
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Allow to specify strip direction
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Renamed StripNormalizer to Strip
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Python binding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Make the Strip Python binding compatible with a Pythonic constructor.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Run RustFmt
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Clippy next ofc.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move lstrip and rstrip onto NormalizedString
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Implement strip() for normalizer + unit tests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add some more unit tests for edge cases.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* clippy and fmt.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Simplify strip and fix offsets
* Python - Update strip bindings with default values
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
* Expose num_added_tokens on the Python side without the need to pass an Encoding to added_tokens.
This makes it possible to compute the max sentence length for single/pair inputs without needing an Encoding structure.
Because the number of added tokens is fixed and static at compile time, this allows more flexible use of the method.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Renamed num_added_tokens to num_special_tokens_to_add.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added lookup table model mapping each string to its id in a vocab map.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* RustFmt
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Formatting.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix invalid void return on Rust side.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Python binding for LookupTable model
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Enable loading from the Python side.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Renamed LookupTable to WordLevel
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* RustFmt happy now.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* clippy happy now.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing mismatching names.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Addressing mismatching names (one missing).
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix invalid method bindings on Python side.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Introduce factory function to create a normalizer instance from the name of a Unicode normalizer.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Rename BPETokenizer to CharBPETokenizer for clarity
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Give more flexibility in the way CharBPETokenizer handles normalizer creation.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Change .pyi file to reflect the Normalizer hierarchy
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Make ByteLevelBPE as flexible for normalization as CharBPE.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added RobertaProcessor on Rust side.
Required to match the double separator token in the middle of pairs.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fix typo in RobertaProcessing method declaration
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Correctly include RobertaProcessor in the Python binding
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* RoBERTa doesn't use token_type_ids, so set everything to 0
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Attempt to make it work on the Node side too.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* fix js bindings / `npm run lint`
* Make RustFmt happy.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>