- Deduplication: Removes duplicate spaces within strings
- Punctuation: Splits punctuation characters as isolated tokens
- Sequence: Applies a list of pretokenizers one after the other, each running on the output of the previous
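For illustration, here is roughly how such pre-tokenizers compose through the Python bindings. This is a minimal sketch: `Punctuation`, `Whitespace`, and `Sequence` are existing pre-tokenizers in the bindings, while the exact name and availability of the duplicate-space pre-tokenizer may differ, so `Whitespace` stands in for it here.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Punctuation, Sequence, Whitespace

# Sequence applies each pre-tokenizer in turn, each one refining the previous splits.
pre_tokenizer = Sequence([
    Whitespace(),   # splits on whitespace (runs of spaces produce no tokens)
    Punctuation(),  # splits punctuation characters out as isolated tokens
])

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizer

# Inspect the splits on a small example.
print(pre_tokenizer.pre_tokenize_str("Hello,   world!"))
```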
* Derive Clone on Tokenizer and AddedVocabulary.
* Replace Container with Arc wrapper for Decoders.
* Prefix Rust Decoder types with Py.
* Rename PyDecoder to CustomDecoder.
* Change the panic when serializing a custom decoder into an exception.
* Re-enable training with cloneable Tokenizer.
* Remove unsound Container, use Arc wrappers instead.
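For context on the CustomDecoder piece: on the Python side a custom decoder is just an object whose `decode` method the Rust layer calls through the Arc wrapper. A minimal sketch, assuming the `Decoder.custom(...)` entry point exposed by the bindings; note that saving a tokenizer that carries such a custom component raises an exception rather than panicking.

```python
from typing import List

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.decoders import Decoder

class MyDecoder:
    """Illustrative custom decoder: joins tokens back into a single string."""
    def decode(self, tokens: List[str]) -> str:
        return " ".join(tokens)

tokenizer = Tokenizer(BPE())
# Assumed entry point wrapping a Python object as a decoder.
tokenizer.decoder = Decoder.custom(MyDecoder())

# Serializing a tokenizer that holds a custom decoder is expected to raise
# an exception (previously this was a panic on the Rust side).
try:
    tokenizer.save("tokenizer.json")
except Exception as err:
    print("custom decoders cannot be serialized:", err)
```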
* Prefix the Python types in Rust with Py; rename PyPretokenizer to CustomPretokenizer.
* Remove unsound Container wrappers; replace them with Arc.
* Change the panic on trying to (de-)serialize a custom pretokenizer into an exception.
* Implement the changes required by the generic Model in Tokenizer.
* Temporarily disable training in Python since Clone can't be
derived for Model until all components have been replaced.
* Prefix Python types in Rust with Py.
* WIP strip.
* Rust StripNormalizer
* Allow specifying the strip direction
* Renamed StripNormalizer to Strip
* Added Python binding.
* Make the Strip Python binding use a pythonic constructor.
* Run RustFmt
* Clippy next ofc.
* Move lstrip and rstrip onto NormalizedString
* Implement strip() for the normalizer + unit tests.
* Add some more unit tests for edge cases.
* clippy and fmt.
* Simplify strip and fix offsets
* Python - Update strip bindings with default values
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
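On the Python side this work surfaces as the `Strip` normalizer, whose direction can be restricted to one side; a minimal sketch (stripping both sides by default is an assumption based on the "default values" update above):

```python
from tokenizers.normalizers import Strip

# Strip whitespace on both sides (assumed default).
both = Strip()
print(both.normalize_str("  hello world  "))        # -> "hello world"

# Restrict the direction to the left side only.
left_only = Strip(left=True, right=False)
print(left_only.normalize_str("  hello world  "))   # -> "hello world  "
```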
* Added a lookup table model mapping strings to the ids present in a vocab map.
* RustFmt
* Formatting.
* Fix invalid void return on Rust side.
* Python binding for LookupTable model
* Enable loading from the Python side.
* Renamed LookupTable to WordLevel
* RustFmt happy now.
* clippy happy now.
* Addressing mismatching names.
* Addressing mismatching names (one missing).
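The WordLevel model is the plain vocabulary lookup described above: every pre-tokenized token is looked up directly in the vocab. A minimal usage sketch; the exact constructor signature (vocab dict vs. vocab file, the `unk_token` name) has varied across versions, so treat it as an assumption:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# A tiny in-memory vocab mapping each known string to an id.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

print(tokenizer.encode("hello world").ids)  # -> [1, 2]
print(tokenizer.encode("hello there").ids)  # unknown words fall back to the unk id
```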
* Added RobertaProcessing on the Rust side.
Required to match the double separator token in the middle of pairs.
* Fix typo in RobertaProcessing method declaration
* Correctly include RobertaProcessing in the Python binding
* Roberta doesn't use token_type_ids, so let's set everything to 0
* Attempt to make it work on the Node side too.
* fix js bindings / `npm run lint`
* Make RustFmt happy.
Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
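RobertaProcessing adds the RoBERTa-style special tokens around single sequences and pairs, including the double separator between the two members of a pair, and leaves every token_type_id at 0 as noted above. A minimal sketch from Python; the vocab, special tokens, and ids are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import RobertaProcessing

vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "hello": 4, "world": 5}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# sep and cls are (token, id) pairs matching the vocab above.
tokenizer.post_processor = RobertaProcessing(sep=("</s>", 2), cls=("<s>", 0))

encoded = tokenizer.encode("hello", "world")
print(encoded.tokens)    # -> ['<s>', 'hello', '</s>', '</s>', 'world', '</s>']
print(encoded.type_ids)  # -> all zeros, since token_type_ids are not used
```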
This is an attempt at exposing the same structure that we use in the Rust lib. This will allow Python to instantiate Model/PreTokenizer/... components with their own arguments and combine everything without relying on parsed kwargs.
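Concretely, the intent is that each component is constructed with its own arguments and then plugged into a Tokenizer, mirroring the Rust structure. A minimal sketch of that composition; the specific components chosen here are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase, Strip
from tokenizers.normalizers import Sequence as NormalizerSequence
from tokenizers.pre_tokenizers import Punctuation, Whitespace
from tokenizers.pre_tokenizers import Sequence as PreTokenizerSequence

# Each component is instantiated with its own arguments...
vocab = {"<unk>": 0, "hello": 1, "world": 2, "!": 3}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))

# ...and then attached to the Tokenizer, with no parsed kwargs involved.
tokenizer.normalizer = NormalizerSequence([Strip(), Lowercase()])
tokenizer.pre_tokenizer = PreTokenizerSequence([Whitespace(), Punctuation()])

print(tokenizer.encode("  Hello world!  ").tokens)  # -> ['hello', 'world', '!']
```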