* refactor: lift cloning to caller
* refactor: do not elide lifetimes as in Rust 2018
* fix: unsound use of env::set_var, which broke with the stdlib change that makes it unsafe
It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
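A minimal sketch of the pattern described above, under stated assumptions: the variable name `EXAMPLE_PARALLELISM` and the functions `set_parallelism`, `get_parallelism` and `parse_flag` are hypothetical names chosen for illustration, not necessarily the ones used in this change. The override lives in a process-local atomic, and the environment is only ever read.

```rust
use std::env;
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical variable name, used only for this sketch.
const ENV_VAR: &str = "EXAMPLE_PARALLELISM";

// 0 = no override, 1 = overridden to false, 2 = overridden to true.
static OVERRIDE: AtomicU8 = AtomicU8::new(0);

/// Store the override inside the library instead of calling `env::set_var`.
pub fn set_parallelism(enabled: bool) {
    OVERRIDE.store(if enabled { 2 } else { 1 }, Ordering::SeqCst);
}

/// Interpret an environment value; only a few explicit "off" spellings disable the flag.
fn parse_flag(raw: &str) -> bool {
    !matches!(raw.to_lowercase().as_str(), "0" | "false" | "off" | "no")
}

/// Resolve the setting: the in-process override wins, then the
/// environment (read-only), then the default of `true`.
pub fn get_parallelism() -> bool {
    match OVERRIDE.load(Ordering::SeqCst) {
        1 => false,
        2 => true,
        _ => env::var(ENV_VAR).map(|v| parse_flag(&v)).unwrap_or(true),
    }
}
```

Callers that previously wrote to the environment would call `set_parallelism(false)` instead; since nothing mutates the process environment, no `unsafe` block is needed.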
* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, it needs to be too specific.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589)
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Upgrade pyo3 to 0.15
Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
* Upgrade pyo3 to 0.16
Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
* Install Python before running cargo clippy
* Fix clippy warnings
* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`
* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`
* Fix some test cases
Why do they change?
* Refactor and add SAFETY comments to `PyArrayUnicode`
Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`
Co-authored-by: messense <messense@icloud.com>
* First pass on automatic stubbing our python files.
* And now modifying all rust docs to be visible in Pyi files.
* Better assert fail message.
* Fixing github workflow.
* Removing types not exported anymore.
* Fixing `Tokenizer` signature.
* Disabling auto __init__.py.
* Re-enabling some types.
* Don't overwrite non automated __init__.py
* Automated most __init__.py
* Restubbing after rebase.
* Fixing env for tests.
* Install black in the env.
* Use PY35 target in stub.py
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
* Fixing a bug where long tokenizer files would be incorrectly
deserialized
- Add a bunch of tests to check deserialization behaviour
- One test also confirms current Single deserialization of Sequence.
* Better test locations for Windows + no file dependency in Python binding
Rust side.
* Addressing @n1t0's comments.
* Removing all pre_tokenizer logic from Unigram algorithm.
* Improving *a lot* the parity check.
- We can now detect a lot more errors
- Special cases have been added temporarily.
* Adding 2 new normalizers that mimic spm's default behavior.
* Adding `encoding_optimized` version of the `encode` algorithm.
- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
allocation of the full results.
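A minimal sketch of an iterator-returning `common_prefix_search`, assuming a simple `HashMap`-based trie over bytes. The `Node` and `Trie` types here are illustrative stand-ins and not the crate's actual implementation; byte keys are used only to keep the example short.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    is_leaf: bool,
}

/// Illustrative byte trie (hypothetical, not the library's type).
#[derive(Default)]
pub struct Trie {
    root: Node,
}

impl Trie {
    /// Insert a key, marking its final node as a leaf.
    pub fn push(&mut self, key: &[u8]) {
        let mut node = &mut self.root;
        for &b in key {
            node = node.children.entry(b).or_default();
        }
        node.is_leaf = true;
    }

    /// Lazily yield the length of every stored key that is a prefix of
    /// `input`, shortest first, without collecting them into a Vec.
    pub fn common_prefix_search<'a>(
        &'a self,
        input: &'a [u8],
    ) -> impl Iterator<Item = usize> + 'a {
        let mut node = &self.root;
        let mut pos = 0;
        std::iter::from_fn(move || {
            while pos < input.len() {
                // Stop as soon as the input leaves the trie.
                node = node.children.get(&input[pos])?;
                pos += 1;
                if node.is_leaf {
                    return Some(pos);
                }
            }
            None
        })
    }
}
```

Because the matches are produced on demand, a caller that only needs the first or the longest match can stop early or fold over the iterator without the full result list ever being allocated.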
* Trie<char> -> Trie<u8> Another improvement on speed.
* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.
* Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`.
- It will be used for direct compatibility with `Spm` and replace all
their custom rules by using directly the normalizer spec embedded
within spm files, removing all need for any rules for us.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.
* Fixing a bug with Precompiled normalizer.
* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.
It seems a very carefully hand-crafted trie does not prevent one from
shooting oneself in the foot. Sorry future reader.
* Keep API stable for this PR (change of the API should come later #409).
- Removed sentencepiece_model_pb2 from binding and add instructions to
make `from_spm` work.
* Adding model check in `from_spm`.
* Addressing @n1t0's comments.
* Adding a check to make sure alignments stay correct.
Also added a bit more documentation on how Precompiled works.
* Extracting `Precompiled` into its own `spm_precompiled` crate.
* Using ranges in `do_nmt`.
* WIP strip.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Rust StripNormalizer
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Allow specifying the strip direction
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Renamed StripNormalizer to Strip
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Python binding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Makes Strip python compatible with pythonic constructor.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Run RustFmt
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Clippy next ofc.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Move lstrip and rstrip on NormalizedString
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* implement strip() for normalizer + unittests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add some more unittests on edge cases.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* clippy and fmt.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Simplify strip and fix offsets
* Python - Update strip bindings with default values
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
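
A minimal sketch of direction-aware stripping as described in the Strip commits above, assuming a plain `&str` input. The `Strip` struct and its `apply` method here are hypothetical stand-ins for illustration, not the library's `NormalizedString` API. The returned start offset hints at why offsets need care when leading whitespace is removed.

```rust
/// Hypothetical stand-in for the `Strip` normalizer's configuration.
pub struct Strip {
    pub left: bool,
    pub right: bool,
}

impl Strip {
    /// Return the stripped slice together with the byte offset at which
    /// it starts in `input`, so that alignments can be preserved.
    pub fn apply<'a>(&self, input: &'a str) -> (&'a str, usize) {
        // Left strip: drop leading whitespace only if requested.
        let after_left = if self.left { input.trim_start() } else { input };
        let start = input.len() - after_left.len();
        // Right strip: drop trailing whitespace only if requested.
        let stripped = if self.right { after_left.trim_end() } else { after_left };
        (stripped, start)
    }
}
```

With both flags set, `"  hello  "` becomes `"hello"` starting at byte offset 2; with only `right` set, the leading spaces are kept and the offset stays 0.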