53 Commits

Author SHA1 Message Date
6ea758872d Unsound call of set_var (#1664)
* refactor: lift cloning to caller

* refactor: do not elide lifetimes as in Rust 2018

* fix: unsound use of env::set_var, which broke after the stdlib change that made it unsafe

It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
2024-10-25 15:44:30 +02:00
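
The approach described in the message for 6ea758872d (hold an internal copy of the setting, and only ever read the environment) could look roughly like the sketch below. Every name in it (`PARALLELISM_OVERRIDE`, `set_parallelism`, `parallelism`, and the `EXAMPLE_PARALLELISM` variable) is hypothetical and chosen for illustration; it is not taken from the commit and is not the library's actual code.

```rust
use std::env;
use std::sync::RwLock;

// Hypothetical process-wide override. `None` means "no override was set",
// in which case the environment is consulted (read-only).
static PARALLELISM_OVERRIDE: RwLock<Option<bool>> = RwLock::new(None);

/// Record the caller's preference inside the library instead of calling
/// `env::set_var`, which is not safe to use while other threads may be
/// reading the environment.
pub fn set_parallelism(value: bool) {
    *PARALLELISM_OVERRIDE.write().unwrap() = Some(value);
}

/// Resolve the effective setting: the internal override wins, otherwise
/// fall back to reading the (hypothetical) `EXAMPLE_PARALLELISM` variable.
pub fn parallelism() -> bool {
    if let Some(value) = *PARALLELISM_OVERRIDE.read().unwrap() {
        return value;
    }
    match env::var("EXAMPLE_PARALLELISM") {
        // Treat a few common "falsy" spellings as disabled.
        Ok(s) => !matches!(s.to_lowercase().as_str(), "" | "0" | "false" | "off"),
        // Variable not set (or not valid Unicode): default to enabled.
        Err(_) => true,
    }
}
```

With this shape, a caller that previously wrote to the environment would call `set_parallelism(false)` instead, and every reader goes through `parallelism()`, so the process environment is never mutated.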
49dafd707e Fix strip python type (#1602)
* update

* the fix

* Revert "update"

This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.

* add a test and rebase

* style

* oops
2024-08-07 15:36:28 +02:00
bded212356 Support None to reset pre_tokenizers and normalizers, and index sequences (#1590)
* initial commit

* support None

* fix clippy

* cleanup

* clean?

* propagate to pre_tokenizer

* fix test

* fix rust tests

* fix node

* propagate to decoder and post processor

* fix calls

* lint

* fmt

* node be happy I am fixing you

* add a small test

* styling

* style merge

* fix merge test

* fmt

* nits

* update test
2024-08-07 12:52:35 +02:00
ab9c7ded8b Using serde (serde_pyo3) to get __str__ and __repr__ easily. (#1588)
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.

* Putting it within tokenizers, it needs to be too specific.

* Clippy is our friend.

* Ruff.

* Update the tests.

* Pretty sure this is wrong (#1589)

* Adding support for ellipsis.

* Fmt.

* Ruff.

* Fixing tokenizer.

---------

Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com>
2024-08-07 12:08:29 +02:00
a010f6b75c Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc Using serde (serde_pyo3) to get __str__ and __repr__ easily. 2024-08-02 18:41:54 +02:00
4ea2f235b0 Add bytelevel normalizer to fix decode when adding tokens to BPE (#1555)
* feature dependent test

* nit about 嗎

* update

* actually fix it

* update the test

add it

fix

* stub

* Update tokenizers/src/pre_tokenizers/byte_level.rs

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* skip failing test

* add normalizer to init

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2024-07-15 12:12:03 +02:00
d5a8cc7a49 PyO3 0.21. (#1494)
* PyO3 0.21.

* Upgraded everything.

* Rustfmt.
2024-04-16 13:49:52 +02:00
540bf2eb01 pyo3: update to 0.19 (#1322)
* Bump pyo3 dependency versions

* Fix deprecation warnings from pyo3

---------

Co-authored-by: Mike Lui <mikelui@meta.com>
2023-08-16 18:40:32 +02:00
d2c8190a0f Creating normalizers.Prepend (To be used instead of Metaspace). (#1194)
* Creating `normalizers.Prepend` (To be used instead of `Metaspace`).

* Linting + stub.

* Fixing pickling/unpickling by setting a default.

* Black.
2023-03-24 00:33:31 +01:00
5c18ec5ff5 pyo3 v0.18 migration (#1173)
* pyo3 v0.18 migration

* Fix formatting issues of black
2023-03-08 11:27:47 +01:00
11bb2e00f2 Add python 3.11 to manylinux buildwheels (#1096)
* Add python 3.11 to manylinux buildwheels

* Fixing clippy.

* Node clippy.

* Python clippy.

* Changelog + version number update.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-11-07 08:45:04 +01:00
8129dd3309 pyo3: update to 0.17 (#1066)
* python: update bindings to edition 2021

* python: update to pyo3 0.17

* Updating testing.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2022-10-05 16:59:01 +02:00
519cc13be0 Upgrade pyo3 to 0.16 (#956)
* Upgrade pyo3 to 0.15

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Upgrade pyo3 to 0.16

Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>

* Install Python before running cargo clippy

* Fix clippy warnings

* Use `PyArray_Check` instead of downcasting to `PyArray1<u8>`

* Enable `auto-initialize` of pyo3 to fix `cargo test --no-default-features`

* Fix some test cases

Why do they change?

* Refactor and add SAFETY comments to `PyArrayUnicode`

Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`

Co-authored-by: messense <messense@icloud.com>
2022-05-05 15:48:40 +02:00
256a71c1f2 Clippy 1.54. (#773) 2021-08-11 14:43:49 +02:00
56a9196030 Fix clippy warnings 2021-03-16 12:32:06 -04:00
db22cb6315 Python - Fix Normalizer.normalize with PyNormalizedStringRefMut 2021-02-03 15:48:53 -05:00
817c5ad317 Fix clippy warnings for rust 1.49 2021-01-06 15:03:33 -05:00
5c35fafc44 Python - Decoders can get/set their attributes 2020-11-27 17:35:34 -05:00
2feccdbbfa Python - PyStrip can get/set its attributes 2020-11-27 17:35:34 -05:00
7512d5e4ce Python - PyBertNormalizer can get/set its attributes 2020-11-27 17:35:34 -05:00
c22cfc31f9 Python - PyNormalizer & PyPreTokenizer use a RwLock 2020-11-27 17:35:34 -05:00
5842b3db73 Python - Improve normalizers docs 2020-11-23 11:52:51 -05:00
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
88556790e7 Fixing a bug where long tokenizer files would be incorrectly deserialized (#459)
* Fixing a bug where long tokenizer files would be incorrectly
deserialized

- Add a bunch of tests to check deserialization behaviour
- One test also confirms current Single deserialization of Sequence.

* Better test locations for Windows + no file dependency in Python binding
Rust side.

* Addressing @n1t0's comments.
2020-10-13 18:44:24 +02:00
8308508577 Python - Update bindings for Replace Normalizer 2020-09-24 08:05:57 -04:00
b6e7a6e2f7 Python - Update PyNormalizer interface 2020-09-23 15:50:01 -04:00
8d04b22278 Python - Add support for custom Normalizer 2020-09-23 15:50:01 -04:00
940f8bd8fa Update PyO3 (#426) 2020-09-22 12:00:20 -04:00
aea22a4004 Adding node bindings.
- simplify normalizer.
- simplify python bindings.
2020-09-18 12:24:39 +02:00
792d618006 Adding a new "Replace" normalizer that takes a string and replaces it
with another String (for now).
2020-09-18 12:24:39 +02:00
75464734df Adding a new normalizer that strips accents by removing combining (#416)
* Adding a new normalizer that strips accents by removing combining characters in Unicode strings.

* Adding Node bindings

+ better normalizer impl.

* Doc comment -> Regular comment.
2020-09-17 09:49:41 +02:00
330876ae02 Improvements on spm parity: (#401)
* Removing all pre_tokenizer logic from Unigram algorithm.

* Improving *a lot* the parity check.

- We can now detect a lot more errors
- Special cases have been added temporarily.

* Adding 2 new normalizers that mimic spm's default behavior.

* Adding `encoding_optimized` version of the `encode` algorithm.

- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
  allocation of the full results.

* Trie<char> -> Trie<u8> Another improvement on speed.

* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.

* Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`.

- It will be used for direct compatibility with `Spm` and replace all
their custom rules by using directly the normalizer spec embedded
within spm files, removing all need for any rules for us.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
  the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.

* Fixing a bug with Precompiled normalizer.

* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.

It seems a very handy crafted trie does not prevent from shooting
oneself in the foot. Sorry future reader.

* Keep API stable for this PR (change of the API should come later #409).

- Removed sentencepiece_model_pb2 from binding and add instructions to
make `from_spm` work.

* Adding model check in `from_spm`.

* Addressing @n1t0's comments.

* Adding a check to make sure alignments stay correct.

Also added a bit more documentation on how Precompiled works.

* Extracting `Precompiled` into its own `spm_precompiled` crate.

* Using ranges in `do_nmt`.
2020-09-15 22:21:02 +02:00
df827d538f Adding clippy as a linter within the Python binding. (#388)
* Adding clippy as a linter within the Python binding.

* Missing clippy (dropped commit ??)
2020-09-04 09:09:02 -04:00
52082b5476 New clippy comments? 2020-09-02 16:32:50 +02:00
504d8c85d8 Remove Tokenizer::normalize
This is actually a legacy function that doesn't really make sense now, and is getting really difficult to keep. So we remove it.
2020-08-19 12:42:12 -04:00
16f75d9efc Ensure serialization works in all expected ways. 2020-08-04 15:59:33 -04:00
08b8c48127 Remove Container from Normalizers, replace with Arc.
* prefix the Python types in Rust with Py
* remove unsound Container wrappers, replace with Arc
2020-08-04 15:59:33 -04:00
7a95ffc4fa BertNormalizer has the same behavior as the original implementation 2020-07-06 13:55:18 -04:00
c5bba91bf4 Python - Test and fix classes pickling 2020-05-27 13:46:37 -04:00
6a70162d78 Python - Make all relevant classes picklable 2020-05-27 13:46:37 -04:00
be7b345bcd Require Send for all parts of the tokenizer (#222) 2020-04-08 13:35:06 -04:00
550413f00a add Send + Sync on all traits, remove elsewhere 2020-04-08 18:43:23 +02:00
2dc48e56ac Python - Update pyo3 version
* Use __new__ instead of static method as model constructors
2020-04-06 21:20:16 +02:00
afe9cfe96e Strip should inherit from Normalizer in the Python binding.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-31 20:20:09 +02:00
f263d7651f Python - RustFmt 2020-02-18 15:07:34 -05:00
bb8321ac0d Add Strip normalizer (#140)
* WIP strip.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rust StripNormalizer

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to specify strip direction

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Renamed StripNormalizer to Strip

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Python binding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Strip python compatible with pythonic constructor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Run RustFmt

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Clippy next ofc.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move lstrip and rstrip on NormalizedString

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* implement strip() for normalizer + unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add some more unittests on edge cases.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* clippy and fmt.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Simplify strip and fix offsets

* Python - Update strip bindings with default values

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-17 11:26:40 +01:00
0e5d81b400 Implement __new__ on Normalizers
__new__ allows Normalizers to be initialized as normal python
objects. This also means that Normalizers are given the correct class
name.
2020-02-10 10:43:19 +01:00
5bc1e2ee05 Add Lowercase Normalizer 2020-01-07 19:40:19 -05:00
185b6f0b8b Add Sequence Normalizer 2020-01-06 21:03:05 -05:00