Fixing a bug where long tokenizer files would be incorrectly deserialized (#459)

* Fixing a bug where long tokenizer files would be incorrectly
deserialized

- Add a bunch of tests to check deserialization behaviour
- One test also confirms the current Single deserialization of Sequence.
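The kind of regression test this adds can be sketched with a stdlib-only analogue (hypothetical structure; the real tests exercise the tokenizers serialization APIs): deserialize a tokenizer-style JSON file large enough to exceed typical read-buffer sizes, and confirm a lossless round trip.

```python
import json
import tempfile

def roundtrip(path):
    """Deserialize a tokenizer-style JSON file and re-serialize it.

    The regression this guards against: files longer than an internal
    read buffer being truncated or mis-parsed during deserialization.
    """
    with open(path, "r", encoding="utf-8") as f:
        obj = json.load(f)
    return json.dumps(obj, sort_keys=True)

# Build a "long" tokenizer-like file: a large vocab pushes the JSON
# well past typical buffer sizes (illustrative layout, not the exact schema).
vocab = {f"token_{i}": i for i in range(200_000)}
payload = {"model": {"type": "BPE", "vocab": vocab, "merges": []}}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(payload, f)
    path = f.name

# A correct deserializer reproduces the same content regardless of file size.
assert roundtrip(path) == json.dumps(payload, sort_keys=True)
```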

* Better test locations for Windows + no file dependency on the Rust side of the Python binding.

* Addressing @n1t0's comments.
Author: Nicolas Patry (committed via GitHub)
Date: 2020-10-13 18:44:24 +02:00
Parent: b3c016cf9c
Commit: 88556790e7
8 changed files with 72 additions and 11 deletions


@@ -75,6 +75,13 @@ def train_files(data_dir):
    }


@pytest.fixture(scope="session")
def albert_base(data_dir):
    return download(
        "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-tokenizer.json"
    )


def multiprocessing_with_parallelism(tokenizer, enabled: bool):
    """
    This helper can be used to test that disabling parallelism avoids deadlocks when the