Fixing a bug where long tokenizer files would be incorrectly deserialized (#459)

* Fixing a bug where long tokenizer files would be incorrectly
deserialized

- Add a bunch of tests to check deserialization behaviour
- One test also confirms the current Single deserialization of Sequence.

* Better test locations for Windows + no file dependency in the Python binding's
Rust side.

* Addressing @n1t0's comments.
Nicolas Patry
2020-10-13 18:44:24 +02:00
committed by GitHub
parent b3c016cf9c
commit 88556790e7
8 changed files with 72 additions and 11 deletions

@@ -0,0 +1,10 @@
from tokenizers import Tokenizer, models, normalizers
from .utils import data_dir, albert_base


class TestSerialization:
    def test_full_serialization_albert(self, albert_base):
        # Check we can read this file.
        # This used to fail because the BufReader would fail when the file
        # exceeds the buffer capacity.
        tokenizer = Tokenizer.from_file(albert_base)
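
A minimal Rust sketch of the kind of fix this test guards against, assuming the underlying problem was that a fixed-capacity BufReader truncated long tokenizer JSON before deserialization. The Tokenizer struct and from_file helper below are hypothetical stand-ins (requiring the serde and serde_json crates), not the upstream implementation:

use std::{fs, path::Path};
use serde::Deserialize;

// Hypothetical stand-in for the real tokenizer type; fields omitted.
// Serde ignores unknown JSON fields by default, so a full tokenizer file still parses.
#[derive(Deserialize)]
struct Tokenizer {}

// read_to_string loops until EOF, so a tokenizer JSON longer than any internal
// buffer is still read in full before being handed to serde_json.
fn from_file(path: &Path) -> Result<Tokenizer, Box<dyn std::error::Error>> {
    let content = fs::read_to_string(path)?;
    let tokenizer: Tokenizer = serde_json::from_str(&content)?;
    Ok(tokenizer)
}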