Fixing a bug where long tokenizer files would be incorrectly deserialized (#459)

* Fixing a bug where long tokenizer files would be incorrectly
deserialized

- Add a bunch of tests to check deserialization behaviour
- One test also confirms the current Single deserialization of Sequence.

* Better test locations for Windows + no file dependency in the Python binding's
Rust side.

* Addressing @n1t0's comments.
Nicolas Patry
2020-10-13 18:44:24 +02:00
committed by GitHub
parent b3c016cf9c
commit 88556790e7
8 changed files with 72 additions and 11 deletions

@@ -0,0 +1,10 @@
from tokenizers import Tokenizer, models, normalizers
from .utils import data_dir, albert_base


class TestSerialization:
    def test_full_serialization_albert(self, albert_base):
        # Check we can read this file.
        # This used to fail because the BufReader would fail when the file
        # exceeds the buffer capacity.
        tokenizer = Tokenizer.from_file(albert_base)
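
A minimal Rust sketch of the kind of fix this test guards against, assuming the underlying problem was that a fixed-capacity BufReader truncated long tokenizer JSON before deserialization. The Tokenizer struct and from_file helper below are hypothetical stand-ins (requiring the serde and serde_json crates), not the upstream implementation:

use std::{fs, path::Path};
use serde::Deserialize;

// Hypothetical stand-in for the real tokenizer type; fields omitted.
// Serde ignores unknown JSON fields by default, so a full tokenizer file still parses.
#[derive(Deserialize)]
struct Tokenizer {}

// read_to_string loops until EOF, so a tokenizer JSON longer than any internal
// buffer is still read in full before being handed to serde_json.
fn from_file(path: &Path) -> Result<Tokenizer, Box<dyn std::error::Error>> {
    let content = fs::read_to_string(path)?;
    let tokenizer: Tokenizer = serde_json::from_str(&content)?;
    Ok(tokenizer)
}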