Fixing a bug where long tokenizer files would be incorrectly deserialized (#459)

* Fixing a bug where long tokenizer files would be incorrectly
deserialized

- Add a bunch of tests to check deserialization behaviour
- One test also confirms the current Single deserialization of Sequence.
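The kind of regression test this adds can be sketched with a stdlib-only analogue (hypothetical structure; the real tests exercise the tokenizers serialization APIs): deserialize a tokenizer-style JSON file large enough to exceed typical read-buffer sizes, and confirm a lossless round trip.

```python
import json
import tempfile

def roundtrip(path):
    """Deserialize a tokenizer-style JSON file and re-serialize it.

    The regression this guards against: files longer than an internal
    read buffer being truncated or mis-parsed during deserialization.
    """
    with open(path, "r", encoding="utf-8") as f:
        obj = json.load(f)
    return json.dumps(obj, sort_keys=True)

# Build a "long" tokenizer-like file: a large vocab pushes the JSON
# well past typical buffer sizes (illustrative layout, not the exact schema).
vocab = {f"token_{i}": i for i in range(200_000)}
payload = {"model": {"type": "BPE", "vocab": vocab, "merges": []}}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(payload, f)
    path = f.name

# A correct deserializer reproduces the same content regardless of file size.
assert roundtrip(path) == json.dumps(payload, sort_keys=True)
```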

* Better test locations for Windows + no file dependency on the Rust side of the Python binding.

* Addressing @n1t0's comments.
Author: Nicolas Patry (committed via GitHub)
Date: 2020-10-13 18:44:24 +02:00
Parent: b3c016cf9c
Commit: 88556790e7
8 changed files with 72 additions and 11 deletions


@@ -75,6 +75,13 @@ def train_files(data_dir):
    }


@pytest.fixture(scope="session")
def albert_base(data_dir):
    return download(
        "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-tokenizer.json"
    )


def multiprocessing_with_parallelism(tokenizer, enabled: bool):
    """
    This helper can be used to test that disabling parallelism avoids deadlocks when the