tokenizers/bindings/python/examples/train_with_datasets.py
Nicolas Patry 6113666624 Updating python formatting. (#1079)
* Updating python formatting.

* Forgot gh action.

* Skipping isort to prevent circular imports.

* Updating stub.

* Removing `isort` (it contradicts `stub.py`).

* Fixing weird stub black/isort disagreement.
2022-10-05 15:29:33 +02:00


import datasets

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers


# Build a tokenizer
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
bpe_tokenizer.normalizer = normalizers.Lowercase()

# Initialize a dataset
dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1")


# Build an iterator over this dataset
def batch_iterator():
    """Yield the "text" column of the train split in batches of 1000 rows."""
    batch_length = 1000
    for i in range(0, len(dataset["train"]), batch_length):
        yield dataset["train"][i : i + batch_length]["text"]


# And finally train
# `length` is the total number of sequences; it is only used for progress tracking.
bpe_tokenizer.train_from_iterator(batch_iterator(), length=len(dataset["train"]))
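
The script above imports `trainers` without using it and needs to download wikitext before anything runs. As a rough sketch of the same pattern on a smaller scale, the variant below trains on a tiny in-memory corpus and passes an explicit `BpeTrainer`. The corpus sentences, `vocab_size=200`, the `[UNK]` token, and the batch size of 2 are all illustrative assumptions, not values taken from the original example.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Tiny in-memory corpus standing in for the wikitext dataset (illustrative only).
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A tokenizer splits text into smaller units.",
    "Byte pair encoding merges frequent symbol pairs.",
    "Training from an iterator avoids writing files to disk.",
]

# Same pipeline as the script above, plus an unknown token (an assumption).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.normalizer = normalizers.Lowercase()

# Explicit trainer; vocab_size and the special token are arbitrary example values.
trainer = trainers.BpeTrainer(
    vocab_size=200, special_tokens=["[UNK]"], show_progress=False
)


def batch_iterator(batch_length=2):
    """Yield the corpus as lists of at most `batch_length` strings."""
    for i in range(0, len(corpus), batch_length):
        yield corpus[i : i + batch_length]


# Train directly from the iterator, without writing any file to disk.
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(corpus))

# Encode a sample sentence with the freshly trained tokenizer.
encoding = tokenizer.encode("The lazy fox")
print(encoding.tokens)
print(encoding.ids)
```

The exact tokens printed depend on the merges learned from such a small corpus, so no expected output is shown. If no `trainer` is passed, as in the script above, the library falls back to the model's default trainer.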