tokenizers/docs/source-doc-builder/training_from_memory.mdx

# Training from memory

In the [Quicktour](quicktour.html), we saw how to build and train a
tokenizer using text files, but we can actually use any Python Iterator.
In this section we'll see a few different ways of training our
tokenizer.

For all the examples listed below, we'll use the same [`~tokenizers.Tokenizer`] and
[`~tokenizers.trainers.Trainer`], built as
following:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START init_tokenizer_trainer",
"end-before": "END init_tokenizer_trainer",
"dedent": 8}
</literalinclude>

This tokenizer is based on the [`~tokenizers.models.Unigram`] model. It
takes care of normalizing the input using the NFKC Unicode normalization
method, and uses a [`~tokenizers.pre_tokenizers.ByteLevel`] pre-tokenizer with the corresponding decoder.

For more information on the components used here, you can check
[here](components.html)

## The most basic way

As you probably guessed already, the easiest way to train our tokenizer
is by using a `List`{.interpreted-text role="obj"}:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_basic",
"end-before": "END train_basic",
"dedent": 8}
</literalinclude>

Easy, right? You can use anything working as an iterator here, be it a
`List`{.interpreted-text role="obj"}, `Tuple`{.interpreted-text
role="obj"}, or a `np.Array`{.interpreted-text role="obj"}. Anything
works as long as it provides strings.

## Using the 🤗 Datasets library

An awesome way to access one of the many datasets that exist out there
is by using the 🤗 Datasets library. For more information about it, you
should check [the official documentation
here](https://huggingface.co/docs/datasets/).

Let's start by loading our dataset:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START load_dataset",
"end-before": "END load_dataset",
"dedent": 8}
</literalinclude>

The next step is to build an iterator over this dataset. The easiest way
to do this is probably by using a generator:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START def_batch_iterator",
"end-before": "END def_batch_iterator",
"dedent": 8}
</literalinclude>

As you can see here, for improved efficiency we can actually provide a
batch of examples used to train, instead of iterating over them one by
one. By doing so, we can expect performances very similar to those we
got while training directly from files.

With our iterator ready, we just need to launch the training. In order
to improve the look of our progress bars, we can specify the total
length of the dataset:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START train_datasets",
"end-before": "END train_datasets",
"dedent": 8}
</literalinclude>

And that's it!

## Using gzip files

Since gzip files in Python can be used as iterators, it is extremely
simple to train on such files:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START single_gzip",
"end-before": "END single_gzip",
"dedent": 8}
</literalinclude>

Now if we wanted to train from multiple gzip files, it wouldn't be much
harder:

<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_tutorial_train_from_iterators.py",
"language": "python",
"start-after": "START multi_gzip",
"end-before": "END multi_gzip",
"dedent": 8}
</literalinclude>

And voilà!