Update docs for from_pretrained

Anthony Moi
2021-08-19 16:50:17 +02:00
committed by Anthony MOI
parent 528c9a532e
commit a4d0f3dd18
2 changed files with 16 additions and 8 deletions


@@ -3,6 +3,7 @@ from setuptools_rust import Binding, RustExtension
extras = {}
extras["testing"] = ["pytest", "requests", "numpy", "datasets"]
extras["docs"] = ["sphinx", "sphinx_rtd_theme", "setuptools_rust"]
setup(
name="tokenizers",


@@ -706,10 +706,22 @@ In this case, the `attention mask` generated by the tokenizer takes the padding
.. only:: python
Using a pretrained tokenizer
----------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
You can also use a pretrained tokenizer directly in, as long as you have its vocabulary file. For
instance, here is how to get the classic pretrained BERT tokenizer:
You can load any tokenizer from the Hugging Face Hub as long as a `tokenizer.json` file is
available in the repository.
.. code-block:: python
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
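The object returned by `from_pretrained` is a regular `Tokenizer`, so it can be used right away.
A minimal sketch of encoding a sentence with it (the sample sentence is only illustrative):
.. code-block:: python
# Encode a sentence with the tokenizer loaded from the Hub and inspect its tokens
output = tokenizer.encode("Hello, y'all! How are you?")
print(output.tokens)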
Importing a pretrained tokenizer from legacy vocabulary files
------------------------------------------------------------------------------------------------
You can also import a pretrained tokenizer directly, as long as you have its vocabulary file.
For instance, here is how to import the classic pretrained BERT tokenizer:
.. code-block:: python
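# A sketch of the example elided between the two hunks, assuming the legacy
# BertWordPieceTokenizer helper bundled with the library and the vocabulary
# file fetched with the wget command shown below
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)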
@@ -722,8 +734,3 @@ In this case, the `attention mask` generated by the tokenizer takes the padding
.. code-block:: bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
.. note::
Better support for pretrained tokenizers is coming in a next release, so expect this API to
change soon.