Update docs for from_pretrained
@@ -3,6 +3,7 @@ from setuptools_rust import Binding, RustExtension
 
 extras = {}
 extras["testing"] = ["pytest", "requests", "numpy", "datasets"]
+extras["docs"] = ["sphinx", "sphinx_rtd_theme", "setuptools_rust"]
 
 setup(
     name="tokenizers",
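For context, this `extras` dict is the standard setuptools extras mechanism. A minimal sketch of how such a dict is typically wired into `setup()` follows; the hunk only shows the dict itself, so the `extras_require` line below is an assumption about the surrounding setup.py, not the verbatim file:

.. code-block:: python

    # Minimal sketch, assuming the extras dict is passed via extras_require;
    # the hunk above does not show the actual setup() call details.
    from setuptools import setup

    extras = {}
    extras["testing"] = ["pytest", "requests", "numpy", "datasets"]
    extras["docs"] = ["sphinx", "sphinx_rtd_theme", "setuptools_rust"]

    setup(
        name="tokenizers",
        extras_require=extras,  # enables e.g. `pip install tokenizers[docs]`
    )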
@@ -706,10 +706,22 @@ In this case, the `attention mask` generated by the tokenizer takes the padding
 .. only:: python
 
     Using a pretrained tokenizer
-    ----------------------------------------------------------------------------------------------------
+    ------------------------------------------------------------------------------------------------
 
-    You can also use a pretrained tokenizer directly in, as long as you have its vocabulary file. For
-    instance, here is how to get the classic pretrained BERT tokenizer:
+    You can load any tokenizer from the Hugging Face Hub as long as a `tokenizer.json` file is
+    available in the repository.
+
+    .. code-block:: python
+
+        from tokenizers import Tokenizer
+
+        tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
+
+    Importing a pretrained tokenizer from legacy vocabulary files
+    ------------------------------------------------------------------------------------------------
+
+    You can also import a pretrained tokenizer directly in, as long as you have its vocabulary file.
+    For instance, here is how to import the classic pretrained BERT tokenizer:
 
     .. code-block:: python
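The new docs stop at loading the tokenizer. As a quick usage sketch (the sample sentence and the printed attributes are illustrative additions, not part of the commit), the object returned by `Tokenizer.from_pretrained` can encode text directly:

.. code-block:: python

    from tokenizers import Tokenizer

    # Fetch the tokenizer.json published for this model on the Hugging Face Hub.
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

    # The Encoding exposes the tokens, their vocabulary ids, and the attention mask.
    encoding = tokenizer.encode("Hello, y'all! How are you?")
    print(encoding.tokens)          # ['[CLS]', 'hello', ',', 'y', "'", 'all', ...]
    print(encoding.ids)             # corresponding vocabulary ids
    print(encoding.attention_mask)  # 1 for real tokens, 0 for padding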
@@ -722,8 +734,3 @@ In this case, the `attention mask` generated by the tokenizer takes the padding
 .. code-block:: bash
 
     wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
-
-.. note::
-
-    Better support for pretrained tokenizers is coming in a next release, so expect this API to
-    change soon.
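The Python code block that follows this `wget` in the docs is cut off by the hunk. A plausible sketch of the legacy import it refers to, using the `BertWordPieceTokenizer` helper from the `tokenizers` package (the exact class and arguments used in the docs are an assumption here):

.. code-block:: python

    from tokenizers import BertWordPieceTokenizer

    # Build a BERT WordPiece tokenizer from the downloaded legacy vocab file.
    tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

    encoding = tokenizer.encode("Hello, y'all!")
    print(encoding.tokens)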