Update docs for from_pretrained
@@ -706,10 +706,22 @@ In this case, the `attention mask` generated by the tokenizer takes the padding
 
 .. only:: python
 
     Using a pretrained tokenizer
-    ----------------------------------------------------------------------------------------------------
+    ------------------------------------------------------------------------------------------------
 
-    You can also use a pretrained tokenizer directly in, as long as you have its vocabulary file. For
-    instance, here is how to get the classic pretrained BERT tokenizer:
+    You can load any tokenizer from the Hugging Face Hub as long as a `tokenizer.json` file is
+    available in the repository.
+
+    .. code-block:: python
+
+        from tokenizers import Tokenizer
+
+        tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
+
+    Importing a pretrained tokenizer from legacy vocabulary files
+    ------------------------------------------------------------------------------------------------
+
+    You can also import a pretrained tokenizer directly, as long as you have its vocabulary file.
+    For instance, here is how to import the classic pretrained BERT tokenizer:
 
     .. code-block:: python
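As a usage sketch of the new Hub loading path added above: once `Tokenizer.from_pretrained` has fetched the `tokenizer.json`, the returned object exposes the library's usual `encode` API. Only the `from_pretrained` call itself is what the hunk documents; the sample sentence and the printed attributes below are illustrative.

.. code-block:: python

    from tokenizers import Tokenizer

    # Download the tokenizer.json published in the bert-base-uncased repository on the Hub
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

    # Encode a sample sentence and inspect the resulting Encoding object
    encoding = tokenizer.encode("Hello, y'all! How are you?")
    print(encoding.tokens)  # word-piece strings
    print(encoding.ids)     # matching vocabulary ids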
@@ -722,8 +734,3 @@ In this case, the `attention mask` generated by the tokenizer takes the padding
     .. code-block:: bash
 
         wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
-
-    .. note::
-
-        Better support for pretrained tokenizers is coming in a next release, so expect this API to
-        change soon.
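This second hunk keeps the `wget` step for the legacy vocabulary-file route and drops the "better support is coming" note, but the diff context cuts off before the matching Python snippet. A minimal sketch of that legacy import, assuming the library's `BertWordPieceTokenizer` helper, the file name fetched above, and `lowercase=True` for the uncased model:

.. code-block:: python

    from tokenizers import BertWordPieceTokenizer

    # Build a WordPiece tokenizer from the legacy vocabulary file fetched with wget above
    tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

    encoding = tokenizer.encode("Hello, y'all! How are you?")
    print(encoding.tokens)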