.. tokenizers documentation master file, created by
   sphinx-quickstart on Fri Sep 25 14:32:54 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Tokenizers
======================================

Fast, state-of-the-art tokenizers, optimized for both research and production.

`🤗 Tokenizers`_ provides an implementation of today's most used tokenizers,
with a focus on performance and versatility. These tokenizers are also used in
`🤗 Transformers`_.

.. _🤗 Tokenizers: https://github.com/huggingface/tokenizers
.. _🤗 Transformers: https://github.com/huggingface/transformers

Main features:
--------------

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
  less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get
  the part of the original sentence that corresponds to any token (see the sketch after
  this list).
- Does all the pre-processing: truncation, padding, and adding the special tokens your
  model needs.
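
As a quick illustration of the alignment tracking mentioned above, here is a minimal
Python sketch (assuming the ``bert-base-uncased`` tokenizer can be fetched from the
Hugging Face Hub):

.. code-block:: python

    from tokenizers import Tokenizer

    # Assumes network access to fetch a pretrained tokenizer from the Hub.
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

    sentence = "Héllò, y'all!"
    encoding = tokenizer.encode(sentence)

    # Offsets point back into the ORIGINAL sentence, even though normalization
    # lowercased the text and stripped the accents. Special tokens such as
    # [CLS] and [SEP] map to the empty (0, 0) span.
    for token, (start, end) in zip(encoding.tokens, encoding.offsets):
        print(f"{token!r} -> {sentence[start:end]!r}")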
Components:
------------

.. toctree::
   :maxdepth: 2

   tokenizer_blocks

Load an existing tokenizer:
---------------------------

Loading a previously saved tokenizer is extremely simple and requires a single line of code:

.. only:: rust

   .. literalinclude:: ../../tokenizers/tests/documentation.rs
      :language: rust
      :start-after: START load_tokenizer
      :end-before: END load_tokenizer
      :dedent: 4

.. only:: python

   .. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
      :language: python
      :start-after: START load_tokenizer
      :end-before: END load_tokenizer
      :dedent: 4

.. only:: node

   .. literalinclude:: ../../bindings/node/examples/load.test.js
      :language: javascript
      :start-after: START load
      :end-before: END load
      :dedent: 4
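
For a quick sense of what the Python version looks like, here is a minimal sketch
(``tokenizer.json`` is a hypothetical path to a previously saved tokenizer file):

.. code-block:: python

    from tokenizers import Tokenizer

    # "tokenizer.json" is a hypothetical path to a tokenizer saved earlier
    # with Tokenizer.save(...).
    tokenizer = Tokenizer.from_file("tokenizer.json")

    output = tokenizer.encode("Hello, y'all! How are you?")
    print(output.tokens)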
Train a tokenizer:
------------------

For a small guide to all the available options, see :ref:`how to create a Tokenizer<tokenizer_blocks>`.

.. only:: rust

   .. literalinclude:: ../../tokenizers/tests/documentation.rs
      :language: rust
      :start-after: START train_tokenizer
      :end-before: END train_tokenizer
      :dedent: 4

.. only:: python

   .. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
      :language: python
      :start-after: START train_tokenizer
      :end-before: END train_tokenizer
      :dedent: 4

.. only:: node

   .. literalinclude:: ../../bindings/node/examples/train.test.js
      :language: javascript
      :start-after: START train_tokenizer
      :end-before: END train_tokenizer
      :dedent: 4
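
As a rough outline of what training typically involves in Python, here is a minimal
sketch that builds a BPE tokenizer from scratch (``corpus.txt`` is a hypothetical path
to plain-text training data):

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Start from an untrained BPE model.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Learn a vocabulary from raw text files; "corpus.txt" is hypothetical.
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)

    # Persist everything (vocabulary, merges, configuration) to a single file.
    tokenizer.save("tokenizer.json")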