.. tokenizers documentation master file, created by
   sphinx-quickstart on Fri Sep 25 14:32:54 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Tokenizers
======================================

Fast state-of-the-art tokenizers, optimized for both research and production.

`🤗 Tokenizers`_ provides an implementation of today's most used tokenizers, with
a focus on performance and versatility. These tokenizers are also used in
`🤗 Transformers`_.

.. _🤗 Tokenizers: https://github.com/huggingface/tokenizers
.. _🤗 Transformers: https://github.com/huggingface/transformers

Main features:
--------------

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. It takes
  less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get
  the part of the original sentence that corresponds to any token (see the sketch after this
  list).
- Does all the pre-processing: truncation, padding, and adding the special tokens your model
  needs.
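
As a quick illustration of the alignment tracking above, here is a minimal Python sketch.
It assumes a tokenizer that was previously saved with ``Tokenizer.save``; the file name
``tokenizer.json`` is a placeholder, not part of the original docs:

.. code-block:: python

    from tokenizers import Tokenizer

    # Placeholder path: any tokenizer saved with Tokenizer.save() works here.
    tokenizer = Tokenizer.from_file("tokenizer.json")

    sentence = "Hello, y'all! How are you 😁 ?"
    encoding = tokenizer.encode(sentence)

    # Each token keeps its (start, end) offsets into the *original* sentence,
    # even if normalization lowercased, stripped, or replaced characters.
    for token, (start, end) in zip(encoding.tokens, encoding.offsets):
        print(token, "->", sentence[start:end])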

Components:
------------

.. toctree::
    :maxdepth: 2

    tokenizer_blocks

Load an existing tokenizer:
---------------------------

Loading a previously saved tokenizer is extremely simple and requires a single line of code:

.. only:: rust

    .. literalinclude:: ../../tokenizers/tests/documentation.rs
        :language: rust
        :start-after: START load_tokenizer
        :end-before: END load_tokenizer
        :dedent: 4

.. only:: python

    .. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
        :language: python
        :start-after: START load_tokenizer
        :end-before: END load_tokenizer
        :dedent: 4

.. only:: node

    .. literalinclude:: ../../bindings/node/examples/load.test.js
        :language: javascript
        :start-after: START load
        :end-before: END load
        :dedent: 4
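
With the Python binding, for instance, that single line is ``Tokenizer.from_file`` (a
minimal sketch; ``tokenizer.json`` is a placeholder for your own saved file):

.. code-block:: python

    from tokenizers import Tokenizer

    # Load a tokenizer previously serialized to a single JSON file.
    tokenizer = Tokenizer.from_file("tokenizer.json")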

Train a tokenizer:
------------------

For all the options available when creating a ``Tokenizer``, see the small guide on the
:ref:`building blocks <tokenizer_blocks>`.

.. only:: rust

    .. literalinclude:: ../../tokenizers/tests/documentation.rs
        :language: rust
        :start-after: START train_tokenizer
        :end-before: END train_tokenizer
        :dedent: 4

.. only:: python

    .. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
        :language: python
        :start-after: START train_tokenizer
        :end-before: END train_tokenizer
        :dedent: 4

.. only:: node

    .. literalinclude:: ../../bindings/node/examples/train.test.js
        :language: javascript
        :start-after: START train_tokenizer
        :end-before: END train_tokenizer
        :dedent: 4
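
For reference, here is a minimal Python sketch of the training flow. The choice of model
(BPE), pre-tokenizer, special tokens, and the file names ``data.txt`` and ``tokenizer.json``
are illustrative assumptions; the snippet included above is the authoritative version for
each language:

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Build a tokenizer around a BPE model with an unknown token.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Train from plain-text files; "data.txt" is a placeholder path.
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train(files=["data.txt"], trainer=trainer)

    # Serialize everything to a single JSON file for later loading.
    tokenizer.save("tokenizer.json")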