.. tokenizers documentation master file, created by
   sphinx-quickstart on Fri Sep 25 14:32:54 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Tokenizers
======================================

Fast, state-of-the-art tokenizers, optimized for both research and production.

`🤗 Tokenizers`_ provides an implementation of today's most used tokenizers, with a focus
on performance and versatility. These tokenizers are also used in `🤗 Transformers`_.

.. _🤗 Tokenizers: https://github.com/huggingface/tokenizers
.. _🤗 Transformers: https://github.com/huggingface/transformers

Main features:
--------------

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation:
  it takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it is always possible to
  recover the part of the original sentence that corresponds to any token (see the sketch
  at the end of this page).
- Handles all the pre-processing: truncation, padding, and adding the special tokens your
  model needs (also illustrated at the end of this page).

Components:
-----------

.. toctree::
   :maxdepth: 2

   tokenizer_blocks

Load an existing tokenizer:
---------------------------

Loading a previously saved tokenizer is extremely simple and requires a single line of code:

.. only:: rust

    .. literalinclude:: ../../tokenizers/tests/documentation.rs
        :language: rust
        :start-after: START load_tokenizer
        :end-before: END load_tokenizer
        :dedent: 4

.. only:: python

    .. literalinclude:: ../../bindings/python/tests/documentation/test_load.py
        :language: python
        :start-after: START load_tokenizer
        :end-before: END load_tokenizer
        :dedent: 4

.. only:: node

    .. literalinclude:: ../../bindings/node/examples/load.test.js
        :language: javascript
        :start-after: START load
        :end-before: END load
        :dedent: 4

Train a tokenizer:
------------------

A short guide on :ref:`how to create a Tokenizer <options>`.

.. only:: rust

    .. literalinclude:: ../../tokenizers/tests/documentation.rs
        :language: rust
        :start-after: START train_tokenizer
        :end-before: END train_tokenizer
        :dedent: 4

.. only:: python

    .. literalinclude:: ../../bindings/python/tests/documentation/test_train.py
        :language: python
        :start-after: START train_tokenizer
        :end-before: END train_tokenizer
        :dedent: 4

.. only:: node

    .. literalinclude:: ../../bindings/node/examples/train.test.js
        :language: javascript
        :start-after: START train_tokenizer
        :end-before: END train_tokenizer
        :dedent: 4
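
In addition to the built examples above, here is a minimal, self-contained sketch of what
training looks like with the Python bindings, assuming a plain BPE model trained from local
text files (the file names below are placeholders, not files shipped with the library):

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Build a tokenizer around an (initially empty) BPE model
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # The trainer learns the vocabulary; special tokens are reserved up front
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    # "data/corpus.txt" is a placeholder for your own training data
    tokenizer.train(files=["data/corpus.txt"], trainer=trainer)
    tokenizer.save("tokenizer.json")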
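
Once saved, the tokenizer reloads with a single line, and the full alignment tracking
mentioned in the features list is available on every encoding through character offsets:

.. code-block:: python

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("tokenizer.json")

    sentence = "Hello, y'all! How are you?"
    encoding = tokenizer.encode(sentence)

    print(encoding.tokens)  # the produced tokens
    print(encoding.ids)     # their ids in the learned vocabulary

    # Each token carries its (start, end) character offsets into the original
    # sentence, so the source text behind any token can always be recovered
    for token, (start, end) in zip(encoding.tokens, encoding.offsets):
        print(token, "->", sentence[start:end])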
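
Truncation and padding are enabled directly on the tokenizer as well. A short sketch,
assuming the "[PAD]" token was registered during training as in the example above:

.. code-block:: python

    # pad_id must match the id of "[PAD]" in the trained vocabulary
    pad_id = tokenizer.token_to_id("[PAD]")
    tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")
    tokenizer.enable_truncation(max_length=512)

    # encode_batch pads every sequence in the batch to the same length
    encodings = tokenizer.encode_batch(["A short sentence", "A slightly longer one"])
    print([e.ids for e in encodings])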