Files
tokenizers/docs/source/index.rst
2020-11-02 17:07:27 -05:00

71 lines
2.1 KiB
ReStructuredText

Tokenizers
====================================================================================================
Fast State-of-the-art tokenizers, optimized for both research and production
`🤗 Tokenizers`_ provides an implementation of today's most used tokenizers, with
a focus on performance and versatility. These tokenizers are also used in
`🤗 Transformers`_.
.. _🤗 Tokenizers: https://github.com/huggingface/tokenizers
.. _🤗 Transformers: https://github.com/huggingface/transformers
Main features:
----------------------------------------------------------------------------------------------------
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get
the part of the original sentence that corresponds to any token.
- Does all the pre-processing: Truncation, Padding, add the special tokens your model needs.
.. toctree::
:maxdepth: 2
:caption: Getting Started
quicktour
installation/main
pipeline
components
.. toctree::
:maxdepth: 3
:caption: API Reference
api/reference
.. entities:: python
:global:
class
class
classmethod
class method
Tokenizer
:class:`~tokenizers.Tokenizer`
Tokenizer.train
:meth:`~tokenizers.Tokenizer.train`
Tokenizer.save
:meth:`~tokenizers.Tokenizer.save`
Tokenizer.from_file
:meth:`~tokenizers.Tokenizer.from_file`
.. entities:: rust
:global:
class
struct
classmethod
static method
Tokenizer
`Tokenizer <https://docs.rs/tokenizers/latest/tokenizers/tokenizer/struct.Tokenizer.html>`__
Tokenizer.train
`train <https://docs.rs/tokenizers/0.10.1/tokenizers/tokenizer/struct.Tokenizer.html#method.train>`__