mirror of
https://github.com/mii443/tokenizers.git
synced 2025-08-22 16:25:30 +00:00
20 lines
988 B
Plaintext
<!-- DISABLE-FRONTMATTER-SECTIONS -->

# Tokenizers

Fast, state-of-the-art tokenizers, optimized for both research and production.

[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an implementation of today's most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).

# Main features:

- Train new vocabularies and tokenize text using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Tokenizing a GB of text takes less than 20 seconds on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
- Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
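
To make the alignment-tracking feature concrete, here is a minimal, standard-library-only sketch of the idea. It is not the library's implementation or API: the function names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and the normalization (accent stripping plus lowercasing) and whitespace splitting are simplifying assumptions chosen for illustration.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording for each output character
    the index of the original character it came from (hypothetical helper)."""
    chars, origins = [], []
    for index, char in enumerate(text):
        # NFD splits an accented letter into a base letter plus combining marks.
        # Dropping the combining marks is the destructive part of the normalization.
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece):
                continue
            for lowered in piece.lower():
                chars.append(lowered)
                origins.append(index)
    return "".join(chars), origins


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace and return
    (token, (start, end)) pairs whose offsets point into the ORIGINAL text."""
    normalized, origins = normalize_with_alignment(text)
    tokens = []
    start = None
    # The trailing sentinel space flushes the last token.
    for position, char in enumerate(normalized + " "):
        if char.isspace():
            if start is not None:
                token = normalized[start:position]
                span = (origins[start], origins[position - 1] + 1)
                tokens.append((token, span))
                start = None
        elif start is None:
            start = position
    return tokens


original = "H\u00e9llo  W\u00f6rld"
for token, (begin, end) in tokenize_with_offsets(original):
    # Slicing the original text with the recorded span recovers the
    # accented, cased source of each normalized token.
    print(token, (begin, end), repr(original[begin:end]))
```

Each token is produced from the normalized text, yet its span slices the untouched input, so the accented and capitalized source of every token stays recoverable.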