mirror of
https://github.com/mii443/tokenizers.git
synced 2025-08-23 00:35:35 +00:00
18
README.md
@@ -1,22 +1,22 @@
 # Tokenizers
 
-Provides an implementation of today's most used tokenizers, with a focus on performances and
+Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 
 ## What is a Tokenizer
 
-A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
 `Encoding`.
 The various steps of the pipeline are:
 
-1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-the unicode normalization standards, such as `NFD` or `NFKC`.
-2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-splitting text is simply on whitespaces, to manipulate words.
-3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of
+splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 `BPE` or `WordPiece`.
-4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-that a language model would need, like special tokens.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+that, for example, a language model would need, such as special tokens.
 
 ## Bindings
 
@@ -2,27 +2,25 @@
 #![doc(html_favicon_url = "https://huggingface.co/favicon.ico")]
 #![doc(html_logo_url = "https://huggingface.co/landing/assets/huggingface_logo.svg")]
 
-//!
 //! # Tokenizers
 //!
-//! Provides an implementation of today's most used tokenizers, with a focus on performances and
+//! Provides an implementation of today's most used tokenizers, with a focus on performance and
 //! versatility.
 //!
 //! ## What is a Tokenizer
 //!
-//! A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+//! A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
 //! `Encoding`.
 //! The various steps of the pipeline are:
 //!
-//! 1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-//! the unicode normalization standards, such as `NFD` or `NFKC`.
-//! 2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-//! splitting text is simply on whitespaces, to manipulate words.
-//! 3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+//! 1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+//! the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+//! 2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of
+//! splitting text is simply on whitespace.
+//! 3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 //! `BPE` or `WordPiece`.
-//! 4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-//! that a language model would need, like special tokens.
-//!
+//! 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+//! that, for example, a language model would need, such as special tokens.
 
 #[macro_use]
 extern crate lazy_static;
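The four pipeline steps described in the documentation above can be illustrated with a small standalone sketch. This is not the crate's API: every name here (`Encoding` as a plain struct, `normalize`, `pre_tokenize`, `model`, `post_process`, `encode`, `toy_vocab`) is hypothetical, lowercasing stands in for real Unicode normalization such as `NFD` or `NFKC`, and a whole-word vocabulary lookup stands in for a real `Model` such as `BPE` or `WordPiece`. The `[CLS]`, `[SEP]` and `[UNK]` special tokens are likewise assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the pipeline output: tokens and their ids.
#[derive(Debug, PartialEq)]
struct Encoding {
    tokens: Vec<String>,
    ids: Vec<u32>,
}

/// Step 1, Normalizer: lowercasing only (assumption; real normalizers
/// would apply a Unicode form such as NFD or NFKC).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// Step 2, PreTokenizer: create initial word splits on whitespace.
fn pre_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Step 3, Model: toy whole-word lookup; unknown words map to `[UNK]`.
/// Assumes the vocabulary contains an `[UNK]` entry.
fn model(words: &[&str], vocab: &HashMap<&str, u32>) -> Encoding {
    let unk_id = vocab["[UNK]"];
    let mut tokens = Vec::new();
    let mut ids = Vec::new();
    for w in words {
        match vocab.get(*w) {
            Some(&id) => {
                tokens.push(w.to_string());
                ids.push(id);
            }
            None => {
                tokens.push("[UNK]".to_string());
                ids.push(unk_id);
            }
        }
    }
    Encoding { tokens, ids }
}

/// Step 4, PostProcessor: wrap the encoding in special tokens.
fn post_process(mut enc: Encoding, cls: (&str, u32), sep: (&str, u32)) -> Encoding {
    enc.tokens.insert(0, cls.0.to_string());
    enc.ids.insert(0, cls.1);
    enc.tokens.push(sep.0.to_string());
    enc.ids.push(sep.1);
    enc
}

/// Runs the four steps in order. Assumes `[CLS]` and `[SEP]` are in the vocabulary.
fn encode(text: &str, vocab: &HashMap<&str, u32>) -> Encoding {
    let normalized = normalize(text);
    let words = pre_tokenize(&normalized);
    let enc = model(&words, vocab);
    post_process(enc, ("[CLS]", vocab["[CLS]"]), ("[SEP]", vocab["[SEP]"]))
}

/// Tiny illustrative vocabulary (hypothetical ids).
fn toy_vocab() -> HashMap<&'static str, u32> {
    HashMap::from([
        ("[UNK]", 0),
        ("[CLS]", 1),
        ("[SEP]", 2),
        ("hello", 3),
        ("world", 4),
    ])
}

fn main() {
    let vocab = toy_vocab();
    let enc = encode("Hello  WORLD again", &vocab);
    println!("{:?}", enc);
}
```

Keeping each stage as a separate function mirrors the separation the documentation describes: any one step could be swapped (for example, a different pre-tokenization rule) without touching the others.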