small wording changes (#39)

* small wording changes

* fix formatting
Evan Pete Walsh
2020-01-07 05:33:59 -08:00
committed by GitHub
parent b06681cb1e
commit 49a67824ce
2 changed files with 18 additions and 20 deletions

@@ -1,22 +1,22 @@
 # Tokenizers
-Provides an implementation of today's most used tokenizers, with a focus on performances and
+Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 ## What is a Tokenizer
-A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
 `Encoding`.
 The various steps of the pipeline are:
-1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-the unicode normalization standards, such as `NFD` or `NFKC`.
-2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-splitting text is simply on whitespaces, to manipulate words.
-3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of
+splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 `BPE` or `WordPiece`.
-4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-that a language model would need, like special tokens.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+that, for example, a language model would need, such as special tokens.
 ## Bindings

@@ -2,27 +2,25 @@
 #![doc(html_favicon_url = "https://huggingface.co/favicon.ico")]
 #![doc(html_logo_url = "https://huggingface.co/landing/assets/huggingface_logo.svg")]
-//!
 //! # Tokenizers
 //!
-//! Provides an implementation of today's most used tokenizers, with a focus on performances and
+//! Provides an implementation of today's most used tokenizers, with a focus on performance and
 //! versatility.
 //!
 //! ## What is a Tokenizer
 //!
-//! A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+//! A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
 //! `Encoding`.
 //! The various steps of the pipeline are:
 //!
-//! 1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-//! the unicode normalization standards, such as `NFD` or `NFKC`.
-//! 2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-//! splitting text is simply on whitespaces, to manipulate words.
-//! 3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+//! 1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+//! the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+//! 2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of
+//! splitting text is simply on whitespace.
+//! 3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 //! `BPE` or `WordPiece`.
-//! 4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-//! that a language model would need, like special tokens.
-//!
+//! 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+//! that, for example, a language model would need, such as special tokens.
 #[macro_use]
 extern crate lazy_static;
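
The pipeline described in the doc comments above (`Normalizer` → `PreTokenizer` → `Model` → `PostProcessor`) can be wired together roughly as follows. This is a minimal sketch assuming the builder-style API shown in later `tokenizers` releases (`BPE::from_file`, `with_normalizer`, `with_pre_tokenizer`, `encode`); exact signatures at this revision may differ, and the vocab/merges paths are placeholders.

```rust
use tokenizers::models::bpe::BPE;
use tokenizers::normalizers::unicode::NFKC;
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // 3. The `Model`: a BPE model loaded from placeholder vocab/merges files.
    let bpe = BPE::from_file("vocab.json", "merges.txt").build()?;
    let mut tokenizer = Tokenizer::new(bpe);

    // 1. The `Normalizer` and 2. the `PreTokenizer` stages.
    tokenizer.with_normalizer(NFKC);
    tokenizer.with_pre_tokenizer(Whitespace::default());
    // 4. A `PostProcessor` (e.g. one that adds special tokens) could be
    //    attached in the same way.

    // Run the whole pipeline: raw text in, an `Encoding` out.
    let encoding = tokenizer.encode("Hey there, how are you?", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
```

Only the `Model` is required to build a `Tokenizer`; the other stages are optional and can be swapped independently.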