small wording changes (#39)

* small wording changes

* fix formatting
Author: Evan Pete Walsh (committed by GitHub)
Date: 2020-01-07 05:33:59 -08:00
parent b06681cb1e
commit 49a67824ce
2 changed files with 18 additions and 20 deletions


@@ -1,22 +1,22 @@
 # Tokenizers
-Provides an implementation of today's most used tokenizers, with a focus on performances and
+Provides an implementation of today's most used tokenizers, with a focus on performance and
 versatility.
 ## What is a Tokenizer
-A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
 `Encoding`.
 The various steps of the pipeline are:
-1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-the unicode normalization standards, such as `NFD` or `NFKC`.
-2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-splitting text is simply on whitespaces, to manipulate words.
-3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of
+splitting text is simply on whitespace.
+3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 `BPE` or `WordPiece`.
-4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-that a language model would need, like special tokens.
+4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+that, for example, a language model would need, such as special tokens.
 ## Bindings


@@ -2,27 +2,25 @@
 #![doc(html_favicon_url = "https://huggingface.co/favicon.ico")]
 #![doc(html_logo_url = "https://huggingface.co/landing/assets/huggingface_logo.svg")]
 //!
 //! # Tokenizers
 //!
-//! Provides an implementation of today's most used tokenizers, with a focus on performances and
+//! Provides an implementation of today's most used tokenizers, with a focus on performance and
 //! versatility.
 //!
 //! ## What is a Tokenizer
 //!
-//! A Tokenizer works as a pipeline, processing some raw text as input, to finally output an
+//! A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
 //! `Encoding`.
 //! The various steps of the pipeline are:
 //!
-//! 1. The `Normalizer` is in charge of normalizing the text. Common examples of Normalization are
-//! the unicode normalization standards, such as `NFD` or `NFKC`.
-//! 2. The `PreTokenizer` is in charge of splitting the text as relevant. The most common way of
-//! splitting text is simply on whitespaces, to manipulate words.
-//! 3. The `Model` is in charge of doing the actual tokenization. An example of `Model` would be
+//! 1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
+//! the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`.
+//! 2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of
+//! splitting text is simply on whitespace.
+//! 3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be
 //! `BPE` or `WordPiece`.
-//! 4. The `PostProcessor` is in charge of post processing the `Encoding`, to add anything relevant
-//! that a language model would need, like special tokens.
-//!
+//! 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
+//! that, for example, a language model would need, such as special tokens.
 #[macro_use]
 extern crate lazy_static;
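The four-stage pipeline the updated docs describe can be sketched as a toy end-to-end example. This is a hypothetical illustration only: the free functions, the char-count "model", and the special-token ids are stand-ins invented here, not the crate's actual `Normalizer`/`PreTokenizer`/`Model`/`PostProcessor` API.

```rust
// Toy sketch of the pipeline: Normalizer -> PreTokenizer -> Model -> PostProcessor.
// All names and behaviors here are illustrative stand-ins, not the real crate's traits.

/// Normalization stage (here just lowercasing, standing in for e.g. NFD/NFKC).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// Pre-tokenization: create initial word splits (here: on whitespace).
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

/// Model stage: map each word to a token id (here: a toy char-count "model",
/// standing in for BPE or WordPiece).
fn model(words: &[String]) -> Vec<u32> {
    words.iter().map(|w| w.len() as u32).collect()
}

/// Post-processing: wrap the ids with special tokens a language model would need.
fn post_process(mut ids: Vec<u32>) -> Vec<u32> {
    const CLS: u32 = 101; // hypothetical special-token ids
    const SEP: u32 = 102;
    let mut out = vec![CLS];
    out.append(&mut ids);
    out.push(SEP);
    out
}

fn main() {
    let encoding = post_process(model(&pre_tokenize(&normalize("Hello Tokenizers"))));
    println!("{:?}", encoding); // [101, 5, 10, 102]
}
```

Each stage feeds the next, which is why the docs call the whole thing a pipeline: swapping one stage (say, whitespace splitting for a different `PreTokenizer`) leaves the others untouched.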