diff --git a/README.md b/README.md
index 96c8da21..f9dfafb3 100644
--- a/README.md
+++ b/README.md
@@ -8,15 +8,15 @@
 vocabulary, and then process some text either in real time or in advance.
 A Tokenizer works as a pipeline taking some raw text as input, going through
 multiple steps to finally output a list of `Token`s. The various steps of the
 pipeline are:
- - Some optional `Normalizer`s. An example would be a Unicode normalization step. They take
-   some raw text as input, and also output raw text `String`.
- - An optional `PreTokenizer` which should take some raw text and take care of spliting
-   as relevant, and pre-processing tokens if needed. Takes a raw text `String` as input, and
-   outputs a `Vec`.
- - A `Model` to do the actual tokenization. An example of `Model` would be `BPE`. Takes
-   a `Vec` as input, and gives a `Vec`.
- - Some optional `PostProcessor`s. These are in charge of post processing the list of `Token`s
-   in any relevant way. This includes truncating, adding some padding, ...
+- Some optional `Normalizer`s. An example would be a Unicode normalization step. They take
+some raw text as input, and also output a raw text `String`.
+- An optional `PreTokenizer` which should take some raw text and take care of splitting
+as relevant, and pre-processing tokens if needed. Takes a raw text `String` as input, and
+outputs a `Vec`.
+- A `Model` to do the actual tokenization. An example of `Model` would be `BPE`. Takes
+a `Vec` as input, and gives a `Vec`.
+- Some optional `PostProcessor`s. These are in charge of post-processing the list of `Token`s
+in any relevant way. This includes truncating, adding some padding, ...
 
 ## Try the shell