Quick README update

Anthony MOI
2020-01-08 14:07:48 -05:00
parent 988159a998
commit bbe31f9237
2 changed files with 85 additions and 16 deletions


@@ -5,8 +5,7 @@ versatility.
## What is a Tokenizer
-A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
-`Encoding`.
A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
The various steps of the pipeline are:
1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
@@ -18,6 +17,17 @@ The various steps of the pipeline are:
4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
that a language model may need, such as special tokens (a conceptual sketch of the whole
pipeline follows this list).
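To make these steps concrete, here is a purely conceptual sketch of the pipeline in plain Python. The helper functions are invented for illustration (they are not this library's API), and the two intermediate steps are assumed to be the `PreTokenizer` and the `Model`, matching the modules used later in this document.

```python
# Conceptual sketch only: Normalizer -> PreTokenizer -> Model -> PostProcessor.
# None of these helpers belong to the `tokenizers` library.

def normalize(text: str) -> str:
    # Normalizer: e.g. lowercasing or unicode normalization
    return text.lower()

def pre_tokenize(text: str) -> list:
    # PreTokenizer: e.g. splitting the text into words
    return text.split()

def model(words: list) -> list:
    # Model: e.g. mapping each (sub)word to an id from a vocabulary
    fake_vocab = {word: i for i, word in enumerate(sorted(set(words)))}
    return [fake_vocab[word] for word in words]

def post_process(ids: list) -> list:
    # PostProcessor: e.g. adding the special tokens a language model expects
    cls_id, sep_id = -1, -2  # placeholder ids, for illustration only
    return [cls_id] + ids + [sep_id]

ids = post_process(model(pre_tokenize(normalize("Hello there, Tokenizers!"))))
print(ids)
```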
## Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the
original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
## Bindings
We provide bindings to the following languages (more to come!):


@@ -2,12 +2,25 @@
# Tokenizers
-A fast and easy to use implementation of today's most used tokenizers.
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.
-- High Level design: [master](https://github.com/huggingface/tokenizers)
Bindings over the [Rust](https://github.com/huggingface/tokenizers) implementation.
If you are interested in the high-level design, you can go check it there.
This API is currently being stabilized. We might introduce breaking changes quite often in the
coming days/weeks, so use it at your own risk.
Otherwise, let's dive in!
## Main features:
- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the
original sentence that corresponds to a given token (see the sketch after this list).
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
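As a small illustration of the alignment tracking mentioned in the list above, the sketch below assumes the `Encoding` returned by `encode` also exposes an `offsets` attribute holding one `(start, end)` character span per token; that attribute name and its exact semantics are an assumption here, not something this README documents.

```python
from tokenizers import BPETokenizer

# Hypothetical vocabulary files, as in the examples further below.
tokenizer = BPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Assumption: one (start, end) span per token, pointing back into the sentence.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])
```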
### Installation
@@ -19,18 +32,15 @@ pip install tokenizers
#### From sources:
-To use this method, you need to have the Rust nightly toolchain installed.
To use this method, you need to have Rust installed:
```bash
-# Install with:
-curl https://sh.rustup.rs -sSf | sh -s -- -default-toolchain nightly-2019-11-01 -y
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
-# Or select the right toolchain:
-rustup default nightly-2019-11-01
```
-Once Rust is installed and using the right toolchain you can do the following.
Once Rust is installed, you can compile by doing the following:
```bash
git clone https://github.com/huggingface/tokenizers
@@ -41,11 +51,59 @@ python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
-pip install maturin
-maturin develop --release
pip install setuptools_rust
python setup.py install
```
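As a quick sanity check that the build and install worked (this is not part of the original instructions, just a minimal import of a class used later in this README):

```python
# If this import succeeds, the compiled extension is installed in the current env.
from tokenizers import BPETokenizer

print(BPETokenizer)
```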
-### Usage
### Using the provided Tokenizers
Using a pre-trained tokenizer is really simple:
```python
from tokenizers import BPETokenizer
# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```
And you can train yours just as simply:
```python
from tokenizers import BPETokenizer
# Initialize a tokenizer
tokenizer = BPETokenizer()
# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")
# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
```
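A trained tokenizer can then be reloaded with the same constructor used above. The file names below assume that `save("./path/to/directory", "my-bpe")` writes `my-bpe-vocab.json` and `my-bpe-merges.txt` into that directory; this naming is an assumption, so check the directory for the files actually produced.

```python
from tokenizers import BPETokenizer

# Assumed output names of the `save` call above; adjust to the real files.
vocab = "./path/to/directory/my-bpe-vocab.json"
merges = "./path/to/directory/my-bpe-merges.txt"

reloaded = BPETokenizer(vocab, merges)
print(reloaded.encode("I can feel the magic, can you?").tokens)
```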
### Provided Tokenizers
- `BPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte-level version of BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
All of these can be used and trained as explained above!
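For example, here is a minimal sketch with the Bert tokenizer, assuming `BertWordPieceTokenizer` takes the path to a WordPiece vocabulary file as its first argument (the constructor signature is inferred from the pattern above, not documented here):

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical path to a standard BERT vocab.txt file.
tokenizer = BertWordPieceTokenizer("./path/to/bert/vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```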
### Build your own
You can also easily build your own tokenizer by putting together all the different parts
you need:
#### Use a pre-trained tokenizer
@@ -66,7 +124,8 @@ tokenizer.decoder = decoders.ByteLevel.new()
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
-print(encoded)
print(encoded.ids)
print(encoded.tokens)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([