Quick README update

Anthony MOI
2020-01-08 14:07:48 -05:00
parent 988159a998
commit bbe31f9237
2 changed files with 85 additions and 16 deletions


@@ -5,8 +5,7 @@ versatility.
## What is a Tokenizer
-A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
-`Encoding`.
A Tokenizer works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
The various steps of the pipeline are:
1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
@@ -18,6 +17,17 @@ The various steps of the pipeline are:
4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
that a language model may need, such as special tokens (a conceptual sketch of the whole
pipeline follows this list).
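To make these steps concrete, here is a purely conceptual sketch of the pipeline in plain Python. The helper functions are invented for illustration (they are not this library's API), and the two intermediate steps are assumed to be the `PreTokenizer` and the `Model`, matching the modules used later in this document.

```python
# Conceptual sketch only: Normalizer -> PreTokenizer -> Model -> PostProcessor.
# None of these helpers belong to the `tokenizers` library.

def normalize(text: str) -> str:
    # Normalizer: e.g. lowercasing or unicode normalization
    return text.lower()

def pre_tokenize(text: str) -> list:
    # PreTokenizer: e.g. splitting the text into words
    return text.split()

def model(words: list) -> list:
    # Model: e.g. mapping each (sub)word to an id from a vocabulary
    fake_vocab = {word: i for i, word in enumerate(sorted(set(words)))}
    return [fake_vocab[word] for word in words]

def post_process(ids: list) -> list:
    # PostProcessor: e.g. adding the special tokens a language model expects
    cls_id, sep_id = -1, -2  # placeholder ids, for illustration only
    return [cls_id] + ids + [sep_id]

ids = post_process(model(pre_tokenize(normalize("Hello there, Tokenizers!"))))
print(ids)
```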
## Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the
original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
## Bindings
We provide bindings to the following languages (more to come!):


@@ -2,12 +2,25 @@
# Tokenizers
-A fast and easy to use implementation of today's most used tokenizers.
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.
-- High Level design: [master](https://github.com/huggingface/tokenizers)
Bindings over the [Rust](https://github.com/huggingface/tokenizers) implementation.
If you are interested in the high-level design, you can go check it there.
This API is currently being stabilized. We might introduce breaking changes quite often in the
coming days/weeks, so use it at your own risk.
Otherwise, let's dive in!
## Main features:
- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the
original sentence that corresponds to a given token (see the sketch after this list).
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
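As a small illustration of the alignment tracking mentioned in the list above, the sketch below assumes the `Encoding` returned by `encode` also exposes an `offsets` attribute holding one `(start, end)` character span per token; that attribute name and its exact semantics are an assumption here, not something this README documents.

```python
from tokenizers import BPETokenizer

# Hypothetical vocabulary files, as in the examples further below.
tokenizer = BPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")

sentence = "I can feel the magic, can you?"
encoded = tokenizer.encode(sentence)

# Assumption: one (start, end) span per token, pointing back into the sentence.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(token, "->", sentence[start:end])
```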
### Installation
@@ -19,18 +32,15 @@ pip install tokenizers
#### From sources:
-To use this method, you need to have the Rust nightly toolchain installed.
To use this method, you need to have Rust installed:
```bash
-# Install with:
-curl https://sh.rustup.rs -sSf | sh -s -- -default-toolchain nightly-2019-11-01 -y
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
-# Or select the right toolchain:
-rustup default nightly-2019-11-01
```
-Once Rust is installed and using the right toolchain you can do the following.
Once Rust is installed, you can compile by doing the following:
```bash
git clone https://github.com/huggingface/tokenizers
@@ -41,11 +51,59 @@ python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
-pip install maturin
-maturin develop --release
pip install setuptools_rust
python setup.py install
```
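As a quick sanity check that the build and install worked (this is not part of the original instructions, just a minimal import of a class used later in this README):

```python
# If this import succeeds, the compiled extension is installed in the current env.
from tokenizers import BPETokenizer

print(BPETokenizer)
```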
-### Usage
### Using the provided Tokenizers
Using a pre-trained tokenizer is really simple:
```python
from tokenizers import BPETokenizer
# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```
And you can train yours just as simply:
```python
from tokenizers import BPETokenizer
# Initialize a tokenizer
tokenizer = BPETokenizer()
# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")
# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
```
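A trained tokenizer can then be reloaded with the same constructor used above. The file names below assume that `save("./path/to/directory", "my-bpe")` writes `my-bpe-vocab.json` and `my-bpe-merges.txt` into that directory; this naming is an assumption, so check the directory for the files actually produced.

```python
from tokenizers import BPETokenizer

# Assumed output names of the `save` call above; adjust to the real files.
vocab = "./path/to/directory/my-bpe-vocab.json"
merges = "./path/to/directory/my-bpe-merges.txt"

reloaded = BPETokenizer(vocab, merges)
print(reloaded.encode("I can feel the magic, can you?").tokens)
```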
### Provided Tokenizers
- `BPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte-level version of BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
All of these can be used and trained as explained above!
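For example, here is a minimal sketch with the Bert tokenizer, assuming `BertWordPieceTokenizer` takes the path to a WordPiece vocabulary file as its first argument (the constructor signature is inferred from the pattern above, not documented here):

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical path to a standard BERT vocab.txt file.
tokenizer = BertWordPieceTokenizer("./path/to/bert/vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```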
### Build your own
You can also easily build your own tokenizer by putting together all the different parts
you need:
#### Use a pre-trained tokenizer
@@ -66,7 +124,8 @@ tokenizer.decoder = decoders.ByteLevel.new()
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
-print(encoded)
print(encoded.ids)
print(encoded.tokens)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([