Mirror of https://github.com/mii443/tokenizers.git, synced 2025-09-02 23:39:14 +00:00
Quick README update
README.md  14
README.md  14
README.md
@@ -5,8 +5,7 @@ versatility.
 
 ## What is a Tokenizer
 
-A Tokenizer works as a pipeline, it processes some raw text as input and outputs an
-`Encoding`.
+A Tokenizer works as a pipeline, it processes some raw text as input and outputs an `Encoding`.
 The various steps of the pipeline are:
 
 1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are
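To make the pipeline described above concrete (normalization, pre-tokenization, the model itself, then post-processing), here is a minimal, self-contained sketch using the current `tokenizers` Python bindings. Note this is the present-day API rather than the older `.new()`-style API that appears later in this diff, and the toy training corpus and special-token names are purely illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# 3. The Model: a BPE model, trained below on a toy in-memory corpus
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# 1. The Normalizer: lowercase the raw text
tokenizer.normalizer = Lowercase()

# 2. The PreTokenizer: split on whitespace and punctuation before the model runs
tokenizer.pre_tokenizer = Whitespace()

# Train the model so the example runs end to end (illustrative corpus)
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 100, trainer=trainer)

# 4. The PostProcessor: add the special tokens a language model would expect
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

encoding = tokenizer.encode("I can feel the magic, can you?")
print(encoding.tokens)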
@@ -18,6 +17,17 @@ The various steps of the pipeline are:
 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
    that, for example, a language model would need, such as special tokens.
 
+## Main features:
+
+- Train new vocabularies and tokenize, using today's most used tokenizers.
+- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+  less than 20 seconds to tokenize a GB of text on a server's CPU.
+- Easy to use, but also extremely versatile.
+- Designed for research and production.
+- Normalization comes with alignments tracking. It's always possible to get the part of the
+  original sentence that corresponds to a given token.
+- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
+
 ## Bindings
 
 We provide bindings to the following languages (more to come!):
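The "alignments tracking" bullet above can be illustrated with `Encoding.offsets`, which maps every token back to a character span in the original input. A small sketch, again with the current Python API and a throwaway tokenizer trained on a toy corpus:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Throwaway tokenizer trained on a toy in-memory corpus so the example runs end to end
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 100,
                              trainer=BpeTrainer(special_tokens=["[UNK]"]))

sentence = "I can feel the magic, can you?"
encoding = tokenizer.encode(sentence)

# offsets holds one (start, end) character span per token, into the original text
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token!r} -> {sentence[start:end]!r}")
```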
README.md
@@ -2,12 +2,25 @@
 
 # Tokenizers
 
-A fast and easy to use implementation of today's most used tokenizers.
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
 
-- High Level design: [master](https://github.com/huggingface/tokenizers)
+Bindings over the [Rust](https://github.com/huggingface/tokenizers) implementation.
+If you are interested in the High-level design, you can go check it there.
 
-This API is currently in the process of being stabilized. We might introduce breaking changes
-really often in the coming days/weeks, so use at your own risks.
+Otherwise, let's dive in!
+
+## Main features:
+
+- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
+  most common BPE versions).
+- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+  less than 20 seconds to tokenize a GB of text on a server's CPU.
+- Easy to use, but also extremely versatile.
+- Designed for research and production.
+- Normalization comes with alignments tracking. It's always possible to get the part of the
+  original sentence that corresponds to a given token.
+- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
 
 ### Installation
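To make the "Truncate, Pad, add the special tokens" bullet from the feature list concrete, here is a rough sketch of `enable_truncation` and `enable_padding` from the current Python bindings; the `[PAD]`/`[UNK]` tokens and the toy corpus are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy tokenizer trained in memory; [PAD] is reserved so enable_padding can reference it
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 100, trainer=trainer)

# Truncate every sequence to at most 8 tokens, and pad shorter ones in a batch
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

batch = tokenizer.encode_batch(["can you", "I can feel the magic, can you?"])
for encoding in batch:
    print(encoding.tokens, encoding.attention_mask)
```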
@@ -19,18 +32,15 @@ pip install tokenizers
 
 #### From sources:
 
-To use this method, you need to have the Rust nightly toolchain installed.
+To use this method, you need to have Rust installed:
 
 ```bash
 # Install with:
-curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
+curl https://sh.rustup.rs -sSf | sh -s -- -y
 export PATH="$HOME/.cargo/bin:$PATH"
-
-# Or select the right toolchain:
-rustup default nightly-2019-11-01
 ```
 
-Once Rust is installed and using the right toolchain you can do the following.
+Once Rust is installed, you can compile by doing the following:
 
 ```bash
 git clone https://github.com/huggingface/tokenizers
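After the build and install steps above, one quick sanity check is to import the package from the same virtual env; a minimal sketch (the printed version depends on the checkout you built):

```python
# Run from inside the virtual env where the package was built and installed
import tokenizers

print(tokenizers.__version__)
```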
@@ -41,11 +51,59 @@ python -m venv .env
 source .env/bin/activate
 
 # Install `tokenizers` in the current virtual env
-pip install maturin
-maturin develop --release
+pip install setuptools_rust
+python setup.py install
 ```
 
-### Usage
+### Using the provided Tokenizers
+
+Using a pre-trained tokenizer is really simple:
+
+```python
+from tokenizers import BPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = BPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
+
+And you can train yours just as simply:
+
+```python
+from tokenizers import BPETokenizer
+
+# Initialize a tokenizer
+tokenizer = BPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# And you can use it
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory", "my-bpe")
+```
+
+### Provided Tokenizers
+
+- `BPETokenizer`: The original BPE
+- `ByteLevelBPETokenizer`: The byte level version of the BPE
+- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
+
+### Build your own
+
+You can also easily build your own tokenizers, by putting all the different parts
+you need together:
 
 #### Use a pre-trained tokenizer
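The "Build your own" section continues in the next hunk with the snapshot's older `.new()`-style API. As a rough modern equivalent, the sketch below assembles a byte-level BPE tokenizer from its parts with the current Python bindings and trains it on a toy in-memory corpus (swap in `tokenizer.train([...])` with real files for actual use):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Start from an empty BPE model and plug the byte-level parts together
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevelPreTokenizer()
tokenizer.decoder = ByteLevelDecoder()

# Train on a toy in-memory corpus (illustrative only)
trainer = BpeTrainer(vocab_size=1000, initial_alphabet=ByteLevelPreTokenizer.alphabet())
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 100, trainer=trainer)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
print(tokenizer.decode(encoded.ids))
```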
@@ -66,7 +124,8 @@ tokenizer.decoder = decoders.ByteLevel.new()
 
 # And then encode:
 encoded = tokenizer.encode("I can feel the magic, can you?")
-print(encoded)
+print(encoded.ids)
+print(encoded.tokens)
 
 # Or tokenize multiple sentences at once:
 encoded = tokenizer.encode_batch([
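The hunk above ends mid-way through the `encode_batch` call; for reference, a complete batch round trip with the current Python bindings looks roughly like this (same toy setup as the earlier sketches):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Throwaway tokenizer trained on a toy corpus, just to make the batch call runnable
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(["I can feel the magic, can you?"] * 100,
                              trainer=BpeTrainer(special_tokens=["[UNK]"]))

# encode_batch takes a list of inputs and returns one Encoding per input
encodings = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "can you feel it too?",
])
for encoding in encodings:
    print(encoding.ids, encoding.tokens)

# decode_batch turns the id sequences back into strings
print(tokenizer.decode_batch([e.ids for e in encodings]))
```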