mirror of https://github.com/mii443/tokenizers.git synced 2025-09-03 15:59:25 +00:00

README.md

Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are Python bindings over the Rust implementation. If you are interested in the high-level design, you can find it described there.

Otherwise, let's dive in!

Main features:

Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
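
To make the last two features more concrete, here is a minimal pure-Python sketch of the ideas behind them. It is not the library's implementation: the helpers `tokenize_with_offsets` and `truncate_and_pad` are hypothetical names introduced for illustration, the normalization is a plain lowercasing, and the tokenization is a simple whitespace split.

```python
import re


def tokenize_with_offsets(text):
    """Lowercase each whitespace-separated word and remember where it came from."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group().lower())          # normalized token
        offsets.append((match.start(), match.end()))  # span in the ORIGINAL text
    return tokens, offsets


def truncate_and_pad(tokens, length, pad_token="[PAD]"):
    """Cut the sequence down to `length`, or fill it up with `pad_token`."""
    tokens = tokens[:length]
    return tokens + [pad_token] * (length - len(tokens))


text = "I can FEEL the magic"
tokens, offsets = tokenize_with_offsets(text)

# The token was normalized, but its offsets still point at the original text.
start, end = offsets[2]
print(tokens[2], "->", text[start:end])  # feel -> FEEL

print(truncate_and_pad(tokens, 3))  # ['i', 'can', 'feel']
print(truncate_and_pad(tokens, 7))  # ['i', 'can', 'feel', 'the', 'magic', '[PAD]', '[PAD]']
```

Keeping the offsets next to the tokens is what allows a normalized token to be mapped back to the exact characters it came from.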

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some vocab.json and merges.txt files:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
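
For readers wondering what training a BPE vocabulary involves, the following is a toy sketch of the classic merge loop: count adjacent symbol pairs, merge the most frequent one, and repeat. It is an illustration under simplifying assumptions, not the library's Rust implementation. The function names are hypothetical, ties are broken alphabetically so the result is deterministic, and details such as end-of-word suffixes or a minimum frequency are left out.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    if not pairs:
        return None
    # Highest count first; ties go to the alphabetically smallest pair.
    return min(pairs, key=lambda pair: (-pairs[pair], pair))


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The learned merges play the role of the merges.txt file, and the symbols that remain form the vocabulary.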

Provided Tokenizers

CharBPETokenizer: The original BPE
ByteLevelBPETokenizer: The byte level version of the BPE
SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
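
WordPiece differs from BPE mainly in how a word is split at encoding time. As it is commonly described for BERT, each word is split greedily, always taking the longest piece found in the vocabulary, and pieces that continue a word carry a "##" prefix. The sketch below illustrates that idea only. The function `wordpiece` and the tiny vocabulary are hypothetical, and the library's own implementation may differ in its details.

```python
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split `word` greedily into the longest pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##aff", "##able", "##a", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']
print(wordpiece("xyz", vocab))        # ['[UNK]']
```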

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
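
As background for the byte-level pre-tokenizer used above: a byte-level BPE starts from the 256 possible byte values instead of from Unicode characters, so any input string can be represented without an unknown token. The sketch below shows only that starting point, using Python's built-in UTF-8 encoding. The helpers `to_bytes` and `from_bytes` are hypothetical, and the sketch leaves out the mapping of bytes to printable characters that byte-level tokenizers typically apply.

```python
def to_bytes(text):
    """Turn a string into its UTF-8 byte values (each between 0 and 255)."""
    return list(text.encode("utf-8"))


def from_bytes(byte_ids):
    """Rebuild the original string from its byte values."""
    return bytes(byte_ids).decode("utf-8")


ids = to_bytes("café")
print(ids)              # [99, 97, 102, 195, 169]
print(from_bytes(ids))  # café
```

The accented character takes two bytes, which is why the sequence is one element longer than the string. BPE merges learned on top of these bytes can later join such byte pairs back into single tokens.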