Mirror of https://github.com/mii443/tokenizers.git
Latest commit cefc41e8ec57a7cdf8e742e5d673076464d0270d: implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the Unigram trainer, where tokens longer than a certain length (default 16 in SentencePiece) are skipped. This already exists in the Unigram trainer, but it is implemented in a different way. For this code to be fully integrated, some work remains: document the behavior and how the option should be set, keep the default at 0 so it has no effect unless explicitly set, and expose the option in the Python bindings so users can set the maximum token length. I tried to implement max_sentencepiece_length through pre-tokenizer split rules, but to be honest it is very difficult, and regexes can be really slow when operating on the whole training corpus.

* Use Option<u16> for safer code.
* Other version.
* Update trainer.rs: clarify with type usize; propagate the max_length option.
* Rename max_length to something more descriptive: in the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar option. Since the underlying concept in BPE is merges, max_merge_length could be a more descriptive variable name.
* Change the variable name in trainer.rs from max_merge_length to max_token_length.
* Update trainer.rs: add the max_token_length declarations that were missing from impl BpeTrainerBuilder and struct BpeTrainer; add an explanation of the variable shadowing.
* Update trainer.rs: move the default definition of max_token_length to the proper location and adjust the downstream variable initializations accordingly.
* Add a max_token_length test.
* Add a direct BPE assert test.
* Update trainer.rs: clarify the test documentation.
* Create the bindings.
* Fix the default.
* Re-add the missing package-lock that was accidentally removed.
* Fix the trainer test.
* Fix.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
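The option described in this commit ends up as an argument on the BPE trainer. A minimal sketch of how it might be used from the Python bindings, assuming your installed version exposes the max_token_length parameter (the file name and vocab_size below are only placeholders):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Cap merged tokens at 16 characters, mirroring SentencePiece's default limit;
# leaving the option unset keeps the old, unrestricted behavior.
trainer = BpeTrainer(vocab_size=30000, max_token_length=16)
tokenizer.train(files=["corpus.txt"], trainer=trainer)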
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs (see the sketch just below this list).
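As a quick illustration of the truncation and padding mentioned above, a minimal sketch assuming tokenizer is an already-built Tokenizer instance (the max_length, pad_id, and pad_token values are only placeholders):

# Truncate every encoded sequence to at most 512 tokens.
tokenizer.enable_truncation(max_length=512)
# Pad shorter sequences in a batch with "[PAD]".
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")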
Bindings
We provide bindings to the following languages (more to come!):
- Rust (original implementation)
- Python
- Node.js
Quick example using Python:
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE())
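The other models are instantiated the same way. A minimal sketch of the WordPiece and Unigram variants (the unk_token value is only an example):

from tokenizers.models import WordPiece, Unigram

# WordPiece needs to know which token to emit for unknown inputs.
wordpiece_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Unigram can be created empty and filled in during training.
unigram_tokenizer = Tokenizer(Unigram())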
You can customize how pre-tokenization (e.g., splitting into words) is done:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
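To see how a pre-tokenizer splits a string before committing to it, you can call it directly on raw text; a minimal sketch, assuming your version of the bindings exposes pre_tokenize_str:

# Each entry is a (piece, (start, end)) pair with character offsets.
print(Whitespace().pre_tokenize_str("Hello, y'all!"))
# e.g. [('Hello', (0, 5)), (',', (5, 6)), ('y', (7, 8)), ("'", (8, 9)), ('all', (9, 12)), ('!', (12, 13))]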
Then, training your tokenizer on a set of files takes just two lines of code:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
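A trained tokenizer can be serialized to a single JSON file and reloaded later; a minimal sketch (the file name is only an example):

# Save the whole pipeline (model, pre-tokenizer, etc.) to one file.
tokenizer.save("tokenizer.json")
# Reload it elsewhere, e.g. in production.
tokenizer = Tokenizer.from_file("tokenizer.json")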
Once your tokenizer is trained, encode any text with just one line:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
Check the python documentation or the python quicktour to learn more!
Languages: Rust 72.3%, Python 20%, Jupyter Notebook 4.5%, TypeScript 2.3%, JavaScript 0.4%, Other 0.5%