tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-12-03 11:18:29 +00:00

README.md

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

Train new vocabularies and tokenize, using today's most used tokenizers.
Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token.
Handles all the pre-processing: truncating, padding, and adding the special tokens your model needs.
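
As a rough illustration of the pre-processing features listed above, the sketch below enables truncation, padding and special tokens on a tiny tokenizer. The vocabulary, the sentences and the length limit are made up for the example, and a `WordLevel` model with a hand-written vocabulary stands in for a trained model so that nothing has to be trained first.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Hypothetical toy vocabulary, written by hand so that no training is needed.
vocab = {
    "[UNK]": 0, "[PAD]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "cat": 5, "sat": 6, "on": 7, "mat": 8,
}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Add the special tokens: wrap every single sequence as "[CLS] ... [SEP]".
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", vocab["[CLS]"]), ("[SEP]", vocab["[SEP]"])],
)

# Truncate: keep at most 6 tokens per sequence, special tokens included.
tokenizer.enable_truncation(max_length=6)

# Pad: fill shorter sequences up to the longest one in the batch.
tokenizer.enable_padding(pad_id=vocab["[PAD]"], pad_token="[PAD]")

batch = tokenizer.encode_batch(["the cat", "the cat sat on the mat"])
for encoding in batch:
    print(encoding.tokens, encoding.attention_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so a model can ignore the padded positions.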

Bindings

We provide bindings to the following languages (more to come!):

Rust (Original implementation)
Python
Node.js
Ruby (Contributed by @ankane, external repo)

Quick example using Python:

Choose your model from Byte-Pair Encoding, WordPiece, or Unigram and instantiate a tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE

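# unk_token is needed for the "[UNK]" token shown in the encoding output further down
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))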

You can customize how pre-tokenization (e.g., splitting into words) is done:

from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
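
To see what a pre-tokenizer does before plugging it into a tokenizer, you can call it directly on a string. The sketch below compares `Whitespace` with `WhitespaceSplit`; the sample sentence is arbitrary, and the second pre-tokenizer is only there for contrast.

```python
from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

sentence = "Hello, y'all!"

# Whitespace splits on whitespace and also separates punctuation from words.
# Each piece comes with its (start, end) character offsets in the input.
pieces = Whitespace().pre_tokenize_str(sentence)
for piece, (start, end) in pieces:
    print(f"{piece!r} covers characters {start} to {end}")

# WhitespaceSplit only splits on whitespace, so punctuation stays attached.
coarse_pieces = WhitespaceSplit().pre_tokenize_str(sentence)
print(coarse_pieces)
```

Which splitting rule fits best depends on the model and the data, so it is worth inspecting the pieces on a few representative sentences.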

Then training your tokenizer on a set of files takes just two lines of code:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
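
If the training text is already in memory rather than in files, `train_from_iterator` can be used instead. The following is a minimal sketch: the three-sentence corpus and the `vocab_size` value are placeholders chosen for the example, far too small for a useful vocabulary.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Made-up corpus; any iterator that yields strings works here.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Tokenizers split text into smaller pieces.",
    "Training learns which pieces occur most often.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=100,  # arbitrary small limit for the example
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,  # keep the output quiet
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Special tokens receive the first ids, in the order they were listed.
print(tokenizer.token_to_id("[UNK]"), tokenizer.token_to_id("[MASK]"))
print(tokenizer.get_vocab_size())
```

Because the special tokens are inserted first, their order in the list determines their ids.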

Once your tokenizer is trained, encode any text with just one line:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
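
The encoding also carries the offsets of each token, which is how alignment tracking is exposed. The sketch below maps every token back to the text it came from. To stay runnable without training files, it swaps in a `WordLevel` model with a hand-written vocabulary (an assumption made for this example) rather than the trained BPE model above.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical vocabulary covering every piece except the emoji.
words = ["[UNK]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "?"]
vocab = {word: index for index, word in enumerate(words)}

tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

text = "Hello, y'all! How are you 😁 ?"
output = tokenizer.encode(text)

# Each offset is a (start, end) pair of character positions in `text`.
for token, (start, end) in zip(output.tokens, output.offsets):
    print(f"{token:>6} <- {text[start:end]!r}")

# Even an unknown token can be traced back to the original characters.
unk_start, unk_end = output.offsets[output.tokens.index("[UNK]")]
original = text[unk_start:unk_end]
print(original)
```

This is useful for tasks such as question answering or entity tagging, where predictions on tokens have to be mapped back to spans of the input.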

Check the Python documentation or the Python quicktour to learn more!
