* implement a simple max_sentencepiece_length into BPE: add a way for the BPE trainer to behave like the unigram trainer, where tokens longer than a certain length (default 16 in SentencePiece) are skipped. This is implemented in the unigram trainer, but in a different way. If this code were to be actually integrated, some work remains to be done: documentation describing the behavior and how it should be set; set the default to 0 so it does nothing unless set; provide ways in the Python bindings for the user to set the max token length. I was trying to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it is very difficult, and regexes can be really slow when operating on the whole training corpus.
* utilize Option<u16> for safer code.
* Other version.
* Update trainer.rs: clarify with type usize; propagate the max_length option.
* change max_length into a more descriptive name: in the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function. Since the underlying concept of BPE is merges, using max_merge_length as the variable name could prove more descriptive.
* change variable name in trainer.rs: change max_merge_length into max_token_length.
* Update trainer.rs: add several max_token_length declarations that were missing (impl BpeTrainerBuilder, struct BpeTrainer); add an explanation for variable shadowing.
* Update trainer.rs: move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.
* add max_token_length test.
* Add BPE direct assert test.
* Update trainer.rs: clarified test documentation.
* Creating the bindings.
* Fix the default.
* Re-adding missing package-lock which I accidentally removed.
* ..
* Fixing trainer test.
* Fix.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
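As a rough illustration of the trainer change described above: a merge candidate is skipped whenever the token it would produce is longer than the configured maximum, and a maximum of 0 leaves training behavior unchanged. This is a concept-only sketch in TypeScript (to match the rest of this README), not the actual implementation, which lives in the Rust trainer.rs; the function name and the plain character-length check are illustrative assumptions.

// Concept sketch only: mirrors the described max_token_length behavior, not the real trainer code.
function shouldKeepMerge(left: string, right: string, maxTokenLength: number): boolean {
  // 0 is the proposed default and means "no limit", so training behaves as before.
  if (maxTokenLength === 0) return true;
  // Otherwise, skip any merge whose resulting token would exceed the limit.
  return (left + right).length <= maxTokenLength;
}

console.log(shouldKeepMerge("token", "izer", 16));            // true: "tokenizer" is short enough
console.log(shouldKeepMerge("internationali", "zation", 16)); // false: the merged token has 20 characters
console.log(shouldKeepMerge("internationali", "zation", 0));  // true: 0 disables the check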
NodeJS implementation of today's most used tokenizers, with a focus on performance and versatility. Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.
Main features
- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the sketch after the basic example below).
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
Installation
npm install tokenizers@latest
Basic example
import { BertWordPieceTokenizer } from "tokenizers";

// Load a Bert WordPiece tokenizer from an existing vocabulary file
const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });

// Encode a pair of sequences (question, context)
const wpEncoded = await wordPieceTokenizer.encode("Who is John?", "John is a teacher");

// Inspect the resulting encoding
console.log(wpEncoded.length);
console.log(wpEncoded.tokens);
console.log(wpEncoded.ids);
console.log(wpEncoded.attentionMask);
console.log(wpEncoded.offsets);
console.log(wpEncoded.overflowing);
console.log(wpEncoded.specialTokensMask);
console.log(wpEncoded.typeIds);
console.log(wpEncoded.wordIndexes);
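To make the alignment-tracking feature concrete, here is a short sketch that builds on the basic example above: the offsets of the encoding map each token back to the character span it came from, and typeIds indicate which of the two input sequences a token belongs to. The token index used below is an arbitrary position chosen for illustration.

// Map a token back to the original text using its offsets (continues from the example above).
const question = "Who is John?";
const context = "John is a teacher";

// Pick an arbitrary token position (index 0 is the [CLS] special token, so start at 1).
const i = 1;
const [start, end] = wpEncoded.offsets[i];

// typeIds tells us whether the token comes from the first or second sequence,
// so we know which original string the offsets refer to.
const source = wpEncoded.typeIds[i] === 0 ? question : context;

console.log(wpEncoded.tokens[i]);      // the token itself
console.log(source.slice(start, end)); // the original text span it was produced from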
Provided Tokenizers
- BPETokenizer: The original BPE
- ByteLevelBPETokenizer: The byte level version of the BPE
- SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
- BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece
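For the BPE-based tokenizers in this list, a sketch of how instantiation might look, assuming a fromOptions factory like the one used for BertWordPieceTokenizer in the basic example; the exact option names (vocabFile, mergesFile) are assumptions based on the files a BPE model typically needs, so check the API of the version you install.

import { ByteLevelBPETokenizer } from "tokenizers";

// Assumption: like BertWordPieceTokenizer above, the BPE-based tokenizers expose a
// fromOptions factory; a BPE model is defined by a vocabulary and a merges file.
const bpeTokenizer = await ByteLevelBPETokenizer.fromOptions({
  vocabFile: "./vocab.json",
  mergesFile: "./merges.txt",
});

const encoded = await bpeTokenizer.encode("Hello there!");
console.log(encoded.tokens);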