tokenizers

mirror of https://github.com/mii443/tokenizers.git synced 2025-08-22 16:25:30 +00:00

README.md

Node.js implementation of today's most used tokenizers, with a focus on performance and versatility. These are bindings over the Rust implementation. If you are interested in the high-level design, see the Rust implementation's documentation.

Main features

Train new vocabularies and tokenize using four pre-made tokenizers (BERT WordPiece and the three most common BPE versions).
Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
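
To make the truncation and padding feature concrete, here is a minimal hand-rolled sketch of what those two steps do to a list of token ids. It is not the library's code or API: the function name `truncateAndPad` and the parameters `maxLength` and `padId` are hypothetical, chosen only for illustration.

```typescript
// Illustrative only: truncation and padding applied to token ids by hand.
// `truncateAndPad`, `maxLength` and `padId` are hypothetical names, not library API.
function truncateAndPad(
  ids: number[],
  maxLength: number,
  padId: number
): { ids: number[]; attentionMask: number[] } {
  // Truncation: keep at most `maxLength` ids, dropping the tail.
  const kept = ids.slice(0, maxLength);
  // Attention mask: 1 marks a real token, 0 marks padding.
  const attentionMask = kept.map(() => 1);
  // Padding: append `padId` until the sequence reaches `maxLength`.
  while (kept.length < maxLength) {
    kept.push(padId);
    attentionMask.push(0);
  }
  return { ids: kept, attentionMask };
}

const padded = truncateAndPad([101, 2040, 102], 5, 0);
console.log(padded.ids);           // [101, 2040, 102, 0, 0]
console.log(padded.attentionMask); // [1, 1, 1, 0, 0]
```

Every sequence in a batch ends up with the same length, and the attention mask lets a model ignore the padded positions.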

Installation

npm install tokenizers@latest

Basic example

// Note: top-level `await` requires an ES module context.
import { BertWordPieceTokenizer } from "tokenizers";

// Load a BERT WordPiece tokenizer from a local vocabulary file.
const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });
// Encode a pair of sequences.
const wpEncoded = await wordPieceTokenizer.encode("Who is John?", "John is a teacher");

// Inspect the fields of the resulting encoding.
console.log(wpEncoded.length);
console.log(wpEncoded.tokens);
console.log(wpEncoded.ids);
console.log(wpEncoded.attentionMask);
console.log(wpEncoded.offsets);
console.log(wpEncoded.overflowing);
console.log(wpEncoded.specialTokensMask);
console.log(wpEncoded.typeIds);
console.log(wpEncoded.wordIndexes);
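
The `offsets` field is what makes alignment tracking usable in practice. The sketch below assumes each offset is a `[start, end)` pair of character indices into the original string; the helper `spansFromOffsets` and the hand-written offsets are hypothetical and stand in for values taken from a real encoding.

```typescript
// Hypothetical helper: map each [start, end) offset pair back to a slice of the input.
type Offset = [number, number];

function spansFromOffsets(text: string, offsets: Offset[]): string[] {
  return offsets.map(([start, end]) => text.slice(start, end));
}

// Hand-written offsets for illustration; real values would come from an encoding.
const sampleText = "Who is John?";
const sampleOffsets: Offset[] = [[0, 3], [4, 6], [7, 11], [11, 12]];
console.log(spansFromOffsets(sampleText, sampleOffsets)); // ["Who", "is", "John", "?"]
```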

Provided Tokenizers

BPETokenizer: The original BPE
ByteLevelBPETokenizer: The byte level version of the BPE
SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
BertWordPieceTokenizer: The famous BERT tokenizer, using WordPiece
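
For readers unfamiliar with WordPiece, the following toy sketch shows the greedy longest-match-first segmentation that BERT-style WordPiece applies to a single word. It is a simplified illustration under stated assumptions, not the Rust implementation: the function `wordPiece` and the tiny vocabulary are hypothetical, while the `##` continuation prefix and the `[UNK]` token follow BERT's conventions.

```typescript
// Toy WordPiece-style segmentation: greedy longest-match-first over a small vocabulary.
// A sketch for illustration only, not the library's code.
function wordPiece(word: string, vocab: Set<string>, unk = "[UNK]"): string[] {
  const pieces: string[] = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let match: string | null = null;
    // Shrink the candidate from the right until it is found in the vocabulary.
    while (end > start) {
      // Pieces that do not start the word carry the "##" continuation prefix.
      const candidate = (start > 0 ? "##" : "") + word.slice(start, end);
      if (vocab.has(candidate)) {
        match = candidate;
        break;
      }
      end -= 1;
    }
    // No piece matched at this position: the whole word becomes the unknown token.
    if (match === null) return [unk];
    pieces.push(match);
    start = end;
  }
  return pieces;
}

const toyVocab = new Set(["teach", "##er", "##ing", "john"]);
console.log(wordPiece("teacher", toyVocab));  // ["teach", "##er"]
console.log(wordPiece("teaching", toyVocab)); // ["teach", "##ing"]
console.log(wordPiece("student", toyVocab));  // ["[UNK]"]
```

The BPE variants listed above differ in how the vocabulary is built and applied (learned merge rules rather than longest match), so this sketch covers only the WordPiece case.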

License

Apache License 2.0