tokenizers

Node.js implementation of today's most used tokenizers, with a focus on performance and versatility. These are bindings over the Rust implementation. If you are interested in the high-level design, see the documentation of the Rust implementation.

Main features

Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignment tracking: it is always possible to get the part of the original sentence that corresponds to a given token.
Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
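
To make the pre-processing step concrete, here is a minimal, library-free sketch of truncating, adding special tokens, and padding a sequence of token ids. It is not the library's implementation: `prepareInput`, `PreparedInput`, and the three id constants are hypothetical names and values chosen for illustration, and the choice to flag padding positions in the special-tokens mask is an assumption of this sketch.

```typescript
// Hypothetical special-token ids, for illustration only.
// Real ids come from the vocabulary your model was trained with.
const CLS_ID = 101;
const SEP_ID = 102;
const PAD_ID = 0;

interface PreparedInput {
  ids: number[];
  attentionMask: number[];
  specialTokensMask: number[];
}

// Truncate so that [CLS] and [SEP] still fit, add them, then pad up to maxLength.
function prepareInput(tokenIds: number[], maxLength: number): PreparedInput {
  if (maxLength < 2) {
    throw new Error("maxLength must leave room for the two special tokens");
  }
  const kept = tokenIds.slice(0, maxLength - 2);
  const ids = [CLS_ID, ...kept, SEP_ID];
  const lastIndex = ids.length - 1;

  // Real tokens (including [CLS] and [SEP]) are attended to.
  const attentionMask = ids.map(() => 1);
  // Only the first and last positions hold special tokens so far.
  const specialTokensMask = ids.map((_, i) => (i === 0 || i === lastIndex ? 1 : 0));

  // Padding is not attended to; this sketch also flags it as special.
  while (ids.length < maxLength) {
    ids.push(PAD_ID);
    attentionMask.push(0);
    specialTokensMask.push(1);
  }
  return { ids, attentionMask, specialTokensMask };
}

console.log(prepareInput([7, 8, 9], 6)); // short input: padded
console.log(prepareInput([7, 8, 9, 10, 11], 4)); // long input: truncated
```

In the first call the three ids fit, so one padding id is appended; in the second, only the first two ids survive truncation.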

Installation

npm install tokenizers

Basic example

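import { BertWordPieceTokenizer } from "tokenizers";

// Top-level await: run this inside an ES module, or wrap it in an async function.
// Load a BERT WordPiece tokenizer from a local vocabulary file.
const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });
// Encode a pair of sequences (here, a question and its context).
const wpEncoded = await wordPieceTokenizer.encode("Who is John?", "John is a teacher");

console.log(wpEncoded.getTokens()); // token strings
console.log(wpEncoded.getIds()); // vocabulary ids
console.log(wpEncoded.getAttentionMask()); // attention mask
console.log(wpEncoded.getOffsets()); // offsets into the original text
console.log(wpEncoded.getOverflowing()); // encodings that overflowed on truncation
console.log(wpEncoded.getSpecialTokensMask()); // marks special tokens
console.log(wpEncoded.getTypeIds()); // sequence (segment) ids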
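
The offsets are what make alignment tracking useful: even after normalization changes a token's text, its offsets still point at the matching span of the original sentence. The sketch below illustrates the idea with a toy tokenizer and does not use the library. `toyTokenize`, `originalSpan`, and `TokenWithOffsets` are hypothetical names, and the `[start, end)` offset convention is an assumption of this sketch.

```typescript
interface TokenWithOffsets {
  token: string; // normalized (lowercased) token text
  offsets: [number, number]; // [start, end) in the ORIGINAL string
}

// Toy tokenizer: splits into runs of letters/digits and single punctuation
// marks, lowercases each piece, and records where it came from.
function toyTokenize(text: string): TokenWithOffsets[] {
  const result: TokenWithOffsets[] = [];
  const pattern = /[A-Za-z0-9]+|[^\sA-Za-z0-9]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const piece = match[0];
    result.push({
      token: piece.toLowerCase(),
      offsets: [match.index, match.index + piece.length],
    });
  }
  return result;
}

// Recover the original text for a token from its offsets.
function originalSpan(text: string, offsets: [number, number]): string {
  return text.slice(offsets[0], offsets[1]);
}

const sentence = "Who is John?";
for (const { token, offsets } of toyTokenize(sentence)) {
  // The token is lowercased, but the recovered span keeps the original casing.
  console.log(token, offsets, originalSpan(sentence, offsets));
}
```

Here the token `john` carries the offsets `[7, 11]`, and slicing the original sentence with them gives back `John`.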

Provided Tokenizers

BPETokenizer: The original BPE
ByteLevelBPETokenizer: The byte level version of the BPE
SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece
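
Three of the four tokenizers above are BPE variants. As a rough illustration of what BPE does at tokenization time, the sketch below applies an ordered list of learned merges to a single word. It is a simplified, library-free sketch and not how the Rust implementation is written: `applyBpeMerges` and `toyMerges` are hypothetical names, and the merge list is made up for the example.

```typescript
// Apply an ordered list of learned merges to one word, BPE-style.
// Assumes symbols never contain a space, since a space separates the pair key.
function applyBpeMerges(word: string, merges: Array<[string, string]>): string[] {
  // Earlier merges were learned first, so a lower index means higher priority.
  const rank = new Map<string, number>();
  merges.forEach(([left, right], i) => rank.set(`${left} ${right}`, i));

  let symbols = Array.from(word); // start from individual characters
  while (symbols.length > 1) {
    // Find the adjacent pair with the best (lowest) rank.
    let bestIndex = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = rank.get(`${symbols[i]} ${symbols[i + 1]}`);
      if (r !== undefined && r < bestRank) {
        bestRank = r;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no learned merge applies any more

    // Replace the two symbols at bestIndex with their concatenation.
    symbols = [
      ...symbols.slice(0, bestIndex),
      symbols.slice(bestIndex, bestIndex + 2).join(""),
      ...symbols.slice(bestIndex + 2),
    ];
  }
  return symbols;
}

// Made-up merge list, in the order the merges would have been learned.
const toyMerges: Array<[string, string]> = [["l", "o"], ["lo", "w"], ["e", "r"]];
console.log(applyBpeMerges("lower", toyMerges)); // [ 'low', 'er' ]
console.log(applyBpeMerges("slow", toyMerges)); // [ 's', 'low' ]
```

The byte-level variant follows the same merge idea but starts from bytes rather than characters, which lets it represent any input without an unknown token.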

License

Apache License 2.0