NodeJS implementation of today's most used tokenizers, with a focus on performance and versatility. These are bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Main features

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs (see the sketch after this list).
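
As an illustration of the pre-processing step, here is a minimal sketch of truncation and padding. It assumes the bindings expose setTruncation and setPadding methods mirroring the Python API of this library; the exact method names and option shapes are assumptions, not verified against this snapshot:

import { BertWordPieceTokenizer } from "tokenizers";

const tokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });

// Assumption: setTruncation/setPadding mirror the Python API of the same library
tokenizer.setTruncation(128);             // clip every encoding to at most 128 tokens
tokenizer.setPadding({ maxLength: 128 }); // pad shorter encodings up to 128 tokens

const encoded = await tokenizer.encode("A short question?", "And its short answer");
console.log(encoded.getTokens().length);  // 128, thanks to padding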

Installation

npm install tokenizers

Basic example

import { BertWordPieceTokenizer } from "tokenizers";

// Load a Bert WordPiece tokenizer from a local vocabulary file
const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });

// Encode a pair of sequences (e.g. question and context)
const wpEncoded = await wordPieceTokenizer.encode("Who is John?", "John is a teacher");

console.log(wpEncoded.getTokens());            // the produced tokens
console.log(wpEncoded.getIds());               // their ids in the vocabulary
console.log(wpEncoded.getAttentionMask());     // the attention mask
console.log(wpEncoded.getOffsets());           // [start, end] offsets into the original input
console.log(wpEncoded.getOverflowing());       // overflowing parts, if the input was truncated
console.log(wpEncoded.getSpecialTokensMask()); // which positions hold special tokens
console.log(wpEncoded.getTypeIds());           // type ids (0 for the first sequence, 1 for the second)
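
Because normalization keeps track of alignments, the offsets printed above map every token back to the original input, so you can always recover the exact slice of text a token came from. A small illustration, using only the calls from the example above (the slicing itself is plain JavaScript; note that special tokens such as [CLS] typically map to an empty [0, 0] span):

import { BertWordPieceTokenizer } from "tokenizers";

const tokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });

const input = "Who is John?";
const encoded = await tokenizer.encode(input);

// Each offset is a [start, end] pair into the original string,
// so input.slice(start, end) recovers the text behind each token.
encoded.getTokens().forEach((token, i) => {
  const [start, end] = encoded.getOffsets()[i];
  console.log(`${token} <- "${input.slice(start, end)}"`);
});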

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece
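
For the BPE-based tokenizers, loading typically requires both a vocabulary and a merges file. Here is a sketch for ByteLevelBPETokenizer, assuming its fromOptions accepts vocabFile and mergesFile options by analogy with BertWordPieceTokenizer above and the Python version of this tokenizer (not verified here):

import { ByteLevelBPETokenizer } from "tokenizers";

// Assumption: vocabFile/mergesFile option names, by analogy with the Python API
const bpeTokenizer = await ByteLevelBPETokenizer.fromOptions({
  vocabFile: "./vocab.json",  // byte-level BPE vocabulary
  mergesFile: "./merges.txt", // BPE merge rules
});

const bpeEncoded = await bpeTokenizer.encode("Hello, world!");
console.log(bpeEncoded.getTokens());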

License

Apache License 2.0