NodeJS implementation of today's most used tokenizers, with a focus on performance and versatility. These are bindings over the Rust implementation; if you are interested in the high-level design, you can check it out in the main tokenizers repository.
Main features
- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token (see the offsets follow-up after the basic example below).
- Does all the pre-processing: truncates, pads, and adds the special tokens your model needs (see the sketch right after this list).
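As an illustration of the pre-processing step, here is a minimal sketch of enabling truncation and padding before encoding. It assumes the bindings expose setTruncation and setPadding methods on Tokenizer; the method and option names here are assumptions, so check the typings of the version you install.
import { Tokenizer } from "tokenizers";

const tokenizer = await Tokenizer.fromFile("tokenizer.json");

// Assumed API: truncate every sequence to 128 tokens and pad up to that length.
tokenizer.setTruncation(128);
tokenizer.setPadding({ maxLength: 128 });

const encoded = await tokenizer.encode("Who is John?");
console.log(encoded.getTokens());        // tokens, including the model's special tokens
console.log(encoded.getAttentionMask()); // 1 for real tokens, 0 for padding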
Installation
npm install tokenizers@latest
Basic example
import { Tokenizer } from "tokenizers";
const tokenizer = await Tokenizer.fromFile("tokenizer.json");
const wpEncoded = await tokenizer.encode("Who is John?");
console.log(wpEncoded.getLength());
console.log(wpEncoded.getTokens());
console.log(wpEncoded.getIds());
console.log(wpEncoded.getAttentionMask());
console.log(wpEncoded.getOffsets());
console.log(wpEncoded.getOverflowing());
console.log(wpEncoded.getSpecialTokensMask());
console.log(wpEncoded.getTypeIds());
console.log(wpEncoded.getWordIds());
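Because every encoding carries offsets into the original input, each token can be mapped back to the span of text it came from. A small follow-up to the example above, assuming getOffsets() returns [start, end] pairs indexing into the input string:
const input = "Who is John?";
const encoded = await tokenizer.encode(input);

// Slicing the input with each [start, end) offset recovers the text behind every token.
// Special tokens such as [CLS] usually map to an empty (0, 0) span.
encoded.getTokens().forEach((token, i) => {
  const [start, end] = encoded.getOffsets()[i];
  console.log(`${token} -> "${input.slice(start, end)}"`);
});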