[0.13.2]
- Python only changes.
[0.13.1]
- #1072 Fixing Roberta type ids.
[0.13.0]
- #1008 `Decoder` is now a composable trait, but without being backward incompatible
- #1047, #1051, #1052 `Processor` is now a composable trait, but without being backward incompatible
[0.12.1]
- #938 Reverted breaking change. https://github.com/huggingface/transformers/issues/16520
[0.12.0] YANKED
Bump minor version because of a breaking change. Using `0.12` to match other bindings.
- #938 [REVERTED IN 0.12.1] Breaking change. The `Decoder` trait is modified to be composable. This is only breaking if you are using decoders on their own; `tokenizers` itself should be error free.
- #939 Making the regex in the `ByteLevel` pre_tokenizer optional (necessary for BigScience)
- #952 Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- #954 Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
- #961 Added link for Ruby port of `tokenizers`
0.8.0 (2021-09-02)
BREAKING CHANGES
- Many improvements on the Trainer (#519).
  The files must now be provided first when calling `tokenizer.train(files, trainer)`.
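A minimal sketch of the new call order, assuming a `Tokenizer` class, a `models.BPE.empty()` factory and a `trainers.bpeTrainer` factory are importable from the package (those names are assumptions; only the `train(files, trainer)` argument order comes from the entry above):

```typescript
import { Tokenizer, models, trainers } from "tokenizers";

async function trainFromFiles(files: string[]): Promise<Tokenizer> {
  // Assumed constructor and trainer factory; option names may differ.
  const tokenizer = new Tokenizer(models.BPE.empty());
  const trainer = trainers.bpeTrainer({ vocabSize: 30000 });

  // Since 0.8.0, the files come first and the trainer second.
  await tokenizer.train(files, trainer);
  return tokenizer;
}
```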
Features
- Adding the `TemplateProcessing`
- Add `WordLevel` and `Unigram` models (#490)
- Add `nmtNormalizer` and `precompiledNormalizer` normalizers (#490)
- Add `templateProcessing` post-processor (#490)
- Add `digitsPreTokenizer` pre-tokenizer (#490)
- Add support for mapping to sequences (#506)
- Add `splitPreTokenizer` pre-tokenizer (#542)
- Add `behavior` option to the `punctuationPreTokenizer` (#657)
- Add the ability to load tokenizers from the Hugging Face Hub using `fromPretrained` (#780)
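A hedged sketch of loading from the Hub; the entry above only names `fromPretrained`, so exposing it as a static method on `Tokenizer`, the model identifier and the Promise-based calls are assumptions:

```typescript
import { Tokenizer } from "tokenizers";

async function loadFromHub(): Promise<void> {
  // Download and build a tokenizer from a Hub repository (assumed signature).
  const tokenizer = await Tokenizer.fromPretrained("bert-base-uncased");

  const encoding = await tokenizer.encode("Hello from the Hub");
  console.log(encoding.tokens); // getter, per the 0.5.0 entry further down
}
```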
Fixes
- Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
- Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)
0.7.0 (2020-07-01)
BREAKING CHANGES
- `robertaProcessing` now handles trimming the offsets (activated by default) (#236)
- `charToTokenOffsets`, `charToWordOffsets` and `tokenToWordOffsets` helper functions on `Encoding` instances are removed and replaced by new `wordToTokens`, `wordToChars`, `tokenToChars`, `tokenToWord` and `charToWord` methods (#234)
- `encode` and `encodeBatch` methods on a tokenizer now handle pre-tokenized inputs and have their signatures changed (#249). In addition:
  - `encodeTokenized` and `encodeTokenizedBatch` methods are therefore removed
  - `InputSequence`, `EncodeInput` and `EncodeOptions` types are added
- Improve management of the additional vocabulary (#309):
  - New parameter `normalized` in `AddedToken` options, controlling whether a token should be extracted from the normalized version of the input text
  - The `AddedToken` constructor now takes a `special` boolean as second parameter to indicate if the token is special (in this case it won't be normalized)
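A short sketch of the `AddedToken` changes from #309 above; the constructor shape and the `normalized` option come from those entries, while the import path and the exact options object are assumptions:

```typescript
import { AddedToken } from "tokenizers";

// A special token: `true` as second argument, so it is never normalized.
const mask = new AddedToken("[MASK]", true);

// A regular added token, extracted from the normalized version of the input text.
const smiley = new AddedToken(":smile:", false, { normalized: true });
```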
Features
- Serialization of a `Tokenizer` and all its parts (`PreTokenizer`, `Normalizer`, ...). This adds some methods to easily save/load an entire tokenizer: new static methods `fromString` / `fromFile`, and instance methods `save` / `toString` on `BaseTokenizer` (#272). See the sketch after this list.
- New `padToMultipleOf` parameter for `PaddingOptions`, to pad to a multiple of a specified value (#289)
- Improved errors generated during truncation when the provided max length is too low (02cc977)
- Improve BPE training speeds, by reading files sequentially, but parallelizing the processing of each file (#276)
- Use `onig` for byte-level pre-tokenization to remove all the differences with the original implementation from GPT-2 (#280)
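A hedged save/load sketch for the serialization entry above; `fromFile`, `fromString`, `save` and `toString` are named there, but using them on the `Tokenizer` class, the file names and whether each call returns a Promise are assumptions:

```typescript
import { Tokenizer } from "tokenizers";

async function roundTrip(): Promise<Tokenizer> {
  // Load a full tokenizer (model, normalizer, pre-tokenizer, ...) from disk.
  const tokenizer = await Tokenizer.fromFile("tokenizer.json");

  const asJson: string = await tokenizer.toString(); // serialize everything to a JSON string
  await tokenizer.save("tokenizer-copy.json");       // or write it back to disk

  // Rebuild an identical tokenizer from the serialized string.
  return Tokenizer.fromString(asJson);
}
```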
Fixes
- Fix various crashes when training a BPE model (#286)
- Fix a few bugs related to additional vocabulary/tokens (#309)
0.6.2 (2020-04-13)
Features
- More symbols exposed: `Token`, `BaseTokenizer`, `PaddingConfiguration`, `TruncationConfiguration` (38d53a7)
- Expose `setPostProcessor` in `BaseTokenizer` (38d53a7)
Fixes
- Fix the word indexes when there are special tokens (#226)
- Fix encoding overflowing offsets (695ab83)
- Fix Roberta overflowings (c4ecc6f)
0.6.1 (2020-04-01)
Fixes
- Fix special tokens with wrong id (b770f36)
- Fix `AddedToken`'s `leftStrip` and `rightStrip` params (thanks @thirdwing) (85488dd)
0.6.0 (2020-03-30)
BREAKING CHANGES
- The `getOriginalString` method on `Encoding`s has been removed: this brings a reduction of 70% of the memory footprint. You can use the provided new `slice` function as a replacement to get a subpart of a string according to specified indexes while respecting unicode characters. (#197)
- The offsets provided on `Encoding` are now relative to the original string, and not the normalized one anymore (#197)
- The added tokens given to `addTokens`, `addSpecialTokens` or `train` methods of a tokenizer can now be instances of `AddedToken` to provide more control over these tokens. The support of the `[string, boolean]` format in `addTokens` method is removed. (#202) See the sketch after this list.
- The `addSpecialTokens` option for `BertWordPieceTokenizer` has been removed, and must now be passed to `encode` and `encodeBatch` functions (7dd2400) (#193)
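A hedged sketch combining the last two entries above (`AddedToken` instances in `addTokens`, and `addSpecialTokens` passed at encode time); the tokenizer class, its `fromOptions` loader and the exact position of the `addSpecialTokens` argument are assumptions:

```typescript
import { BertWordPieceTokenizer, AddedToken } from "tokenizers";

async function example() {
  const tokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "vocab.txt" });

  // The [string, boolean] format is gone; pass AddedToken instances instead.
  tokenizer.addTokens([new AddedToken("[NEW_TOKEN]")]);

  // addSpecialTokens is no longer a tokenizer option; it is given when encoding
  // (shown here as a trailing boolean, which is an assumption).
  const encoding = await tokenizer.encode("hello world", undefined, false);
  return encoding;
}
```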
Features
- `encode` and `encodeBatch` methods on `BaseTokenizer` now take a new optional argument, specifying whether to add the special tokens (activated by default) (#193)
- Methods `decode` and `decodeBatch` exposed in `BaseTokenizer` instances (#184)
- The `fromFiles` methods for `BPE` and `WordPiece` models are now `async` (#184)
- Big improvements in speed for BPE (both training and tokenization) (#165)
- `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token. It has been added to `ByteLevelBPETokenizer` but it is off by default. (#188)
- New `postProcess`, `encodeTokenized`, `encodeTokenizedBatch` and `normalize` methods on `BaseTokenizer` (#200) (2aeae55)
- New `mergeEncodings` static method on `Encoding` class (#200) (0408567)
- New `wordIndexes` getter and new `charToToken`, `charToTokenOffsets`, `charToWordOffsets` and `tokenToWordOffsets` helper functions on `Encoding` instances (#200) (ce3cf78)
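A small hedged sketch of the new `Encoding` helpers named in the last two entries; the import and exact signatures are assumptions:

```typescript
import { Encoding } from "tokenizers";

function inspect(first: Encoding, second: Encoding): void {
  // Static merge named in the entry above (assumed argument shape).
  const merged = Encoding.mergeEncodings([first, second]);

  console.log(merged.wordIndexes);    // word index for each token
  console.log(merged.charToToken(3)); // token covering character 3 (assumed signature)
}
```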
Fixes
- Fix `longest_first` truncation strategy (#174)
- Fix options names in `BPE.fromFiles` (306f427)
- Actually expose `save` method in `Model` (ddcf8e8)
- The errors in async functions are now typed (7aa6c13)
- Trim the decoded string in `bpeDecoder` used by `BPETokenizer` (#205) (3f4a6b7)
0.5.0 (2020-02-27)
BREAKING CHANGES
- The `Encoding` object now exposes getters instead of `get...` methods (except for `getOriginalString`) (9179968)
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts by default while decoding (#145) (#147)
Features
- `Encoding` exposes a new `length` property (9179968)
- Add a new `stripNormalizer` (#140) (815d743)
- `ByteLevelBPETokenizer` and `BPETokenizer` accept more options (946ac1a)
- Add `save` method to `Model` class (aebc97e)
- Improved padding performances (b30be3b) (0dc857e)
Fixes
- Methods accepting optional arguments now handle explicit `undefined` correctly (0fe22a7)
- Special tokens are now declared only if present in the vocabulary (b70283c)
- Add missing mask/padding special tokens in wordpiece tokenizer (b70283c)
- Fix a bug in `ByteLevelBPETokenizer` that caused offsets to be wrong if a char got split up in multiple bytes (#156)
0.4.1 (2020-02-11)
Fixes
- Fix punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
0.4.0 (2020-02-05)
BREAKING CHANGES
- `getOverflowing()` method on `Encoding` now returns all the overflowing `Encoding`s at once (#77) (0094393)
Features
- Add `setTruncation`, `disableTruncation`, `setPadding` and `disablePadding` methods in `Tokenizer` and `BaseTokenizer` (#109) (78e2690). See the sketch after this list.
- Expose tokenizer / truncation / padding configuration in `BaseTokenizer` (#126) (cb8585b)
- Expose `addTokens`, `addSpecialTokens`, `idToToken` and `tokenToId` in `BaseTokenizer` (7051480)
- Add `getOriginalString()` method on `Encoding` (a14c633)
- Add `charDelimiterSplitPreTokenizer`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`) (#114) (6165910)
- Add `robertaProcessing` as a new `PostProcessor` (#111) (6524f09)
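A hedged sketch of the truncation/padding helpers from the first entry above; the method names come from that entry, while the tokenizer class, its loader and the option names are assumptions:

```typescript
import { BertWordPieceTokenizer } from "tokenizers";

async function configure() {
  const tokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "vocab.txt" });

  tokenizer.setTruncation(128);             // truncate every encoding to 128 tokens
  tokenizer.setPadding({ maxLength: 128 }); // pad up to the same length (assumed option name)

  const encoding = await tokenizer.encode("a short sentence");

  tokenizer.disableTruncation();            // the matching "disable" helpers
  tokenizer.disablePadding();
  return encoding;
}
```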