[0.13.2]
- Python only changes.
[0.13.1]
- #1072 Fixing Roberta type ids.
[0.13.0]
- #1008 `Decoder` is now a composable trait, but without being backward incompatible
- #1047, #1051, #1052 `Processor` is now a composable trait, but without being backward incompatible
[0.12.1]
- #938 Reverted breaking change. https://github.com/huggingface/transformers/issues/16520
[0.12.0] YANKED
Bump minor version because of a breaking change. Using `0.12` to match other bindings.
- #938 [REVERTED IN 0.12.1] Breaking change. The `Decoder` trait is modified to be composable. This is only breaking if you are using decoders on their own; `tokenizers` itself should be error free.
- #939 Making the regex in the `ByteLevel` pre_tokenizer optional (necessary for BigScience)
- #952 Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- #954 Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
- #961 Added link for Ruby port of `tokenizers`
0.8.0 (2021-09-02)
BREAKING CHANGES
- Many improvements on the Trainer (#519).
  The files must now be provided first when calling `tokenizer.train(files, trainer)`.
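A minimal sketch of the new call order, assuming a `Tokenizer` class, a `models.BPE.empty()` factory and a `trainers.bpeTrainer` factory are importable from the package (those names are assumptions; only the `train(files, trainer)` argument order comes from the entry above):

```typescript
import { Tokenizer, models, trainers } from "tokenizers";

async function trainFromFiles(files: string[]): Promise<Tokenizer> {
  // Assumed constructor and trainer factory; option names may differ.
  const tokenizer = new Tokenizer(models.BPE.empty());
  const trainer = trainers.bpeTrainer({ vocabSize: 30000 });

  // Since 0.8.0, the files come first and the trainer second.
  await tokenizer.train(files, trainer);
  return tokenizer;
}
```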
Features
- Adding the `TemplateProcessing`
- Add `WordLevel` and `Unigram` models (#490)
- Add `nmtNormalizer` and `precompiledNormalizer` normalizers (#490)
- Add `templateProcessing` post-processor (#490)
- Add `digitsPreTokenizer` pre-tokenizer (#490)
- Add support for mapping to sequences (#506)
- Add `splitPreTokenizer` pre-tokenizer (#542)
- Add `behavior` option to the `punctuationPreTokenizer` (#657)
- Add the ability to load tokenizers from the Hugging Face Hub using `fromPretrained` (#780)
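A hedged sketch of loading from the Hub; the entry above only names `fromPretrained`, so exposing it as a static method on `Tokenizer`, the model identifier and the Promise-based calls are assumptions:

```typescript
import { Tokenizer } from "tokenizers";

async function loadFromHub(): Promise<void> {
  // Download and build a tokenizer from a Hub repository (assumed signature).
  const tokenizer = await Tokenizer.fromPretrained("bert-base-uncased");

  const encoding = await tokenizer.encode("Hello from the Hub");
  console.log(encoding.tokens); // getter, per the 0.5.0 entry further down
}
```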
Fixes
- Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
- Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)
0.7.0 (2020-07-01)
BREAKING CHANGES
- `robertaProcessing` now handles trimming the offsets (activated by default) (#236)
- `charToTokenOffsets`, `charToWordOffsets` and `tokenToWordOffsets` helper functions on `Encoding` instances are removed and replaced by new `wordToTokens`, `wordToChars`, `tokenToChars`, `tokenToWord` and `charToWord` methods (#234)
- `encode` and `encodeBatch` methods on a tokenizer now handle pre-tokenized inputs and have their signatures changed (#249). In addition:
  - `encodeTokenized` and `encodeTokenizedBatch` methods are therefore removed
  - `InputSequence`, `EncodeInput` and `EncodeOptions` types are added
- Improve management of the additional vocabulary (#309):
  - New parameter `normalized` in `AddedToken` options, controlling whether a token should be extracted from the normalized version of the input text
  - The `AddedToken` constructor now takes a `special` boolean as second parameter to indicate if the token is special (in this case it won't be normalized)
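A short sketch of the `AddedToken` changes from #309 above; the constructor shape and the `normalized` option come from those entries, while the import path and the exact options object are assumptions:

```typescript
import { AddedToken } from "tokenizers";

// A special token: `true` as second argument, so it is never normalized.
const mask = new AddedToken("[MASK]", true);

// A regular added token, extracted from the normalized version of the input text.
const smiley = new AddedToken(":smile:", false, { normalized: true });
```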
Features
- Serialization of a `Tokenizer` and all its parts (`PreTokenizer`, `Normalizer`, ...). This adds some methods to easily save/load an entire tokenizer: new static methods `fromString` / `fromFile`, and instance methods `save` / `toString` on `BaseTokenizer` (#272). See the sketch after this list.
- New `padToMultipleOf` parameter for `PaddingOptions`, to pad to a multiple of a specified value (#289)
- Improved errors generated during truncation when the provided max length is too low (02cc977)
- Improve BPE training speeds, by reading files sequentially, but parallelizing the processing of each file (#276)
- Use `onig` for byte-level pre-tokenization to remove all the differences with the original implementation from GPT-2 (#280)
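A hedged save/load sketch for the serialization entry above; `fromFile`, `fromString`, `save` and `toString` are named there, but using them on the `Tokenizer` class, the file names and whether each call returns a Promise are assumptions:

```typescript
import { Tokenizer } from "tokenizers";

async function roundTrip(): Promise<Tokenizer> {
  // Load a full tokenizer (model, normalizer, pre-tokenizer, ...) from disk.
  const tokenizer = await Tokenizer.fromFile("tokenizer.json");

  const asJson: string = await tokenizer.toString(); // serialize everything to a JSON string
  await tokenizer.save("tokenizer-copy.json");       // or write it back to disk

  // Rebuild an identical tokenizer from the serialized string.
  return Tokenizer.fromString(asJson);
}
```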
Fixes
- Fix various crashes when training a BPE model (#286)
- Fix a few bugs related to additional vocabulary/tokens (#309)
0.6.2 (2020-04-13)
Features
- More symbols exposed: `Token`, `BaseTokenizer`, `PaddingConfiguration`, `TruncationConfiguration` (38d53a7)
- Expose `setPostProcessor` in `BaseTokenizer` (38d53a7)
Fixes
- Fix the word indexes when there are special tokens (#226)
- Fix encoding overflowing offsets (695ab83)
- Fix Roberta overflowings (c4ecc6f)
0.6.1 (2020-04-01)
Fixes
- Fix special tokens with wrong id (b770f36)
- Fix `AddedToken`'s `leftStrip` and `rightStrip` params (thanks @thirdwing) (85488dd)
0.6.0 (2020-03-30)
BREAKING CHANGES
- The `getOriginalString` method on `Encoding`s has been removed: this brings a reduction of 70% of the memory footprint. You can use the provided new `slice` function as a replacement to get a subpart of a string according to specified indexes while respecting unicode characters. (#197)
- The offsets provided on `Encoding` are now relative to the original string, and not the normalized one anymore (#197)
- The added tokens given to `addTokens`, `addSpecialTokens` or `train` methods of a tokenizer can now be instances of `AddedToken` to provide more control over these tokens. The support of the `[string, boolean]` format in `addTokens` method is removed. (#202) See the sketch after this list.
- The `addSpecialTokens` option for `BertWordPieceTokenizer` has been removed, and must now be passed to `encode` and `encodeBatch` functions (7dd2400) (#193)
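A hedged sketch combining the last two entries above (`AddedToken` instances in `addTokens`, and `addSpecialTokens` passed at encode time); the tokenizer class, its `fromOptions` loader and the exact position of the `addSpecialTokens` argument are assumptions:

```typescript
import { BertWordPieceTokenizer, AddedToken } from "tokenizers";

async function example() {
  const tokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "vocab.txt" });

  // The [string, boolean] format is gone; pass AddedToken instances instead.
  tokenizer.addTokens([new AddedToken("[NEW_TOKEN]")]);

  // addSpecialTokens is no longer a tokenizer option; it is given when encoding
  // (shown here as a trailing boolean, which is an assumption).
  const encoding = await tokenizer.encode("hello world", undefined, false);
  return encoding;
}
```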
Features
- `encode` and `encodeBatch` methods on `BaseTokenizer` now take a new optional argument, specifying whether to add the special tokens (activated by default) (#193)
- Methods `decode` and `decodeBatch` exposed in `BaseTokenizer` instances (#184)
- The `fromFiles` methods for `BPE` and `WordPiece` models are now `async` (#184)
- Big improvements in speed for BPE (both training and tokenization) (#165)
- `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token. It has been added to `ByteLevelBPETokenizer` but it is off by default. (#188)
- New `postProcess`, `encodeTokenized`, `encodeTokenizedBatch` and `normalize` methods on `BaseTokenizer` (#200) (2aeae55)
- New `mergeEncodings` static method on `Encoding` class (#200) (0408567)
- New `wordIndexes` getter and new `charToToken`, `charToTokenOffsets`, `charToWordOffsets` and `tokenToWordOffsets` helper functions on `Encoding` instances (#200) (ce3cf78)
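A small hedged sketch of the new `Encoding` helpers named in the last two entries; the import and exact signatures are assumptions:

```typescript
import { Encoding } from "tokenizers";

function inspect(first: Encoding, second: Encoding): void {
  // Static merge named in the entry above (assumed argument shape).
  const merged = Encoding.mergeEncodings([first, second]);

  console.log(merged.wordIndexes);    // word index for each token
  console.log(merged.charToToken(3)); // token covering character 3 (assumed signature)
}
```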
Fixes
- Fix `longest_first` truncation strategy (#174)
- Fix options names in `BPE.fromFiles` (306f427)
- Actually expose `save` method in `Model` (ddcf8e8)
- The errors in async functions are now typed (7aa6c13)
- Trim the decoded string in `bpeDecoder` used by `BPETokenizer` (#205) (3f4a6b7)
0.5.0 (2020-02-27)
BREAKING CHANGES
- The `Encoding` object now exposes getters instead of `get...` methods (except for `getOriginalString`) (9179968)
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts by default while decoding (#145) (#147)
Features
- `Encoding` exposes a new `length` property (9179968)
- Add a new `stripNormalizer` (#140) (815d743)
- `ByteLevelBPETokenizer` and `BPETokenizer` accept more options (946ac1a)
- Add `save` method to `Model` class (aebc97e)
- Improved padding performances (b30be3b) (0dc857e)
Fixes
- Methods accepting optional arguments now handle explicit `undefined` correctly (0fe22a7)
- Special tokens are now declared only if present in the vocabulary (b70283c)
- Add missing mask/padding special tokens in wordpiece tokenizer (b70283c)
- Fix a bug in `ByteLevelBPETokenizer` that caused offsets to be wrong if a char got split up in multiple bytes (#156)
0.4.1 (2020-02-11)
Fixes
- Fix punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
0.4.0 (2020-02-05)
BREAKING CHANGES
- `getOverflowing()` method on `Encoding` now returns all the overflowing `Encoding`s at once (#77) (0094393)
Features
- Add `setTruncation`, `disableTruncation`, `setPadding` and `disablePadding` methods in `Tokenizer` and `BaseTokenizer` (#109) (78e2690). See the sketch after this list.
- Expose tokenizer / truncation / padding configuration in `BaseTokenizer` (#126) (cb8585b)
- Expose `addTokens`, `addSpecialTokens`, `idToToken` and `tokenToId` in `BaseTokenizer` (7051480)
- Add `getOriginalString()` method on `Encoding` (a14c633)
- Add `charDelimiterSplitPreTokenizer`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`) (#114) (6165910)
- Add `robertaProcessing` as a new `PostProcessor` (#111) (6524f09)
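A hedged sketch of the truncation/padding helpers from the first entry above; the method names come from that entry, while the tokenizer class, its loader and the option names are assumptions:

```typescript
import { BertWordPieceTokenizer } from "tokenizers";

async function configure() {
  const tokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "vocab.txt" });

  tokenizer.setTruncation(128);             // truncate every encoding to 128 tokens
  tokenizer.setPadding({ maxLength: 128 }); // pad up to the same length (assumed option name)

  const encoding = await tokenizer.encode("a short sentence");

  tokenizer.disableTruncation();            // the matching "disable" helpers
  tokenizer.disablePadding();
  return encoding;
}
```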