# Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## [0.13.2]

- #1096 Python 3.11 support
## [0.13.1]

- #1072 Fixing Roberta type ids.
## [0.13.0]

- #956 PyO3 version upgrade
- #1055 M1 automated builds
- #1008 `Decoder` is now a composable trait, but without being backward incompatible
- [#1047, #1051, #1052] `Processor` is now a composable trait, but without being backward incompatible

Both trait changes warrant a "major" version bump since, despite best efforts not to break backward compatibility, the code is different enough that we cannot be entirely sure.
## [0.12.1]

- #938 Reverted breaking change. https://github.com/huggingface/transformers/issues/16520
## [0.12.0] YANKED

Bump minor version because of a breaking change.

- #938 [REVERTED IN 0.12.1] Breaking change. The `Decoder` trait is modified to be composable. This is only breaking if you are using decoders on their own. `tokenizers` should be error free.
- #939 Making the regex in the `ByteLevel` pre_tokenizer optional (necessary for BigScience)
- #952 Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- #954 Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Yell warnings instead, but stop panicking.
- #962 Fix tests for python 3.10
- #961 Added link for the Ruby port of `tokenizers`
## [0.11.6]

- #919 Fixing single_word AddedToken. (regression from 0.11.2)
- #916 Deserializing faster `added_tokens` by loading them in batch.
## [0.11.5]

- #895 Build python 3.10 wheels.
## [0.11.4]

- #884 Fixing bad deserialization following inclusion of a default for Punctuation
## [0.11.3]

- #882 Fixing Punctuation deserialize without argument.
- #868 Fixing missing direction in TruncationParams
- #860 Adding TruncationSide to TruncationParams
## [0.11.0]

### Fixed

- #585 Conda version should now work on old CentOS
- #844 Fixing interaction between `is_pretokenized` and `trim_offsets`.
- #851 Doc links

### Added

- #657: Add SplitDelimiterBehavior customization to Punctuation constructor
- #845: Documentation for `Decoders`.

### Changed

- #850: Added a feature gate to enable disabling `http` features
- #718: Fix `WordLevel` tokenizer determinism during training
- #762: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- #770: Improved documentation for `UnigramTrainer`
- #780: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub (see the sketch after this list)
- #793: Saving a pretty JSON file by default when saving a tokenizer
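A minimal sketch of loading a tokenizer from the Hub with `Tokenizer.from_pretrained`; the model identifier below is only an example and network access is assumed:

```python
from tokenizers import Tokenizer

# "bert-base-uncased" is just an example repo id; any Hub repo that ships a
# tokenizer.json can be used. This downloads the file and builds the tokenizer.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello, world!").tokens)
```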
## [0.10.3]

### Fixed

- #686: Fix SPM conversion process for whitespace deduplication
- #707: Fix stripping strings containing Unicode characters

### Added

- #693: Add a CTC Decoder for Wave2Vec models

### Removed

- #714: Removed support for Python 3.5
## [0.10.2]

### Fixed

- #652: Fix offsets for `Precompiled` corner case
- #656: Fix BPE `continuing_subword_prefix`
- #674: Fix `Metaspace` serialization problems
## [0.10.1]

### Fixed

- #616: Fix SentencePiece tokenizers conversion
- #617: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- #618: Fix Normalizer.normalize with `PyNormalizedStringRefMut`
- #620: Fix serialization/deserialization for overlapping models
- #621: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`)
## [0.10.0]

### Added

- #508: Add a Visualizer for notebooks to help understand how the tokenizers work
- #519: Add a `WordLevelTrainer` used to train a `WordLevel` model
- #533: Add support for conda builds
- #542: Add Split pre-tokenizer to easily split using a pattern
- #544: Ability to train from memory. This also improves the integration with `datasets` (see the sketch after this list)
- #590: Add getters/setters for components on BaseTokenizer
- #574: Add `fuse_unk` option to SentencePieceBPETokenizer
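A minimal sketch of training from memory, here with a tiny made-up corpus; any Python iterator over strings, such as a `datasets` column, works the same way:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical in-memory corpus; no files on disk are needed.
corpus = ["Training straight from memory.", "Any iterator of strings works."]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000)
tokenizer.train_from_iterator(corpus, trainer=trainer)
```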
### Changed

- #509: Automatically stubbing the `.pyi` files
- #519: Each `Model` can return its associated `Trainer` with `get_trainer()`
- #530: The various attributes on each component can be get/set (i.e. `tokenizer.model.dropout = 0.1`)
- #538: The API Reference has been improved and is now up-to-date.
### Fixed

- #519: During training, the `Model` is now trained in-place. This fixes several bugs that forced reloading the `Model` after training.
- #539: Fix `BaseTokenizer` enable_truncation docstring
## [0.9.4]

### Fixed

- #492: Fix `from_file` on `BertWordPieceTokenizer`
- #498: Fix the link to download `sentencepiece_model_pb2.py`
- #500: Fix a typo in the docs quicktour

### Changed

- #506: Improve Encoding mappings for pairs of sequences
## [0.9.3]

### Fixed

- #470: Fix hanging error when training with custom component
- #476: TemplateProcessing serialization is now deterministic
- #481: Fix SentencePieceBPETokenizer.from_files

### Added

- #477: UnicodeScripts PreTokenizer to avoid merges between various scripts
- #480: Unigram now accepts an `initial_alphabet` and handles `special_tokens` correctly
## [0.9.2]

### Fixed

- #464: Fix a problem with RobertaProcessing being deserialized as BertProcessing
## [0.9.1]

### Fixed

- #459: Fix a problem with deserialization
## [0.9.0]

### Fixed

- #362: Fix training deadlock with Python components.
- #363: Fix a crash when calling `.train` with some non-existent files
- #355: Remove a lot of possible crashes
- #389: Improve truncation (crash and consistency)
### Added

- #379: Add the ability to call `encode`/`encode_batch` with numpy arrays
- #292: Support for the Unigram algorithm
- #378, #394, #416, #417: Many new Normalizers and PreTokenizers
- #403: Add `TemplateProcessing` `PostProcessor` (see the sketch after this list).
- #420: Ability to fuse the "unk" token in BPE.
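A minimal sketch of the `TemplateProcessing` post-processor; the templates and special token ids below are example values that must match the vocabulary of the tokenizer it is attached to:

```python
from tokenizers.processors import TemplateProcessing

# Describes where to insert special tokens for single sequences and for pairs.
# Assign the result to `tokenizer.post_processor` on an existing tokenizer.
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```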
### Changed

- #360: Lots of improvements related to words/alignment tracking
- [#426]: Improvements on error messages thanks to PyO3 0.12
## [0.8.1]

### Fixed

- #333: Fix deserialization of `AddedToken`, where the content was not restored properly

### Changed

- #329: Improved warning and behavior when we detect a fork
- #330: BertNormalizer now keeps the same behavior as the original implementation when `strip_accents` is not specified.
## [0.8.0]

### Highlights of this release

- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when processing datasets that are already pre-tokenized, as for NER (Named Entity Recognition), and it helps when applying labels to each word.
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and later load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code (see the sketch after this list).
- With the serialization comes compatibility with `Pickle`! The Tokenizer, all of its components, Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library makes heavy use of the multithreading capabilities of our computers to allow very fast tokenization, this led to problems (deadlocks) when used with `multiprocessing`. This version now allows disabling the parallelism, and will warn you if this is necessary.
- And a lot of other improvements and fixes.
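A minimal sketch of the one-line save/load flow and pickling described above; the file name is arbitrary and the bare `BPE()` model only stands in for a fully configured tokenizer:

```python
import pickle

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())                  # placeholder for a real, trained tokenizer

tokenizer.save("tokenizer.json")              # one file on disk...
same = Tokenizer.from_file("tokenizer.json")  # ...one line to load it back

blob = pickle.dumps(tokenizer)                # the whole tokenizer is picklable
tokenizer = pickle.loads(blob)
```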
### Fixed

- #286: Fix various crashes when training a BPE model
- #309: Fixed a few bugs related to additional vocabulary/tokens
### Added

- #272: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...). This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- #273: `Tokenizer` and its parts are now picklable
- #289: Ability to pad to a multiple of a specified value. This is especially useful to ensure activation of the Tensor Cores, while ensuring padding to a multiple of 8. Use with `enable_padding(pad_to_multiple_of=8)` for example (see the sketch after this list).
- [#298]: Ability to get the currently set truncation/padding params
- #311: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment variable. This is especially useful when using `multiprocessing` capabilities with the `fork` start method, which happens to be the default on Linux systems. Without disabling the parallelism, the process dead-locks while encoding. (Cf #187 for more information)
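A minimal sketch combining the two additions above; the tokenizer file name is a placeholder, and setting the environment variable before importing `tokenizers` is the safest order:

```python
import os

# Disable the Rust-side parallelism before forking (e.g. with `multiprocessing`
# and the `fork` start method) to avoid the dead-lock described above.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical saved tokenizer

# Pad every encoded sequence up to the next multiple of 8 (useful for Tensor Cores).
tokenizer.enable_padding(pad_to_multiple_of=8)
```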
### Changed

- Improved errors generated during truncation: when the provided max length is too low, the error is now handled properly.
- #249: `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized, the argument `is_pretokenized=True` must be specified (see the sketch after this list).
- #276: Improve BPE training speeds by reading files sequentially, but parallelizing the processing of each file
- #280: Use `onig` for byte-level pre-tokenization to remove all the differences with the original implementation from GPT-2
- #309: Improved the management of the additional vocabulary. This introduces an option `normalized`, controlling whether a token should be extracted from the normalized version of the input text.
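A minimal sketch of encoding pre-tokenized inputs; the tokenizer file name is a placeholder:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical saved tokenizer

# Raw string input (default behaviour).
tokenizer.encode("Hello world")

# Pre-tokenized input: a list of words, e.g. coming from a NER dataset.
tokenizer.encode(["Hello", "world"], is_pretokenized=True)

# The same flag applies to batches of pre-tokenized sequences.
tokenizer.encode_batch([["Hello", "world"], ["How", "are", "you"]], is_pretokenized=True)
```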
## [0.7.0]

### Changed

- Only one progress bar while reading files during training. This is better for use-cases with a high number of files, as it avoids having too many progress bars on screen. It also avoids reading the size of each file before actually starting to read them, a process which could take a really long time.
- #193: `encode` and `encode_batch` now take a new optional argument specifying whether we should add the special tokens. This is activated by default.
- #197: `original_str` and `normalized_str` have been removed from the `Encoding` returned by `encode` and `encode_batch`. This brings a 70% reduction of the memory footprint.
- #197: The offsets provided on `Encoding` are now relative to the original string, and not the normalized one anymore.
- The added token given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using `train(special_tokens=...)`, can now be instances of `AddedToken` to provide more control over these tokens.
- [#136]: Updated PyO3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using constructors.
- #239: `CharBPETokenizer` now corresponds to the OpenAI GPT BPE implementation by default.
### Added

- #188: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token. It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- #236: `RobertaProcessing` also handles trimming the offsets.
- #234: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char` or `word` (input space) and `token` (output space).
- `post_process` can be called on the `Tokenizer`
- #208: Ability to retrieve the vocabulary from the `Tokenizer` with `get_vocab(with_added_tokens: bool)`
- [#136] Models can now be instantiated through object constructors.
### Fixed

- #193: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space=True`
  - #156: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- #175: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not advised, but that's not the question).
- #205: Trim the decoded string in `BPEDecoder`, used by `CharBPETokenizer`
### How to migrate

- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- The `BertWordPieceTokenizer` option `add_special_tokens` must now be given to `encode` or `encode_batch`
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input of `encode`, so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling `normalize(sequence)` on the `Tokenizer`
- Change `Model.from_files` and `Model.empty` to use constructors, as sketched below. The model constructor should take the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`).
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set `bert_normalizer=False` and `split_on_whitespace_only=True`.
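A minimal sketch of the constructor migration, written against the 0.7.0-era API; the vocab/merges file names are placeholders, and later releases changed the constructor arguments again (e.g. `BPE.from_file`):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Before 0.7.0 (removed):
#   model = BPE.from_files("vocab.json", "merges.txt")
#   empty = BPE.empty()

# From 0.7.0 on, call the constructors with the same arguments instead.
tokenizer = Tokenizer(BPE("vocab.json", "merges.txt"))
empty_model = BPE()
```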
## [0.6.0]

### Changed

- #165: Big improvements in speed for BPE (both training and tokenization)

### Fixed

- #160: Some default tokens were missing from `BertWordPieceTokenizer`
- #156: There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got split up into multiple bytes.
- #174: The `longest_first` truncation strategy had a bug
## [0.5.2]

- #163: Do not open all files directly while training

### Fixed

- We introduced a bug related to the saving of the WordPiece model in 0.5.1: the `vocab.txt` file was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary to the wrong format.
## [0.5.1]

### Changed

- The `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.
## [0.5.0]

### Changed

- #145: `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding
- #149: `ByteLevelBPETokenizer` now has `dropout`.
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers. (Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- #139: Expose `__len__` on `Encoding`
- Improved padding performances.
### Added

- Added a new `Strip` normalizer
### Fixed

- #145: Decoding was buggy on `BertWordPieceTokenizer`.
- #152: Some documentation and examples were still using the old `BPETokenizer`
### How to migrate

- Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of `do_lowercase`, as sketched below.
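A minimal sketch of the rename, assuming the current constructor keywords:

```python
from tokenizers import ByteLevelBPETokenizer, CharBPETokenizer

# Before 0.5.0 (removed):
#   ByteLevelBPETokenizer(do_lowercase=True)

# From 0.5.0 on:
byte_level = ByteLevelBPETokenizer(lowercase=True)
char_level = CharBPETokenizer(lowercase=True)
```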
## [0.4.2]

### Fixed

- #137: Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained.
## [0.4.1]

### Fixed

- #134: Fix a bug related to the punctuation in BertWordPieceTokenizer
## [0.4.0]

### Changed

- #131: Replaced all `.new()` class methods by a proper new implementation
- Improved typings

### How to migrate

- Remove all `.new` calls from all class instantiations
## [0.3.0]

### Changed

- BPETokenizer has been renamed to CharBPETokenizer for clarity.
- Improve truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a language model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```

- #99: Exposed the vocabulary size on all tokenizers
### Added

- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`); see the sketch after this list.
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
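A minimal sketch of `CharDelimiterSplit`, assuming the current Python API where pre-tokenizers expose `pre_tokenize_str`:

```python
from tokenizers.pre_tokenizers import CharDelimiterSplit

# Splits on a single delimiter character, much like `str.split(delimiter)`.
pre_tokenizer = CharDelimiterSplit("|")
print(pre_tokenizer.pre_tokenize_str("a|b|c"))
# Roughly: [('a', (0, 1)), ('b', (2, 3)), ('c', (4, 5))]
```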
### Fixed

- Fix a bug with IndexableString
- Fix a bug with truncation
### How to migrate

- Rename `BPETokenizer` to `CharBPETokenizer`
- `Encoding.overflowing` is now a List instead of an `Optional[Encoding]`
## [0.2.1]

### Fixed

- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5