# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.13.2]

- [#1096] Python 3.11 support

## [0.13.1]

- [#1072] Fixing Roberta type ids.

## [0.13.0]

- [#956] PyO3 version upgrade
- [#1055] M1 automated builds
- [#1008] `Decoder` is now a composable trait, without breaking backward compatibility (see the sketch below)
- [#1047], [#1051], [#1052] `Processor` is now a composable trait, without breaking backward compatibility

Both trait changes warrant a "major" version bump because, despite our best efforts not to break
backward compatibility, the code is different enough that we cannot be entirely sure.

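For illustration, a minimal sketch of composing decoders, assuming the Python bindings expose a `decoders.Sequence` wrapper alongside the `Replace` and `Fuse` decoders:

```python
from tokenizers import decoders

# Compose several decoders into one pipeline; each step receives the
# output of the previous one.
decoder = decoders.Sequence([
    decoders.Replace("▁", " "),  # turn SentencePiece markers into spaces
    decoders.Fuse(),             # fuse all tokens into a single string
])

print(decoder.decode(["▁Hello", "▁world"]))  # -> " Hello world"
```
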
## [0.12.1]

- [#938] **Reverted breaking change**. https://github.com/huggingface/transformers/issues/16520

## [0.12.0] YANKED

Bump minor version because of a breaking change.

- [#938] [REVERTED IN 0.12.1] **Breaking change**. The `Decoder` trait is modified to be composable. This is only breaking if you are using decoders on their own; `tokenizers` itself should be error free.
- [#939] Making the regex in the `ByteLevel` pre_tokenizer optional (necessary for BigScience); see the sketch below
- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). A warning is emitted instead of panicking.
- [#962] Fix tests for python 3.10
- [#961] Added link for Ruby port of `tokenizers`

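A hedged sketch of the optional regex, assuming the Python bindings expose it as a `use_regex` argument on `ByteLevel`; with `use_regex=False`, the GPT-2 splitting regex is skipped and the input is only byte-level encoded:

```python
from tokenizers import pre_tokenizers

# Build a ByteLevel pre-tokenizer that skips the GPT-2 splitting regex.
byte_level = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
print(byte_level.pre_tokenize_str("Hello there!"))
```
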
## [0.11.6]

- [#919] Fixing `single_word` AddedToken (regression from 0.11.2)
- [#916] Faster deserialization of `added_tokens`, by loading them in batch.

## [0.11.5]

- [#895] Build `python 3.10` wheels.

## [0.11.4]

- [#884] Fixing bad deserialization following inclusion of a default for Punctuation

## [0.11.3]

- [#882] Fixing Punctuation deserialize without argument.
- [#868] Fixing missing direction in TruncationParams
- [#860] Adding TruncationSide to TruncationParams (see the sketch below)

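A minimal sketch of the truncation side, assuming `enable_truncation` exposes it as a `direction` argument (`"left"` or `"right"`); `"tokenizer.json"` is a placeholder path:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
# Truncate to 8 tokens, dropping tokens from the left of the sequence.
tokenizer.enable_truncation(max_length=8, direction="left")
```
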
## [0.11.0]

### Fixed

- [#585] Conda version should now work on old CentOS
- [#844] Fixing interaction between `is_pretokenized` and `trim_offsets`.
- [#851] Doc links

### Added

- [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
- [#845]: Documentation for `Decoders`.

### Changed

- [#850]: Added a feature gate to enable disabling `http` features
- [#718]: Fix `WordLevel` tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- [#770]: Improved documentation for `UnigramTrainer`
- [#780]: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub (see the sketch below)
- [#793]: Saving a pretty JSON file by default when saving a tokenizer

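A short sketch of loading a tokenizer straight from the Hugging Face Hub; `"bert-base-uncased"` is just an example repository name:

```python
from tokenizers import Tokenizer

# Download and instantiate a tokenizer from the Hub in one call.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello world").tokens)
```
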
## [0.10.3]

### Fixed

- [#686]: Fix SPM conversion process for whitespace deduplication
- [#707]: Fix stripping strings containing Unicode characters

### Added

- [#693]: Add a CTC Decoder for Wav2Vec2 models (see the sketch below)

### Removed

- [#714]: Removed support for Python 3.5

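A hedged sketch of the CTC decoder, assuming it is exposed as `decoders.CTC` with the usual CTC post-processing (collapse repeated tokens, drop pad tokens, turn word delimiters into spaces):

```python
from tokenizers import decoders

ctc = decoders.CTC(pad_token="<pad>", word_delimiter_token="|")
# Repeats collapse, "<pad>" frames drop out, "|" becomes a space.
print(ctc.decode(["h", "h", "e", "l", "l", "<pad>", "l", "o", "|", "h", "i"]))
# -> "hello hi"
```
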
## [0.10.2]

### Fixed

- [#652]: Fix offsets for `Precompiled` corner case
- [#656]: Fix BPE `continuing_subword_prefix`
- [#674]: Fix `Metaspace` serialization problems

## [0.10.1]

### Fixed

- [#616]: Fix SentencePiece tokenizers conversion
- [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- [#618]: Fix Normalizer.normalize with `PyNormalizedStringRefMut`
- [#620]: Fix serialization/deserialization for overlapping models
- [#621]: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`)

## [0.10.0]

### Added

- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets` (see the sketch below)
- [#590]: Add getters/setters for components on BaseTokenizer
- [#574]: Add `fuse_unk` option to SentencePieceBPETokenizer

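A minimal sketch of training from memory via the `train_from_iterator` API; the corpus here is a stand-in for any iterator of strings (e.g. a `datasets` column):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["the quick brown fox", "jumped over the lazy dog"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])

# Train directly from the in-memory iterator, no files needed.
tokenizer.train_from_iterator(corpus, trainer=trainer)
```
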
### Changed

- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes on each component can be read and set (e.g.
  `tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.

### Fixed

- [#519]: During training, the `Model` is now trained in-place. This fixes several bugs that were
  forcing a reload of the `Model` after training.
- [#539]: Fix `BaseTokenizer` enable_truncation docstring

## [0.9.4]

### Fixed

- [#492]: Fix `from_file` on `BertWordPieceTokenizer`
- [#498]: Fix the link to download `sentencepiece_model_pb2.py`
- [#500]: Fix a typo in the docs quicktour

### Changed

- [#506]: Improve Encoding mappings for pairs of sequences

## [0.9.3]

### Fixed

- [#470]: Fix hanging error when training with custom component
- [#476]: TemplateProcessing serialization is now deterministic
- [#481]: Fix SentencePieceBPETokenizer.from_files

### Added

- [#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
- [#480]: Unigram now accepts an `initial_alphabet` and handles `special_tokens` correctly

## [0.9.2]

### Fixed

- [#464]: Fix a problem with RobertaProcessing being deserialized as BertProcessing

## [0.9.1]

### Fixed

- [#459]: Fix a problem with deserialization

## [0.9.0]

### Fixed

- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)

### Added

- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add `TemplateProcessing` `PostProcessor` (see the sketch below).
- [#420]: Ability to fuse the "unk" token in BPE.

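A minimal sketch of `TemplateProcessing` for a BERT-style template; the special-token ids (1 and 2) are illustrative placeholders:

```python
from tokenizers.processors import TemplateProcessing

# $A / $B stand for the first / second sequence; ":1" sets the type id.
processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```
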
### Changed

- [#360]: Lots of improvements related to words/alignment tracking
- [#426]: Improvements on error messages thanks to PyO3 0.12

## [0.8.1]

### Fixed

- [#333]: Fix deserialization of `AddedToken`, where the content was not restored properly

### Changed

- [#329]: Improved warning and behavior when we detect a fork
- [#330]: BertNormalizer now keeps the same behavior as the original implementation when
  `strip_accents` is not specified.

## [0.8.0]

### Highlights of this release

- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
  processing datasets that are already pre-tokenized, as for NER (Named Entity Recognition), and it
  helps when applying labels to each word.
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file, and to later
  load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code
  (see the sketch below).
- With the serialization comes compatibility with `Pickle`! The Tokenizer, all of its components,
  Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library
  makes heavy use of the multithreading capabilities of our computers to allow very fast tokenization,
  this led to problems (deadlocks) when used with `multiprocessing`. This version now allows
  disabling the parallelism, and will warn you when this is necessary.
- And a lot of other improvements and fixes.

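A minimal sketch of the one-line save/load and of pickling; `"tokenizer.json"` is a placeholder path:

```python
import pickle

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # load in one line
tokenizer.save("tokenizer.json")                   # save in one line

# The whole tokenizer round-trips through pickle as well.
restored = pickle.loads(pickle.dumps(tokenizer))
```
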
### Fixed

- [#286]: Fix various crashes when training a BPE model
- [#309]: Fixed a few bugs related to additional vocabulary/tokens

### Added

- [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
  This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- [#273]: `Tokenizer` and its parts are now picklable
- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
  activation of the Tensor Cores, by ensuring padding to a multiple of 8. Use
  `enable_padding(pad_to_multiple_of=8)` for example (see the sketch below).
- [#298]: Ability to get the currently set truncation/padding params
- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
  variable. This is especially useful when using `multiprocessing` capabilities, with the `fork`
  start method, which happens to be the default on Linux systems. Without disabling the parallelism,
  the process deadlocks while encoding. (Cf [#187] for more information)

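A short sketch combining both additions; `"tokenizer.json"` is a placeholder path:

```python
import os

# Disable the library's parallelism before encoding starts, e.g. at the
# top of each forked worker, to avoid the multiprocessing deadlock.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
# Pad every encoded batch to a multiple of 8 (e.g. for Tensor Cores).
tokenizer.enable_padding(pad_to_multiple_of=8)
```
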
### Changed

- Improved errors generated during truncation: cases where the provided max length is too low are
  now handled properly.
- [#249]: `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized,
  the argument `is_pretokenized=True` must be specified (see the sketch below).
- [#276]: Improve BPE training speeds, by reading files sequentially, but parallelizing the
  processing of each file
- [#280]: Use `onig` for byte-level pre-tokenization to remove all the differences with the original
  implementation from GPT-2
- [#309]: Improved the management of the additional vocabulary. This introduces an option
  `normalized`, controlling whether a token should be extracted from the normalized version of the
  input text.

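A minimal sketch of encoding pre-tokenized input; `"tokenizer.json"` is a placeholder path:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
# The input is already split into words, so tell the tokenizer so.
encoding = tokenizer.encode(["Hello", "world", "!"], is_pretokenized=True)
print(encoding.tokens)
```
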
## [0.7.0]

### Changed

- Only one progress bar while reading files during training. This is better for use-cases with
  a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
  the size of each file before actually reading them, as this could take a really long time.
- [#193]: `encode` and `encode_batch` now take a new optional argument, specifying whether we
  should add the special tokens. This is activated by default.
- [#197]: `original_str` and `normalized_str` have been removed from the `Encoding` returned by
  `encode` and `encode_batch`. This brings a 70% reduction of the memory footprint.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- The added tokens given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
  `train(special_tokens=...)`, can now be instances of `AddedToken` to provide more control over these
  tokens.
- [#136]: Updated PyO3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using
  constructors.
- [#239]: `CharBPETokenizer` now corresponds to OpenAI GPT BPE implementation by default.

### Added

- [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
  This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
  whitespaces are part of the actual token.
  It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- [#234]: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char`
  or `word` (input space) and `token` (output space) (see the sketch below).
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
  `get_vocab(with_added_tokens: bool)`
- [#136]: Models can now be instantiated through object constructors.

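A minimal sketch of the alignment mappings and vocabulary access; `"tokenizer.json"` is a placeholder path:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
encoding = tokenizer.encode("Hello world")

print(encoding.token_to_word(1))   # word index that produced token 1
print(encoding.char_to_token(3))   # token covering character 3 of the input
print(len(tokenizer.get_vocab(with_added_tokens=True)))
```
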
### Fixed

- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space=True`
  - [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain number of tokens (not
  advisable, but it should still work).
- [#205]: Trim the decoded string in `BPEDecoder` used by `CharBPETokenizer`

### How to migrate

- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
  using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- The `BertWordPieceTokenizer` option `add_special_tokens` must now be given to `encode` or
  `encode_batch`
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
  of `encode`, so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
  are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling
  `normalize(sequence)` on the `Tokenizer`
- Change `Model.from_files` and `Model.empty` to use constructors. The model constructor should take
  the same arguments as the old methods. (i.e. `BPE(vocab, merges)` or `BPE()`)
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set
  `bert_normalizer=False` and `split_on_whitespace_only=True`.

## [0.6.0]

### Changed

- [#165]: Big improvements in speed for BPE (both training and tokenization)

### Fixed

- [#160]: Some default tokens were missing from `BertWordPieceTokenizer`
- [#156]: There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got
  split up into multiple bytes.
- [#174]: The `longest_first` truncation strategy had a bug

## [0.5.2]

### Changed

- [#163]: Do not open all files directly while training

### Fixed

- We introduced a bug related to the saving of the WordPiece model in 0.5.1: the `vocab.txt` file
  was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary in the wrong format.

## [0.5.1]

### Changed

- `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not
  specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.

## [0.5.0]

### Changed

- [#145]: `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding
- [#149]: `ByteLevelBPETokenizer` now has `dropout`.
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers.
  (Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- [#139]: Expose `__len__` on `Encoding`
- Improved padding performance.

### Added

- Added a new `Strip` normalizer

### Fixed

- [#145]: Decoding was buggy on `BertWordPieceTokenizer`.
- [#152]: Some documentation and examples were still using the old `BPETokenizer`

### How to migrate

- Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of
  `do_lowercase`.

## [0.4.2]

### Fixed

- [#137]: Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from
  being trained.

## [0.4.1]

### Fixed

- [#134]: Fix a bug related to the punctuation in BertWordPieceTokenizer

## [0.4.0]

### Changed

- [#131]: Replaced all `.new()` class methods by a proper `__new__` implementation
- Improved typings

### How to migrate

- Remove all `.new` from all class instantiations

## [0.3.0]

### Changed

- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Improved truncation/padding and the handling of overflowing tokens. Now, when a sequence gets
  truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a
  language model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:

```
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```

- [#99]: Exposed the vocabulary size on all tokenizers

### Added

- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given
  delimiter (works like `.split(delimiter)`); see the sketch below
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.

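A small sketch of `CharDelimiterSplit`, assuming it is exposed under `tokenizers.pre_tokenizers` as in current versions of the library:

```python
from tokenizers import pre_tokenizers

# Split the input on "|", like str.split("|"), keeping offsets.
splitter = pre_tokenizers.CharDelimiterSplit("|")
print(splitter.pre_tokenize_str("foo|bar|baz"))
# -> [('foo', (0, 3)), ('bar', (4, 7)), ('baz', (8, 11))]
```
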
### Fixed

- Fix a bug with IndexableString
- Fix a bug with truncation

### How to migrate

- Rename `BPETokenizer` to `CharBPETokenizer`
- `Encoding.overflowing` is now a `List` instead of an `Optional[Encoding]`

## [0.2.1]

### Fixed

- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5

[#1096]: https://github.com/huggingface/tokenizers/pull/1096
[#1072]: https://github.com/huggingface/tokenizers/pull/1072
[#956]: https://github.com/huggingface/tokenizers/pull/956
[#1008]: https://github.com/huggingface/tokenizers/pull/1008
[#1009]: https://github.com/huggingface/tokenizers/pull/1009
[#1047]: https://github.com/huggingface/tokenizers/pull/1047
[#1055]: https://github.com/huggingface/tokenizers/pull/1055
[#1051]: https://github.com/huggingface/tokenizers/pull/1051
[#1052]: https://github.com/huggingface/tokenizers/pull/1052
[#938]: https://github.com/huggingface/tokenizers/pull/938
[#939]: https://github.com/huggingface/tokenizers/pull/939
[#952]: https://github.com/huggingface/tokenizers/pull/952
[#954]: https://github.com/huggingface/tokenizers/pull/954
[#962]: https://github.com/huggingface/tokenizers/pull/962
[#961]: https://github.com/huggingface/tokenizers/pull/961
[#960]: https://github.com/huggingface/tokenizers/pull/960
[#919]: https://github.com/huggingface/tokenizers/pull/919
[#916]: https://github.com/huggingface/tokenizers/pull/916
[#895]: https://github.com/huggingface/tokenizers/pull/895
[#884]: https://github.com/huggingface/tokenizers/pull/884
[#882]: https://github.com/huggingface/tokenizers/pull/882
[#868]: https://github.com/huggingface/tokenizers/pull/868
[#860]: https://github.com/huggingface/tokenizers/pull/860
[#850]: https://github.com/huggingface/tokenizers/pull/850
[#844]: https://github.com/huggingface/tokenizers/pull/844
[#845]: https://github.com/huggingface/tokenizers/pull/845
[#851]: https://github.com/huggingface/tokenizers/pull/851
[#585]: https://github.com/huggingface/tokenizers/pull/585
[#793]: https://github.com/huggingface/tokenizers/pull/793
[#780]: https://github.com/huggingface/tokenizers/pull/780
[#770]: https://github.com/huggingface/tokenizers/pull/770
[#762]: https://github.com/huggingface/tokenizers/pull/762
[#718]: https://github.com/huggingface/tokenizers/pull/718
[#714]: https://github.com/huggingface/tokenizers/pull/714
[#707]: https://github.com/huggingface/tokenizers/pull/707
[#693]: https://github.com/huggingface/tokenizers/pull/693
[#686]: https://github.com/huggingface/tokenizers/pull/686
[#674]: https://github.com/huggingface/tokenizers/pull/674
[#657]: https://github.com/huggingface/tokenizers/pull/657
[#656]: https://github.com/huggingface/tokenizers/pull/656
[#652]: https://github.com/huggingface/tokenizers/pull/652
[#621]: https://github.com/huggingface/tokenizers/pull/621
[#620]: https://github.com/huggingface/tokenizers/pull/620
[#618]: https://github.com/huggingface/tokenizers/pull/618
[#617]: https://github.com/huggingface/tokenizers/pull/617
[#616]: https://github.com/huggingface/tokenizers/pull/616
[#590]: https://github.com/huggingface/tokenizers/pull/590
[#574]: https://github.com/huggingface/tokenizers/pull/574
[#544]: https://github.com/huggingface/tokenizers/pull/544
[#542]: https://github.com/huggingface/tokenizers/pull/542
[#539]: https://github.com/huggingface/tokenizers/pull/539
[#538]: https://github.com/huggingface/tokenizers/pull/538
[#533]: https://github.com/huggingface/tokenizers/pull/533
[#530]: https://github.com/huggingface/tokenizers/pull/530
[#519]: https://github.com/huggingface/tokenizers/pull/519
[#509]: https://github.com/huggingface/tokenizers/pull/509
[#508]: https://github.com/huggingface/tokenizers/pull/508
[#506]: https://github.com/huggingface/tokenizers/pull/506
[#500]: https://github.com/huggingface/tokenizers/pull/500
[#498]: https://github.com/huggingface/tokenizers/pull/498
[#492]: https://github.com/huggingface/tokenizers/pull/492
[#481]: https://github.com/huggingface/tokenizers/pull/481
[#480]: https://github.com/huggingface/tokenizers/pull/480
[#477]: https://github.com/huggingface/tokenizers/pull/477
[#476]: https://github.com/huggingface/tokenizers/pull/476
[#470]: https://github.com/huggingface/tokenizers/pull/470
[#464]: https://github.com/huggingface/tokenizers/pull/464
[#459]: https://github.com/huggingface/tokenizers/pull/459
[#426]: https://github.com/huggingface/tokenizers/pull/426
[#420]: https://github.com/huggingface/tokenizers/pull/420
[#417]: https://github.com/huggingface/tokenizers/pull/417
[#416]: https://github.com/huggingface/tokenizers/pull/416
[#403]: https://github.com/huggingface/tokenizers/pull/403
[#394]: https://github.com/huggingface/tokenizers/pull/394
[#389]: https://github.com/huggingface/tokenizers/pull/389
[#379]: https://github.com/huggingface/tokenizers/pull/379
[#378]: https://github.com/huggingface/tokenizers/pull/378
[#363]: https://github.com/huggingface/tokenizers/pull/363
[#362]: https://github.com/huggingface/tokenizers/pull/362
[#360]: https://github.com/huggingface/tokenizers/pull/360
[#355]: https://github.com/huggingface/tokenizers/pull/355
[#333]: https://github.com/huggingface/tokenizers/pull/333
[#330]: https://github.com/huggingface/tokenizers/pull/330
[#329]: https://github.com/huggingface/tokenizers/pull/329
[#311]: https://github.com/huggingface/tokenizers/pull/311
[#309]: https://github.com/huggingface/tokenizers/pull/309
[#298]: https://github.com/huggingface/tokenizers/pull/298
[#292]: https://github.com/huggingface/tokenizers/pull/292
[#289]: https://github.com/huggingface/tokenizers/pull/289
[#286]: https://github.com/huggingface/tokenizers/pull/286
[#280]: https://github.com/huggingface/tokenizers/pull/280
[#276]: https://github.com/huggingface/tokenizers/pull/276
[#273]: https://github.com/huggingface/tokenizers/pull/273
[#272]: https://github.com/huggingface/tokenizers/pull/272
[#249]: https://github.com/huggingface/tokenizers/pull/249
[#239]: https://github.com/huggingface/tokenizers/pull/239
[#236]: https://github.com/huggingface/tokenizers/pull/236
[#234]: https://github.com/huggingface/tokenizers/pull/234
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205
[#197]: https://github.com/huggingface/tokenizers/pull/197
[#193]: https://github.com/huggingface/tokenizers/pull/193
[#190]: https://github.com/huggingface/tokenizers/pull/190
[#188]: https://github.com/huggingface/tokenizers/pull/188
[#187]: https://github.com/huggingface/tokenizers/issues/187
[#175]: https://github.com/huggingface/tokenizers/issues/175
[#174]: https://github.com/huggingface/tokenizers/issues/174
[#165]: https://github.com/huggingface/tokenizers/pull/165
[#163]: https://github.com/huggingface/tokenizers/issues/163
[#160]: https://github.com/huggingface/tokenizers/issues/160
[#156]: https://github.com/huggingface/tokenizers/pull/156
[#152]: https://github.com/huggingface/tokenizers/issues/152
[#149]: https://github.com/huggingface/tokenizers/issues/149
[#145]: https://github.com/huggingface/tokenizers/issues/145
[#139]: https://github.com/huggingface/tokenizers/issues/139
[#137]: https://github.com/huggingface/tokenizers/issues/137
[#136]: https://github.com/huggingface/tokenizers/pull/136
[#134]: https://github.com/huggingface/tokenizers/issues/134
[#131]: https://github.com/huggingface/tokenizers/issues/131
[#99]: https://github.com/huggingface/tokenizers/pull/99