Update changelogs and bump version for python release

@@ -4,16 +4,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [0.8.0.dev1]
+## [0.8.0.dev2]
 
-### Added
-- [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
-  This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
+### Fixed
+- [#286]: Fix various crash when training a BPE model
 
 ### Added
 - [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
   This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
 - [#273]: `Tokenizer` and its parts are now pickable
+- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
+  activation of the Tensor Cores, while ensuring padding to a multiple of 8. Use with
+  `enable_padding(pad_to_multiple_of=8)` for example.
 
 ### Changed
 - Improved errors generated during truncation: When the provided max length is too low are
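As a usage illustration for the [#289] entry above (not part of this commit), here is a minimal sketch of the new padding option. It assumes a tokenizer already saved to a placeholder `tokenizer.json` file via the [#272] serialization support; the texts are arbitrary.

```python
from tokenizers import Tokenizer

# Placeholder path: any tokenizer previously saved with the [#272] serialization support.
tokenizer = Tokenizer.from_file("tokenizer.json")

# [#289]: pad each encoded batch up to a length that is a multiple of 8,
# which keeps sequence lengths Tensor Core friendly.
tokenizer.enable_padding(pad_to_multiple_of=8)

encodings = tokenizer.encode_batch(["A short sentence.", "A slightly longer sentence."])
for encoding in encodings:
    # Both encodings share the same padded length, rounded up to a multiple of 8.
    print(len(encoding.ids), encoding.tokens)
```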
@@ -183,6 +185,8 @@ delimiter (Works like `.split(delimiter)`)
 - Fix a bug with the IDs associated with added tokens.
 - Fix a bug that was causing crashes in Python 3.5
 
+[#289]: https://github.com/huggingface/tokenizers/pull/289
+[#286]: https://github.com/huggingface/tokenizers/pull/286
 [#280]: https://github.com/huggingface/tokenizers/pull/280
 [#276]: https://github.com/huggingface/tokenizers/pull/276
 [#273]: https://github.com/huggingface/tokenizers/pull/273

bindings/python/Cargo.lock (generated)
@@ -622,7 +622,7 @@ dependencies = [
 
 [[package]]
 name = "tokenizers-python"
-version = "0.8.0-dev1"
+version = "0.8.0-dev2"
 dependencies = [
  "pyo3 0.9.2 (registry+https://github.com/rust-lang/crates.io-index)",
  "rayon 1.3.0 (registry+https://github.com/rust-lang/crates.io-index)",

@@ -1,6 +1,6 @@
 [package]
 name = "tokenizers-python"
-version = "0.8.0-dev1"
+version = "0.8.0-dev2"
 authors = ["Anthony MOI <m.anthony.moi@gmail.com>"]
 edition = "2018"
 
@@ -6,7 +6,7 @@ extras["testing"] = ["pytest"]
 
 setup(
     name="tokenizers",
-    version="0.8.0.dev1",
+    version="0.8.0.dev2",
     description="Fast and Customizable Tokenizers",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",

@@ -1,4 +1,4 @@
-__version__ = "0.8.0.dev1"
+__version__ = "0.8.0.dev2"
 
 from typing import Tuple, Union, Tuple, List
 
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 - [#236]: Fix a bug with offsets being shifted when there are sub-sequences (Usually with
   special tokens and/or added tokens in the sequence).
+- [#286]: Fix various crash when training a BPE model
 
 ### Changed
 - [#234]: Completely changed the alignement mappings available on `Encoding`. Previous mappings
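The [#286] fix above concerns BPE training. For context (not part of this commit), a minimal training run with the Python bindings looks roughly like the sketch below; `corpus.txt` is a placeholder for any plain-text training file and the hyperparameters are only illustrative.

```python
from tokenizers import ByteLevelBPETokenizer

# "corpus.txt" is a placeholder for any plain-text training file;
# the hyperparameters are only illustrative.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>"],
)

# Quick sanity check on the freshly trained model.
print(tokenizer.encode("Training a small BPE model.").tokens)
```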
@@ -35,6 +36,8 @@ implementation from GPT-2
 on this front.
 - [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...)
   using serde. It is now easy to save/load an entire tokenizer.
+- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
+  activation of the Tensor Cores, while ensuring padding to a multiple of 8.
 
 ### How to migrate
 - Replace any `XXX_to_YYY_offsets()` method call by any of the new ones.
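To illustrate the [#272] and [#273] entries (again, not part of this commit), here is a minimal sketch of the save/load and pickling round-trips; `tokenizer.json` is a placeholder path for a previously saved tokenizer.

```python
import pickle

from tokenizers import Tokenizer

# Placeholder path: a tokenizer previously saved to disk.
tokenizer = Tokenizer.from_file("tokenizer.json")

# [#272]: the whole pipeline (model, normalizer, pre-tokenizer, ...) round-trips
# through a JSON string via to_str / from_str.
as_json = tokenizer.to_str()
restored = Tokenizer.from_str(as_json)

# [#273]: the Tokenizer and its parts can also go through pickle.
clone = pickle.loads(pickle.dumps(tokenizer))
```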
@@ -109,6 +112,8 @@ advised, but that's not the question)
   split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug
 
+[#289]: https://github.com/huggingface/tokenizers/pull/289
+[#286]: https://github.com/huggingface/tokenizers/pull/286
 [#280]: https://github.com/huggingface/tokenizers/pull/280
 [#276]: https://github.com/huggingface/tokenizers/pull/276
 [#272]: https://github.com/huggingface/tokenizers/pull/272