Update changelogs and bump version for python release

Anthony MOI
2020-06-03 18:27:29 -04:00
parent 950b23c89b
commit d00ac60162
6 changed files with 17 additions and 8 deletions

View File

@@ -4,16 +4,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [0.8.0.dev1]
+## [0.8.0.dev2]

-### Added
-- [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
-This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
+### Fixed
+- [#286]: Fix various crash when training a BPE model

 ### Added
 - [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
 This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
 - [#273]: `Tokenizer` and its parts are now pickable
+- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
+activation of the Tensor Cores, while ensuring padding to a multiple of 8. Use with
+`enable_padding(pad_to_multiple_of=8)` for example.

 ### Changed
 - Improved errors generated during truncation: When the provided max length is too low are
@@ -183,6 +185,8 @@ delimiter (Works like `.split(delimiter)`)
 - Fix a bug with the IDs associated with added tokens.
 - Fix a bug that was causing crashes in Python 3.5

+[#289]: https://github.com/huggingface/tokenizers/pull/289
+[#286]: https://github.com/huggingface/tokenizers/pull/286
 [#280]: https://github.com/huggingface/tokenizers/pull/280
 [#276]: https://github.com/huggingface/tokenizers/pull/276
 [#273]: https://github.com/huggingface/tokenizers/pull/273
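To make the [#289] entry above concrete, here is a minimal usage sketch. Only `enable_padding(pad_to_multiple_of=8)` and `Tokenizer.from_file` come from the changelog itself; the `tokenizer.json` path and the sample sentences are placeholders.

```python
# Sketch of the pad_to_multiple_of option added in #289.
# "tokenizer.json" stands in for any tokenizer serialized via #272.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Round every padded batch up to a multiple of 8 so sequence lengths
# match Tensor Core-friendly shapes.
tokenizer.enable_padding(pad_to_multiple_of=8)

encodings = tokenizer.encode_batch(["Hello world!", "A slightly longer sentence."])
print([len(e.ids) for e in encodings])  # every length is a multiple of 8
```

With no fixed `length`, padding goes to the longest sequence in the batch and is then rounded up to the next multiple of 8.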

View File

@@ -622,7 +622,7 @@ dependencies = [

 [[package]]
 name = "tokenizers-python"
-version = "0.8.0-dev1"
+version = "0.8.0-dev2"
 dependencies = [
  "pyo3 0.9.2 (registry+https://github.com/rust-lang/crates.io-index)",
  "rayon 1.3.0 (registry+https://github.com/rust-lang/crates.io-index)",

View File

@@ -1,6 +1,6 @@
 [package]
 name = "tokenizers-python"
-version = "0.8.0-dev1"
+version = "0.8.0-dev2"
 authors = ["Anthony MOI <m.anthony.moi@gmail.com>"]
 edition = "2018"

View File

@@ -6,7 +6,7 @@ extras["testing"] = ["pytest"]

 setup(
     name="tokenizers",
-    version="0.8.0.dev1",
+    version="0.8.0.dev2",
     description="Fast and Customizable Tokenizers",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",

View File

@@ -1,4 +1,4 @@
__version__ = "0.8.0.dev1" __version__ = "0.8.0.dev2"
from typing import Tuple, Union, Tuple, List from typing import Tuple, Union, Tuple, List
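The `__version__` bump keeps the runtime metadata in line with `setup.py`. A trivial sanity check after installing the release:

```python
# The installed package should report the version set in this commit.
import tokenizers

print(tokenizers.__version__)  # expected: 0.8.0.dev2
```

Note that pip only resolves a `.dev` release when it is pinned exactly (e.g. `tokenizers==0.8.0.dev2`) or when `--pre` is passed.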

View File

@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 ### Fixed
 - [#236]: Fix a bug with offsets being shifted when there are sub-sequences (Usually with
 special tokens and/or added tokens in the sequence).
+- [#286]: Fix various crash when training a BPE model

 ### Changed
 - [#234]: Completely changed the alignement mappings available on `Encoding`. Previous mappings
@@ -35,6 +36,8 @@ implementation from GPT-2
 on this front.
 - [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...)
 using serde. It is now easy to save/load an entire tokenizer.
+- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
+activation of the Tensor Cores, while ensuring padding to a multiple of 8.

 ### How to migrate
 - Replace any `XXX_to_YYY_offsets()` method call by any of the new ones.
@@ -109,6 +112,8 @@ advised, but that's not the question)
 split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug

+[#289]: https://github.com/huggingface/tokenizers/pull/289
+[#286]: https://github.com/huggingface/tokenizers/pull/286
 [#280]: https://github.com/huggingface/tokenizers/pull/280
 [#276]: https://github.com/huggingface/tokenizers/pull/276
 [#272]: https://github.com/huggingface/tokenizers/pull/272
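Since both changelogs highlight serialization ([#272]) and pickling ([#273]), here is a hedged sketch of how those features are used from Python. `from_str` and `from_file` are named in the entries above; the empty `BPE()` model and the saving counterpart `to_str` are assumptions about the 0.8.0.dev2 API.

```python
# Sketch of serialization (#272) and pickling (#273).
import pickle

from tokenizers import Tokenizer
from tokenizers.models import BPE

# Assumption: an empty BPE model just to get a Tokenizer instance
# (older releases spelled this BPE.empty()).
tokenizer = Tokenizer(BPE())

# #272: the full tokenizer round-trips through a JSON string...
serialized = tokenizer.to_str()
restored = Tokenizer.from_str(serialized)

# ...and Tokenizer.from_file(path) loads one saved on disk.

# #273: the Tokenizer and all its parts also survive pickle.
clone = pickle.loads(pickle.dumps(tokenizer))
```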