Preparing 0.12 release. (#967)

* Preparing `0.12` release.
* Fix click version: https://github.com/psf/black/issues/2964
.github/workflows/python.yml

@@ -107,7 +107,7 @@ jobs:
         working-directory: ./bindings/python
         run: |
           source .env/bin/activate
-          pip install black==20.8b1
+          pip install black==20.8b1 click==8.0.4
           make check-style

       - name: Run tests
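The click pin above works around black 20.8b1 crashing with click 8.1, which dropped the private `_unicodefun` module that black touches at startup (the psf/black issue linked in the commit message). A minimal sketch of the failure mode, assuming black 20.8b1 installed alongside click >= 8.1; this is an illustrative check, not project code:

    # Sketch of the breakage behind psf/black#2964 (hypothetical check).
    # black 20.8b1 imports a private click module at startup; click 8.1
    # removed it, so `black` dies before checking any files. Pinning
    # click==8.0.4 keeps the module available.
    try:
        from click import _unicodefun  # removed in click 8.1
        print("click still ships _unicodefun; black 20.8b1 can run")
    except ImportError as err:
        print(f"black 20.8b1 would crash here: {err}")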
bindings/node/CHANGELOG.md

@@ -1,3 +1,15 @@
+## [0.12.0]
+
+Bump minor version because of a breaking change.
+Using `0.12` to match other bindings.
+
+- [#938] **Breaking change**. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
+- [#939] Making the regex in `ByteLevel` pre_tokenizer optional (necessary for BigScience)
+
+- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
+- [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
+- [#961] Added link for Ruby port of `tokenizers`
+
 # [0.8.0](https://github.com/huggingface/tokenizers/compare/node-v0.7.0...node-v0.8.0) (2021-09-02)

 ### BREACKING CHANGES

@@ -142,3 +154,12 @@ The files must now be provided first when calling `tokenizer.train(files, trainer)`
 - Fix default special tokens in `BertWordPieceTokenizer` ([10e2d28](https://github.com/huggingface/tokenizers/commit/10e2d286caf517f0977c04cf8e1924aed90403c9))
 - Fix return type of `getSpecialTokensMask` on `Encoding` ([9770be5](https://github.com/huggingface/tokenizers/commit/9770be566175dc9c44dd7dcaa00a57d0e4ca632b))
 - Actually add special tokens in tokenizers implementations ([acef252](https://github.com/huggingface/tokenizers/commit/acef252dacc43adc414175cfc325668ad1488753))
+
+
+[#938]: https://github.com/huggingface/tokenizers/pull/938
+[#939]: https://github.com/huggingface/tokenizers/pull/939
+[#952]: https://github.com/huggingface/tokenizers/pull/952
+[#954]: https://github.com/huggingface/tokenizers/pull/954
+[#962]: https://github.com/huggingface/tokenizers/pull/962
+[#961]: https://github.com/huggingface/tokenizers/pull/961
+[#960]: https://github.com/huggingface/tokenizers/pull/960
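The Decoder change in [#938] only affects code that drives a decoder directly, since decoders now compose into a chain where each step receives the previous step's output. A minimal sketch of that standalone usage through the Python bindings, assuming the 0.12-era `decode(tokens)` signature:

    # Standalone decoder usage, the case #938 flags as potentially breaking.
    # Assumes tokenizers 0.12 Python bindings: Decoder.decode takes the full
    # token list and returns the assembled string.
    from tokenizers import decoders

    decoder = decoders.WordPiece(prefix="##")
    # "##" continuation pieces are merged into the preceding token.
    print(decoder.decode(["hug", "##ging", "face"]))  # -> "hugging face"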
bindings/node/package.json

@@ -1,6 +1,6 @@
 {
   "name": "tokenizers",
-  "version": "0.8.3",
+  "version": "0.12.0",
   "description": "",
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",
bindings/python/CHANGELOG.md

@@ -4,6 +4,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.12.0]
+
+Bump minor version because of a breaking change.
+
+- [#938] **Breaking change**. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
+- [#939] Making the regex in `ByteLevel` pre_tokenizer optional (necessary for BigScience)
+
+- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
+- [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
+- [#962] Fix tests for python 3.10
+- [#961] Added link for Ruby port of `tokenizers`
+
 ## [0.11.6]

 - [#919] Fixing single_word AddedToken. (regression from 0.11.2)

@@ -360,6 +372,13 @@ delimiter (Works like `.split(delimiter)`)
 - Fix a bug that was causing crashes in Python 3.5


+[#938]: https://github.com/huggingface/tokenizers/pull/938
+[#939]: https://github.com/huggingface/tokenizers/pull/939
+[#952]: https://github.com/huggingface/tokenizers/pull/952
+[#954]: https://github.com/huggingface/tokenizers/pull/954
+[#962]: https://github.com/huggingface/tokenizers/pull/962
+[#961]: https://github.com/huggingface/tokenizers/pull/961
+[#960]: https://github.com/huggingface/tokenizers/pull/960
 [#919]: https://github.com/huggingface/tokenizers/pull/919
 [#916]: https://github.com/huggingface/tokenizers/pull/916
 [#895]: https://github.com/huggingface/tokenizers/pull/895
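Of these, [#939] is directly visible from the Python API: `ByteLevel` normally splits text on a GPT-2 style regex before byte-mapping, and BigScience needed that split disabled. A short sketch, assuming the `use_regex` flag is exposed by the 0.12 bindings:

    # ByteLevel pre-tokenization with and without the split regex (#939).
    # Assumes tokenizers 0.12, where ByteLevel accepts `use_regex`.
    from tokenizers import pre_tokenizers

    with_regex = pre_tokenizers.ByteLevel(add_prefix_space=False)
    without_regex = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)

    text = "hello world"
    print(with_regex.pre_tokenize_str(text))     # two pieces: "hello", "Ġworld"
    print(without_regex.pre_tokenize_str(text))  # one piece spanning the string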
bindings/python/py_src/tokenizers/__init__.py

@@ -1,4 +1,4 @@
-__version__ = "0.11.6"
+__version__ = "0.12.0"

 from typing import Tuple, Union, Tuple, List
 from enum import Enum
bindings/python/setup.py

@@ -7,7 +7,7 @@ extras["docs"] = ["sphinx", "sphinx_rtd_theme", "setuptools_rust"]

 setup(
     name="tokenizers",
-    version="0.11.6",
+    version="0.12.0",
     description="Fast and Customizable Tokenizers",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
tokenizers/CHANGELOG.md

@@ -4,6 +4,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.12.0]
+
+Bump minor version because of a breaking change.
+
+- [#938] **Breaking change**. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
+- [#939] Making the regex in `ByteLevel` pre_tokenizer optional (necessary for BigScience)
+
+- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
+- [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
+- [#961] Added link for Ruby port of `tokenizers`
+- [#960] Feature gate for `cli` and its `clap` dependency
+
 ## [0.11.3]

 - [#919] Fixing single_word AddedToken. (regression from 0.11.2)

@@ -140,6 +152,13 @@ advised, but that's not the question)
 split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug

+
+[#938]: https://github.com/huggingface/tokenizers/pull/938
+[#939]: https://github.com/huggingface/tokenizers/pull/939
+[#952]: https://github.com/huggingface/tokenizers/pull/952
+[#954]: https://github.com/huggingface/tokenizers/pull/954
+[#961]: https://github.com/huggingface/tokenizers/pull/961
+[#960]: https://github.com/huggingface/tokenizers/pull/960
 [#919]: https://github.com/huggingface/tokenizers/pull/919
 [#916]: https://github.com/huggingface/tokenizers/pull/916
 [#884]: https://github.com/huggingface/tokenizers/pull/884
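The [#952] fix is easiest to check from the Python bindings, where the trainer's `vocab_size` budget now covers added/special tokens instead of being overshot by them. A hedged sketch; `corpus.txt` is a placeholder training file:

    # Training sketch for the #952 fix: UnigramTrainer's vocab_size budget
    # now counts added/special tokens. Assumes tokenizers 0.12;
    # "corpus.txt" is a hypothetical input file.
    from tokenizers import Tokenizer
    from tokenizers.models import Unigram
    from tokenizers.trainers import UnigramTrainer

    tokenizer = Tokenizer(Unigram())
    trainer = UnigramTrainer(vocab_size=100, special_tokens=["<pad>", "<unk>"])
    tokenizer.train(["corpus.txt"], trainer=trainer)

    # With the fix, the reported size respects the requested budget,
    # special tokens included.
    print(tokenizer.get_vocab_size())  # expected: 100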
tokenizers/Cargo.toml

@@ -2,7 +2,7 @@
 authors = ["Anthony MOI <m.anthony.moi@gmail.com>"]
 edition = "2018"
 name = "tokenizers"
-version = "0.11.3"
+version = "0.12.0"
 homepage = "https://github.com/huggingface/tokenizers"
 repository = "https://github.com/huggingface/tokenizers"
 documentation = "https://docs.rs/tokenizers/"