Commit Graph

1665 Commits

Author SHA1 Message Date
a4cf53f6a7 Update CHANGELOG. 2022-01-17 09:56:56 +01:00
ab9a2f3100 Update versions. 2022-01-17 09:40:01 +01:00
4a750f1a57 Fixing Punctuation deserialize without argument. (#882) 2022-01-17 09:27:22 +01:00
b18b572ed2 Bump shelljs from 0.8.4 to 0.8.5 in /bindings/node (#881)
Bumps [shelljs](https://github.com/shelljs/shelljs) from 0.8.4 to 0.8.5.
- [Release notes](https://github.com/shelljs/shelljs/releases)
- [Changelog](https://github.com/shelljs/shelljs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/shelljs/shelljs/compare/v0.8.4...v0.8.5)

---
updated-dependencies:
- dependency-name: shelljs
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-01-17 09:26:09 +01:00
cabbecb96c add python3.10 release (#877)
* add missing python3.9 classifier

* add python3.10 release

* run tests on 3.10

* Revert "run tests on 3.10"

This reverts commit ceed64249e54b6ec622b06c59bf47da7c6dfc1b0.
2022-01-12 09:42:13 +01:00
076319d542 Aho corasick version for many added tokens. (#871)
* Aho corasick version.

* Remove test file.

* Compile on `stable`.
2022-01-06 16:04:51 +01:00
fb837b4adb Fix wordlevel encode <unk> (#870)
* Fix wordlevel encode `<unk>`

* Better unit test name

* Refactor
2022-01-06 16:04:16 +01:00
8e0d66a254 New python version. 2022-01-04 14:58:02 +01:00
6972e49f1d Fix the clippy warnings. (#869) 2022-01-04 14:32:07 +01:00
1054e243e2 Fix invalid continuing subword prefix. (#864)
* Creating failing test for invalid continuing subword prefix.

* Test in rust + the associated fix.

* Clippy.

* Black.
2022-01-04 14:25:35 +01:00
4122a33f09 Fixing missing direction in TruncationParams. (#868) 2022-01-04 14:21:46 +01:00
7069988ffe Update to 0.11.1 2021-12-28 13:59:31 +01:00
152880ab3e Adding truncation_side within TruncationParams. (#860)
* Add truncation to enable_truncation

* Fix typo

* Adding truncation_side within `TruncationParams`.

* Node serialization of this direction param.

* Update the test.

* Fixing warnings/lint.

* Adding stuff (can't local debug :( )

* Slow loop... ;(

* Stub.py.

Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>
2021-12-28 12:37:06 +01:00
c4c9de23a5 Feature: Handle invalid truncate direction (#858)
* refacto: TruncateDirection -> TruncationDirection

* feat(node): invalid direction will throw

* feat(python): invalid direction will throw

* Update bindings/node/lib/bindings/raw-encoding.test.ts

* Update bindings/python/tests/bindings/test_encoding.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2021-12-27 14:31:57 +01:00
38a85b2112 Last touches for conda hopefully
- Missing env activation for manylinux + upload
2021-12-24 08:05:09 +01:00
943f4ef469 Preparing for 0.11.0 Re-release. (#856)
* Starting from master again.

Upgrade libssl everywhere on quay

Extra is ubuntu based (running the quay in a container).

making only extra run + attempt to fix ssl update.

Extra with newer openssl versions.

`-y`.

Use checkout@v2 + remove `-` from environment name.

Debugging back the conda release..

Attempt to use `base` env.

3.7 requires `activate-environment: true`.

MacOS and windows don't run on manylinux.

Remove yum on windows/macOs.

Miniconda doesn't like manylinux2014 anymore ?

Attempting different approach for manylinux + conda.

Use wget.

Extra bracket.

Executing $filename

Activate the env.

Activate the env on every step that requires it.

Openssl-devel.

Activating env for extracting version ?

Retest all workflows.

Manylinux2010 requires checkout@v1

Run on tag for extra and conda again.

openssl-devel.

* Putting back into deploy state.

* Adding links in CHANGELOG.

* Remove clippy from changelog.
2021-12-23 16:43:48 +01:00
04368b1998 Truncate Right (#841)
* feat(tokenizers): add truncate test case

* !feat(tokenizer): truncate right

* refacto(tokenizers): clippy

* feat(bindings): update bindings for truncate()

* fix(tokenizers): remove unsafe code

* refacto(tokenizers): truncate direction

* truncate direction enum
* compute parts ranges beforehand
* 2n space because encoding is dropped at the end of procedure
* update bindings
* add pip install in python bindings' make test

* fix(node): clippy asks to use unwrap_or_else

* fix(node): lint

* refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)>

* refacto(bindings): add match syntax

* refacto(tokenizers): use mem::replace instead of mem::swap

* refacto(tokenizers): assign value the normal way
2021-12-23 13:34:21 +01:00
362df327b0 Adding Decoders to the API doc in Python. (#845) 2021-12-20 10:53:58 +01:00
4759700da8 Fixing interaction between is_pretokenized and trim_offsets. (#844) 2021-12-20 10:53:46 +01:00
31dd4364f0 Feature gate http-deps (#850)
* Feature gate http-deps

* Default features cleanup

* Review fixups

* One more import fix
2021-12-20 10:53:09 +01:00
b240ccb68a Updating doc with real links. (#851)
* Updating doc with real links.

* Remove cache to make it build ?
2021-12-17 17:50:24 +01:00
c1100ec542 Clippy fixes. (#846)
* Clippy fixes.

* Drop support for Python 3.6

* Remove other 3.6

* Re-enabling caches for build (5h + seems too long and issue seems
solved)

https://github.com/actions/virtual-environments/issues/572

* `npm audit fix`.

* Fix yaml ?

* Pyarrow issue fixed: https://github.com/huggingface/datasets/pull/2268

* Installing dev libraries.

* Install python dev elsewhere ?

* Typo.

* No sudo.

* ...

* Testing the GH again.

* Maybe v2 will fix ?

* Fixing tests on MacOS Python 3.8+
2021-12-15 15:55:48 +01:00
1dc19e0dd4 Fix Python README example 2021-10-07 16:56:48 +02:00
b0ee27847f Python - Prepare for release 0.11.0 (#799) 2021-09-08 03:15:47 -04:00
0a37bd8d55 Attempt at fixing Conda builds
Ref #585
2021-09-08 08:56:58 +02:00
fd316bdc61 Update esaxx-rs to 0.1.7 to fix building on windows 2021-09-02 20:11:27 +02:00
36204c8dde Exclude node 15.x for windows 2021-09-02 16:11:41 +02:00
884bfb7970 Prepare node release (#794)
* Node - Update changelog for release

* Update node release to add v14 & v15

Co-authored-by: Huan (李卓桓) <zixia@zixia.net>

* Node - Update version number

* Node - Update dependencies

* Node - Lint

Co-authored-by: Huan (李卓桓) <zixia@zixia.net>
2021-09-02 09:58:01 -04:00
b8b584d4e5 Python - Pretty json saving defaults to true (#793)
* Python - Pretty json saving defaults to true

* Update changelog
2021-09-02 08:43:54 -04:00
23cf8c69ae Bump tar from 4.4.17 to 4.4.19 in /bindings/node (#792)
Bumps [tar](https://github.com/npm/node-tar) from 4.4.17 to 4.4.19.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.17...v4.4.19)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-09-02 08:06:54 -04:00
e68aecc442 Python - Update Cargo.lock 2021-09-02 14:04:35 +02:00
c65b72dec7 Rust - Prepare for release 0.11.0 (#789) 2021-08-31 10:57:21 -04:00
35c96e5e3f Add tests for from_pretrained 2021-08-31 09:00:05 -04:00
ad7090a5c7 Improve READMEs for from_pretrained 2021-08-31 09:00:05 -04:00
a4d0f3dd18 Update docs for from_pretrained 2021-08-31 09:00:05 -04:00
528c9a532e Node - Add bindings to Tokenizer.from_pretrained 2021-08-31 09:00:05 -04:00
6f9e867330 Better export for FromPretrainedParameters 2021-08-31 09:00:05 -04:00
e44fdee4a1 Python - Add bindings to Tokenizer.from_pretrained 2021-08-31 09:00:05 -04:00
e71e5be64f Rust - Add from_pretrained on Tokenizer 2021-08-31 09:00:05 -04:00
e7dd6436dd Fix word level tokenizer determinism (#718)
* compare not only counts of words, but if equal also words themselves

* add missing semicolon

* Fix a few clippy warnings and imports

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-13 10:53:39 -04:00
5982498195 Switch git dependencies in Cargo.toml back to regular versions (#728)
* Switch git dependencies in Cargo.toml back to regular versions

rayon-cond turned out to be a rustc bug that has been fixed for a while
(see cuviper/rayon-cond#2), so we can revert the git dependency.

numpy has released the commit in question as part of 0.12.

* Also update Cargo.lock files

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-13 09:32:00 -04:00
e2bf8daa3a Add SplitDelimiterBehavior to Punctuation constructor (#657)
Resolves: #642
2021-08-13 09:19:23 -04:00
c1100dcbe3 Fix typo in documentation (#743)
* Doc - Fix typo (And instance of -> An instance of)

* Add missing text_signature for WordLevel.from_file

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-13 08:08:23 -04:00
71fb73e129 update lexical-core because 0.7.4 doesn't compile (#758)
* update lexical-core because 0.7.4 doesn't compile

Fix the issue as described in https://github.com/rust-lang/rust/issues/81654
2021-08-12 10:34:45 -04:00
6616e699f7 Expand documentation of UnigramTrainer (#770)
* Expand documentation of UnigramTrainer

* Put doc at the source

* Add signature

* make style

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-12 10:12:26 -04:00
da4c7b10e4 Add a way to specify the unknown token in SentencePieceUnigramTokenizer python implem (#762)
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`

* add test that verifies that an exception is raised for the missing unknown token

* style

* add test tokens
2021-08-12 09:42:44 -04:00
46bed542fa Bump path-parse from 1.0.6 to 1.0.7 in /bindings/node (#774)
Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:41:25 -04:00
ab3d3bcbfb Bump tar from 4.4.13 to 4.4.17 in /bindings/node (#775)
Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.17.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.17)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:31:47 -04:00
5d1b0a9381 Bump glob-parent from 5.1.1 to 5.1.2 in /bindings/node (#734)
Bumps [glob-parent](https://github.com/gulpjs/glob-parent) from 5.1.1 to 5.1.2.
- [Release notes](https://github.com/gulpjs/glob-parent/releases)
- [Changelog](https://github.com/gulpjs/glob-parent/blob/main/CHANGELOG.md)
- [Commits](https://github.com/gulpjs/glob-parent/compare/v5.1.1...v5.1.2)

---
updated-dependencies:
- dependency-name: glob-parent
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:21:00 -04:00
96c122ccf6 Bump ws from 7.3.1 to 7.4.6 in /bindings/node (#721)
Bumps [ws](https://github.com/websockets/ws) from 7.3.1 to 7.4.6.
- [Release notes](https://github.com/websockets/ws/releases)
- [Commits](https://github.com/websockets/ws/compare/7.3.1...7.4.6)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-12 09:20:36 -04:00