634 Commits

Author SHA1 Message Date
Nicolas Patry
076319d542 Aho corasick version for many added tokens. (#871)
* Aho corasick version.

* Remove test file.

* Compile on `stable`.
2022-01-06 16:04:51 +01:00
Nicolas Patry
8e0d66a254 New python version. 2022-01-04 14:58:02 +01:00
Nicolas Patry
6972e49f1d Fix the clippy warnings. (#869) 2022-01-04 14:32:07 +01:00
Nicolas Patry
1054e243e2 Fix invalid continuing subword prefix. (#864)
* Creating failing test for invalid continuing subword prefix.

* Test in rust + the associated fix.

* Clippy.

* Black.
2022-01-04 14:25:35 +01:00
Nicolas Patry
4122a33f09 Fixing missing direction in TruncationParams. (#868) 2022-01-04 14:21:46 +01:00
Nicolas Patry
7069988ffe Update to 0.11.1 2021-12-28 13:59:31 +01:00
Nicolas Patry
152880ab3e Adding truncation_side within TruncationParams. (#860)
* Add truncation to enable_truncation

* Fix typo

* Adding truncation_side within `TruncationParams`.

* Node serialization of this direction param.

* Update the test.

* Fixing warnings/lint.

* Adding stuff (can't local debug :( )

* Slow loop... ;(

* Stub.py.

Co-authored-by: Niels Rogge <niels.rogge1@gmail.com>
2021-12-28 12:37:06 +01:00
Luc Georges
c4c9de23a5 Feature: Handle invalid truncate direction (#858)
* refacto: TruncateDirection -> TruncationDirection

* feat(node): invalid direction will throw

* feat(python): invalid direction will throw

* Update bindings/node/lib/bindings/raw-encoding.test.ts

* Update bindings/python/tests/bindings/test_encoding.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2021-12-27 14:31:57 +01:00
Nicolas Patry
943f4ef469 Preparing for 0.11.0 Re-release. (#856)
* Starting from master again.

Upgrade libssl everywhere on quay

Extra is ubuntu based (running the quay in a container).

making only extra run + attempt to fix ssl update.

Extra with newer openssl versions.

`-y`.

Use checkout@v2 + remove `-` from environment name.

Debugging back the conda release..

Attempt to use `base` env.

3.7 requires `activate-environment: true`.

MacOS and windows don't run on manylinux.

Remove yum on windows/macOs.

Miniconda doesn't like manylinux2014 anymore ?

Attempting different approach for manylinux + conda.

Use wget.

Extra bracket.

Executing $filename

Activate the env.

Activate the env on every step that requires it.

Openssl-devel.

Activating env for extracting version ?

Retest all workflows.

Manylinux2010 requires checkout@v1

Run on tag for extra and conda again.

openssl-devel.

* Putting back into deploy state.

* Adding links in CHANGELOG.

* Remove clippy from changelog.
2021-12-23 16:43:48 +01:00
Luc Georges
04368b1998 Truncate Right (#841)
* feat(tokenizers): add truncate test case

* !feat(tokenizer): truncate right

* refacto(tokenizers): clippy

* feat(bindings): update bindings for truncate()

* fix(tokenizers): remove unsafe code

* refacto(tokenizers): truncate direction

* truncate direction enum
* compute parts ranges beforehand
* 2n space because encoding is dropped at the end of procedure
* update bindings
* add pip install in python bindings' make test

* fix(node): clippy asks to use unwrap_or_else

* fix(node): lint

* refacto(tokenizers): replace Vec<Range<usize>> by Vec<(usize, usize)>

* refacto(bindings): add match syntax

* refacto(tokenizers): use mem::replace instead of mem::swap

* refacto(tokenizers): assign value the normal way
2021-12-23 13:34:21 +01:00
Nicolas Patry
c1100ec542 Clippy fixes. (#846)
* Clippy fixes.

* Drop support for Python 3.6

* Remove other 3.6

* Re-enabling caches for build (5h + seems too long and issue seems
solved)

https://github.com/actions/virtual-environments/issues/572

* `npm audit fix`.

* Fix yaml ?

* Pyarrow issue fixed: https://github.com/huggingface/datasets/pull/2268

* Installing dev libraries.

* Install python dev elsewhere ?

* Typo.

* No sudo.

* ...

* Testing the GH again.

* Maybe v2 will fix ?

* Fixing tests on MacOS Python 3.8+
2021-12-15 15:55:48 +01:00
Anthony MOI
1dc19e0dd4 Fix Python README example 2021-10-07 16:56:48 +02:00
Anthony MOI
b0ee27847f Python - Prepare for release 0.11.0 (#799) 2021-09-08 03:15:47 -04:00
Anthony MOI
b8b584d4e5 Python - Pretty json saving defaults to true (#793)
* Python - Pretty json saving defaults to true

* Update changelog
2021-09-02 08:43:54 -04:00
Anthony Moi
e68aecc442 Python - Update Cargo.lock 2021-09-02 14:04:35 +02:00
Anthony Moi
35c96e5e3f Add tests for from_pretrained 2021-08-31 09:00:05 -04:00
Anthony Moi
ad7090a5c7 Improve READMEs for from_pretrained 2021-08-31 09:00:05 -04:00
Anthony Moi
a4d0f3dd18 Update docs for from_pretrained 2021-08-31 09:00:05 -04:00
Anthony Moi
6f9e867330 Better export for FromPretrainedParameters 2021-08-31 09:00:05 -04:00
Anthony Moi
e44fdee4a1 Python - Add bindings to Tokenizer.from_pretrained 2021-08-31 09:00:05 -04:00
Geoffrey Thomas
5982498195 Switch git dependencies in Cargo.toml back to regular versions (#728)
* Switch git dependencies in Cargo.toml back to regular versions

rayon-cond turned out to be a rustc bug that has been fixed for a while
(see cuviper/rayon-cond#2), so we can revert the git dependency.

numpy has released the commit in question as part of 0.12.

* Also update Cargo.lock files

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-13 09:32:00 -04:00
Vlad Artamonov
e2bf8daa3a Add SplitDelimiterBehavior to Punctuation constructor (#657)
Resolves: #642
2021-08-13 09:19:23 -04:00
kingyiusuen
c1100dcbe3 Fix typo in documentation (#743)
* Doc - Fix typo (And instance of -> An instance of)

* Add missing text_signature for WordLevel.from_file

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-13 08:08:23 -04:00
Sylvain Gugger
6616e699f7 Expand documentation of UnigramTrainer (#770)
* Expand documentation of UnigramTrainer

* Put doc at the source

* Add signature

* make style

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-12 10:12:26 -04:00
SaulLu
da4c7b10e4 Add a way to specify the unknown token in SentencePieceUnigramTokenizer python implem (#762)
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`

* add test that verifies that an exception is raised for the missing unknown token

* style

* add test tokens
2021-08-12 09:42:44 -04:00
Nicolas Patry
256a71c1f2 Clippy 1.54. (#773) 2021-08-11 14:43:49 +02:00
Nicolas Patry
d83772d62c Fixing tokenizers with 1.53 (updated some dependencies + clippy) (#764) 2021-07-21 09:58:38 +02:00
Anthony MOI
755e5f5c1e Remove support for Python 3.5 (#714)
* Python - remove support for python 3.5

* revert ci

* revert build-wheels.sh

* Update CHANGELOG.md
2021-05-24 17:31:01 -04:00
Anthony MOI
3a002c1aa8 Python - prepare for release 0.10.3 2021-05-24 16:59:10 -04:00
Nicolas Patry
c046da7679 Fix stripping strings containing Unicode characters (#707)
* Strip seems to have been broken for a while on unicode strings.

- Includes a failing test + fixed it.
- This function could maybe be optimized, we're scanning the string 3 times now,
  and once fully for chars.

* Update CHANGELOG.md

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-24 16:49:59 -04:00
Anthony MOI
4b7f8c2d7c Fix CHANGELOG.md 2021-05-24 16:16:40 -04:00
Lysandre Debut
4b0dc6b947 Fix SPM conversions (#686)
* Fix SPM conversions

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:55:55 -04:00
Nicolas Patry
2e2e7558f7 Add CTC Decoder for Wav2Vec models (#693)
* Rust - add a CTCDecoder as a separate mod

* Adding bindings to Node + Python.

* Clippy update.

* Stub.

* Fixing roberta.json URLs.

* Moving test files to hf.co.

* Update cargo check and clippy to 1.52.

* Inner ':' actually is used for domains in sphinx.

Making `domain` work correctly was just too much work so I went the easy
way and have global roles for the custom rust extension.

* Update struct naming and docs

* Update changelog

Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:30:09 -04:00
Lysandre
e999a7b5f9 Revert "Fix SPM conversions"
This reverts commit e1ffe39764.
2021-04-21 18:09:58 -04:00
Lysandre
e1ffe39764 Fix SPM conversions 2021-04-21 18:09:49 -04:00
Anthony MOI
32b3b7a0f2 Python - Prepare for release 0.10.2 2021-04-05 16:47:55 -04:00
Anthony MOI
e1627654b4 Fix Clippy warnings for Rust 1.51 2021-04-05 16:05:48 -04:00
Anthony MOI
659a835d04 Python - Accept kwargs in Metaspace constructor
This is mainly for backward compatibility with Metaspace objects that used to contain a `str_rep` field
2021-04-05 16:05:48 -04:00
Anthony MOI
0fe9214f44 Fix BPE continuing_subword_prefix 2021-03-18 14:39:52 -04:00
Anthony MOI
f5e9bb89b7 Fix offsets for Precompiled corner case 2021-03-16 15:04:42 -04:00
Anthony MOI
56a9196030 Fix clippy warnings 2021-03-16 12:32:06 -04:00
Anthony MOI
bc8bbf637a Prepare for python v0.10.1 (#625) 2021-02-08 11:45:56 -05:00
Anthony MOI
d96442cbe8 Python - Prepare for release 0.10.1rc1 (#622) 2021-02-04 10:37:00 -05:00
Anthony MOI
57200144ca Python - Fix ByteLevel instantiation from state (#621) 2021-02-04 10:16:05 -05:00
Anthony MOI
a8f756494e Improve Model serialization/deserialization (#620) 2021-02-04 09:59:18 -05:00
Anthony MOI
6a29dbc070 Doc - Hotfix training from iterators tutorial 2021-02-03 15:50:09 -05:00
Anthony MOI
db22cb6315 Python - Fix Normalizer.normalize with PyNormalizedStringRefMut 2021-02-03 15:48:53 -05:00
Anthony MOI
355315e8d3 Rust - Fix offsets produced by Precompiled Normalizer 2021-02-03 15:46:45 -05:00
Anthony MOI
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
Anthony MOI
719bea76b9 Python - Prepare for release 0.10.0 2021-01-12 16:34:04 -05:00