Commit Graph

1759 Commits

Author SHA1 Message Date
72a1973cd1 chore: Remove CLI - this was originally intended for local development (#1442) 2024-02-13 04:05:43 +01:00
7f49f20ab0 version = "0.15.3-dev-0" 2024-02-12 09:48:00 +09:00
c893204c45 Efficient Replace normalizer (#1413)
* new Replace work

* clean up

* clean up

* typo

* cargo fmt

* Clippy.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-02-06 14:36:44 +01:00
4a8105c366 Convert word counts to u64 (#1433)
* Convert word counts to u64

* More spots needed to compile
2024-02-06 03:39:12 +01:00
67fe59c88d chore: Update dependencies to latest supported versions (#1441) 2024-01-22 17:54:37 +01:00
8f73fe9515 update dev version to 0.15.2-dev.0 2024-01-22 15:34:57 +01:00
accd0650b8 Update release for python3.12 windows (#1438)
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-01-19 15:56:47 +01:00
6a77d4859b Encode special tokens (#1437)
* add doc in the code

* add option to skip special tokens

* nits

* add api dummy for now

* Fmt.

* Fix fmt.

* Fix the stub.

* add a test

* add a test in python

* style it

* nits

* add getter and setters

* stub

* update python test

* fmt

* last nit

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-01-19 12:43:43 +01:00
888dd4bc65 pyo3: update to 0.20 (#1386)
Co-authored-by: Mike Lui <mikelui@meta.com>
2024-01-11 17:03:13 +01:00
8939d4e26d Bump follow-redirects in /tokenizers/examples/unstable_wasm/www (#1430)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.1 to 1.15.4.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.1...v1.15.4)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-10 12:04:48 +01:00
43b31a83c7 Fix make bench. (#1428) 2024-01-08 09:53:51 +01:00
f1c23b8680 Add quick doc to byte_level.rs (#1420)
* Add quick doc to byte_level.rs

* Address PR comments
2024-01-03 10:25:07 +01:00
11462596d1 Faster HF dataset iteration in docs (#1414)
* Faster HF dataset iteration in docs

* Nit
2023-12-14 16:12:56 +01:00
8edec536a7 Fix doc links in readme (#1367)
* Fix doc links in readme

* even better?
2023-12-09 12:14:54 +01:00
8f9b945c75 Stale bot. (#1404) 2023-12-05 14:11:37 +01:00
daf361676b Derive Clone on Tokenizer, add Encoding.into_tokens() method (#1381)
* Add `into_tokens()` method

* derive clone

* Update tokenizers/src/tokenizer/encoding.rs

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-11-20 09:56:29 +01:00
e3bcef288b update to version = "0.15.1-dev0" (#1390)
* Apply suggestions from code review
2023-11-15 13:30:58 +01:00
f55822baea [pre_tokenizers] Fix sentencepiece based Metaspace (#1357)
* nits

* allow for legacy behaviour without making any breaking changes

* add a todo

* set to legacy by default

* skip legacy serialization

* push correct update

* lint

* add deserialization test

* add a python test as well

* updates

* fix serialization tests

* nits

* python styling of the tests

* better tests

* fix offsets

* fix imports

* fmt

* update metaspace

* remove TODO

* use enum

* fix some tests

* nits

* use enum

* update tests

* styling

* remove impl from for PrependScheme

* use simple getters and setters

* lint

* update tests

* add test new == new_with_prepend_scheme

* revert a change

* use setters and getters

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* nits

* use copy rather than ref

* nits format

* more nits

* allow option string

* enforce First Never Always camel cased

* nits

* refactor

* update test as well

* fmt

* nits

* properly error out

* Update bindings/python/src/pre_tokenizers.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* suggestion changes

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-11-14 18:05:07 +01:00
ee2af9e99a Allow huggingface_hub<1.0 (#1385) 2023-11-10 13:51:07 +01:00
648b33a09e Allow hf_hub 0.18 (#1383) 2023-11-06 14:12:05 +01:00
c718c53bb9 Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node (#1370)
Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse) from 7.22.11 to 7.23.2.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse)

---
updated-dependencies:
- dependency-name: "@babel/traverse"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-25 08:14:32 +02:00
985d49ae64 fix: remove useless token (#1371) 2023-10-19 14:29:01 +02:00
0d8c57da48 fix a clerical error in the comment (#1356) 2023-10-10 21:31:44 +02:00
4322056e6e Preparing release. (#1355)
* Preparing release.

* Fix new clippy
2023-10-06 12:56:36 +02:00
aed491df8c Fixing the progressbar. (#1353)
* Fixing the progressbar.

* Upgrade deps.

* Update cargo audit

* Ssh this action.

* Fixing esaxx by using slower rust version.

* Trying the new esaxx version.

* Publish.

* Get cache again.
2023-10-05 15:33:58 +02:00
7e8e69a22c Let's allow hf_hub < 1.0 (#1344)
* Let's allow hf_hub < 1.0

* Update bindings/python/pyproject.toml
2023-10-02 14:30:10 +02:00
18bd5e8f9d Added ability to inspect a 'Sequence' pre-tokenizer. (#1341)
* Added ability to inspect a 'Sequence' pre-tokenizer.

* Added ability to inspect a 'Sequence' pre-tokenizer.

* Added ability to inspect a 'Sequence' pre-tokenizer.

* Linting error.

* Fix.

* Revert rename,
2023-09-21 08:10:16 +02:00
2c565e42c7 update package version for dev (#1339) 2023-09-07 16:19:24 +02:00
3dce63f062 Merge pull request #1335 from ArthurZucker/update-added-tokens
Update added tokens
2023-09-07 12:48:54 +02:00
efec086f35 get_added_tokens_decoder returns BTREEMap 2023-09-06 12:24:30 +00:00
a7ace4480d python stub.py 2023-09-05 17:33:14 +00:00
f435af8b71 linting 2023-09-05 16:43:06 +00:00
26fdfc2bc3 style 2023-09-05 16:42:45 +00:00
b57e1c3f5d #[allow(dead_code)] // Suppress the "method is never used" warning 2023-09-05 16:42:22 +00:00
c3fa75fa0e nits 2023-09-05 15:40:13 +00:00
08af8ea9c3 make tests happy 2023-09-05 15:37:09 +00:00
531b06f6db update the get_vocab_size to compute actual length of the get_vocab function 2023-09-05 15:19:50 +00:00
f1da83f358 add support for get_added_tokens_decoder 2023-09-05 14:49:29 +00:00
e5fc051ad2 update 2023-09-05 13:34:43 +00:00
93b37f36dc styling 2023-09-04 20:54:55 +00:00
058e34b421 make special editable as well 2023-09-04 20:54:29 +00:00
2291c89896 python stub.py 2023-09-04 19:49:36 +00:00
b235f85527 clippy 2023-09-04 19:31:48 +00:00
9aab096da8 fmt 2023-09-04 19:31:05 +00:00
a59bb76aa1 update and todo 2023-09-04 19:21:38 +00:00
c599db1421 nits 2023-09-04 19:11:19 +00:00
d4008b0d7a clippy 2023-09-04 19:11:05 +00:00
b117ac7f16 updates 2023-09-04 19:10:22 +00:00
a53dff9bc5 make content writable in python 2023-09-04 18:18:21 +00:00
d9829cdc6e fix more tests 2023-09-04 17:22:27 +00:00