29fef1e7aa
[remove black] And use ruff ( #1436 )
...
* nits
* Fixing deps.
* Ruff update.
* Import order matters.
* Fix.
* Revert ruff fix.
* Visualizer.
* Putting back the imports.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-03-12 11:24:21 +01:00
72a1973cd1
chore: Remove CLI - this was originally intended for local development ( #1442 )
2024-02-13 04:05:43 +01:00
7f49f20ab0
version = "0.15.3-dev-0"
2024-02-12 09:48:00 +09:00
c893204c45
Efficient Replace normalizer ( #1413 )
...
* new Replace work
* clean up
* clean up
* typo
* cargo fmt
* Clippy.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-02-06 14:36:44 +01:00
4a8105c366
Convert word counts to u64 ( #1433 )
...
* Convert word counts to u64
* More spots needed to compile
2024-02-06 03:39:12 +01:00
67fe59c88d
chore: Update dependencies to latest supported versions ( #1441 )
2024-01-22 17:54:37 +01:00
8f73fe9515
update dev version to 0.15.2-dev.0
2024-01-22 15:34:57 +01:00
accd0650b8
Update release for python3.12 windows ( #1438 )
...
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-01-19 15:56:47 +01:00
6a77d4859b
Encode special tokens ( #1437 )
...
* add doc in the code
* add option to skip special tokens
* nits
* add api dummy for now
* Fmt.
* Fix fmt.
* Fix the stub.
* add a test
* add a test in python
* style it
* nits
* add getter and setters
* stub
* update python test
* fmt
* last nit
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-01-19 12:43:43 +01:00
888dd4bc65
pyo3: update to 0.20 ( #1386 )
...
Co-authored-by: Mike Lui <mikelui@meta.com >
2024-01-11 17:03:13 +01:00
8939d4e26d
Bump follow-redirects in /tokenizers/examples/unstable_wasm/www ( #1430 )
...
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects ) from 1.15.1 to 1.15.4.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases )
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.1...v1.15.4 )
---
updated-dependencies:
- dependency-name: follow-redirects
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-10 12:04:48 +01:00
43b31a83c7
Fix make bench. ( #1428 )
2024-01-08 09:53:51 +01:00
f1c23b8680
Add quick doc to byte_level.rs ( #1420 )
...
* Add quick doc to byte_level.rs
* Address PR comments
2024-01-03 10:25:07 +01:00
11462596d1
Faster HF dataset iteration in docs ( #1414 )
...
* Faster HF dataset iteration in docs
* Nit
2023-12-14 16:12:56 +01:00
8edec536a7
Fix doc links in readme ( #1367 )
...
* Fix doc links in readme
* even better?
2023-12-09 12:14:54 +01:00
8f9b945c75
Stale bot. ( #1404 )
2023-12-05 14:11:37 +01:00
daf361676b
Derive Clone on Tokenizer, add Encoding.into_tokens() method ( #1381 )
...
* Add `into_tokens()` method
* derive clone
* Update tokenizers/src/tokenizer/encoding.rs
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2023-11-20 09:56:29 +01:00
e3bcef288b
udpate to version = "0.15.1-dev0" ( #1390 )
...
* Apply suggestions from code review
2023-11-15 13:30:58 +01:00
f55822baea
[pre_tokenizers] Fix sentencepiece based Metaspace ( #1357 )
...
* nits
* allow for legacy beahaviour without making any breaking changes
* add a todo
* set to legacy by default
* skip legacy serialization
* push correct update
* lint
* add deserialization test
* add a python test as well
* updates
* fix serialization tests
* nits
* python stylijng of the tests
* better tests
* fix offsets
* fix imports
* fmt
* update metaspace
* remove TODO
* use enm
* fix some tses
* nits
* use enum
* update tests
* syling
* remove impl from for PrependScheme
* use simple getters and setters
* lint
* update tests
* add test new == new_with_prepend_scheme
* revert a change
* use setters and getterts
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* nits
* use copy rather than ref
* nits format
* more nits
* allow option string
* enforce First Never Always camel cased
* nits
* refactor
* update test as well
* fmt
* nits
* properly error out
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* suggestion changes
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2023-11-14 18:05:07 +01:00
ee2af9e99a
Allow huggingface_hub<1.0 ( #1385 )
2023-11-10 13:51:07 +01:00
648b33a09e
Allow hf_hub 0.18 ( #1383 )
2023-11-06 14:12:05 +01:00
c718c53bb9
Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node ( #1370 )
...
Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse ) from 7.22.11 to 7.23.2.
- [Release notes](https://github.com/babel/babel/releases )
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md )
- [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse )
---
updated-dependencies:
- dependency-name: "@babel/traverse"
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-25 08:14:32 +02:00
985d49ae64
fix: remove useless token ( #1371 )
2023-10-19 14:29:01 +02:00
0d8c57da48
fix a clerical error in the comment ( #1356 )
2023-10-10 21:31:44 +02:00
4322056e6e
Preparing release. ( #1355 )
...
* Preparing release.
* Fix new clippy
2023-10-06 12:56:36 +02:00
aed491df8c
Fixing the progressbar. ( #1353 )
...
* Fixing the progressbar.
* Upgrade deps.
* Update cargo audit
* Ssh this action.
* Fixing esaxx by using slower rust version.
* Trying the new esaxx version.
* Publish.
* Get cache again.
2023-10-05 15:33:58 +02:00
7e8e69a22c
Let's allow hf_hub < 1.0 ( #1344 )
...
* Let's allow hf_hub < 1.0
* Update bindings/python/pyproject.toml
2023-10-02 14:30:10 +02:00
18bd5e8f9d
Added ability to inspect a 'Sequence' pre-tokenizer. ( #1341 )
...
* Added ability to inspect a 'Sequence' pre-tokenizer.
* Added ability to inspect a 'Sequence' pre-tokenizer.
* Added ability to inspect a 'Sequence' pre-tokenizer.
* Linting error.
* Fix.
* Revert rename,
2023-09-21 08:10:16 +02:00
2c565e42c7
update package version for dev ( #1339 )
2023-09-07 16:19:24 +02:00
3dce63f062
Merge pull request #1335 from ArthurZucker/update-added-tokens
...
Update added tokens
2023-09-07 12:48:54 +02:00
efec086f35
get_added_tokens_decoder returns BTREEMap
2023-09-06 12:24:30 +00:00
a7ace4480d
python stub.py
2023-09-05 17:33:14 +00:00
f435af8b71
linting
2023-09-05 16:43:06 +00:00
26fdfc2bc3
style
2023-09-05 16:42:45 +00:00
b57e1c3f5d
#[allow(dead_code)] // Suppress the "method is never used" warning
2023-09-05 16:42:22 +00:00
c3fa75fa0e
nits
2023-09-05 15:40:13 +00:00
08af8ea9c3
make tests happy
2023-09-05 15:37:09 +00:00
531b06f6db
update the get_vocab_size to compute actual length of the get_vocab function
2023-09-05 15:19:50 +00:00
f1da83f358
add support for get_added_tokens_decoder
2023-09-05 14:49:29 +00:00
e5fc051ad2
update
2023-09-05 13:34:43 +00:00
93b37f36dc
styling
2023-09-04 20:54:55 +00:00
058e34b421
make special editable as well
2023-09-04 20:54:29 +00:00
2291c89896
python stub.py
2023-09-04 19:49:36 +00:00
b235f85527
clippy
2023-09-04 19:31:48 +00:00
9aab096da8
fmt
2023-09-04 19:31:05 +00:00
a59bb76aa1
update and todo
2023-09-04 19:21:38 +00:00
c599db1421
nits
2023-09-04 19:11:19 +00:00
d4008b0d7a
cliipy
2023-09-04 19:11:05 +00:00
b117ac7f16
updates
2023-09-04 19:10:22 +00:00
a53dff9bc5
make content writable in python
2023-09-04 18:18:21 +00:00