e5d781d5b9
update pyo3 and rust-numpy depends for no-gil/free-threading compat ( #1774 )
...
Signed-off-by: root <root@gpu-xl.lxd >
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
2025-05-27 11:31:58 +02:00
01f8bc834c
clippy ( #1781 )
...
* clippy
* fmt
* rustc?
* fix onig issue
* up
* decode stream default
* jump a release for cargo audit ...
* more clippy stuff
* clippy?
* proper style
* fmt
2025-05-27 11:30:32 +02:00
23e7e42adf
Fix data path in test_continuing_prefix_trainer_mismatch ( #1747 )
2025-05-27 08:48:27 +02:00
cc01186fd7
Fix type notation of merges in BPE Python binding ( #1766 )
2025-05-27 08:23:58 +02:00
f1faec1756
Fix typos in strings and comments ( #1770 )
2025-05-27 08:17:36 +02:00
4383a25787
Update the release builds following 0.21.1. ( #1746 )
...
* Update the release builds following 0.21.1.
* Clippy fix.
2025-03-13 13:01:41 +01:00
fbe3365a13
Update metadata as Python3.7 and Python3.8 support was dropped ( #1724 )
...
* Update metadata as python3.7 and python3.8 support was dropped
* Format pyproject.toml: unify quotes and indentation
2025-02-11 10:52:59 +01:00
c45aebd102
🚨 Support updating template processors ( #1652 )
...
* current updates
* simplify
* set_item works, but `tokenizer._tokenizer.post_processor[1].single = ["$0", "</s>"]` does not !
* fix: `normalizers` deserialization and other refactoring
* fix: `pre_tokenizer` deserialization
* feat: add `__len__` implementation for `normalizer::PySequence`
* feat: add `__setitem__` impl for `normalizers::PySequence`
* feat: add `__setitem__` impl to `pre_tokenizer::PySequence`
* feat: add `__setitem__` impl to `post_processor::PySequence`
* test: add normalizer sequence setter check
* refactor: allow unused `processors::setter` macro
* test: add `__setitem__` test for processors & pretok
* refactor: `unwrap` -> `PyException::new_err()?`
* refactor: fmt
* refactor: remove unnecessary `pub`
* feat(bindings): add missing getters & setters for pretoks
* feat(bindings): add missing getters & setters for processors
* refactor(bindings): rewrite RwLock poison error msg
* refactor: remove debug print
* feat(bindings): add description as to why custom deser is needed
* feat: make post proc sequence elements mutable
* fix(binding): serialization
---------
Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com >
2025-01-28 14:58:35 +01:00
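The note in #1652 that item assignment works while `post_processor[1].single = [...]` does not is consistent with indexing returning a copy of the element rather than a shared handle. A minimal pure-Python sketch of that difference follows; `Template`, `CopyingSequence`, and `SharedSequence` are hypothetical illustration classes, not part of the tokenizers API, and the actual bindings may work differently.

```python
import copy


class Template:
    """Hypothetical stand-in for a template processor with a `single` field."""

    def __init__(self, single):
        self.single = list(single)


class CopyingSequence:
    """Indexing hands back a copy, so attribute writes on the result are lost."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return copy.deepcopy(self._items[index])

    def __setitem__(self, index, value):
        self._items[index] = value


class SharedSequence(CopyingSequence):
    """Indexing hands back the stored element, so attribute writes are kept."""

    def __getitem__(self, index):
        return self._items[index]


copying = CopyingSequence([Template(["$0"])])
copying[0].single = ["$0", "</s>"]      # mutates a temporary copy only
lost = copying[0].single                # the stored element is unchanged

copying[0] = Template(["$0", "</s>"])   # __setitem__ replaces the stored element
replaced = copying[0].single

shared = SharedSequence([Template(["$0"])])
shared[0].single = ["$0", "</s>"]       # mutates the stored element itself
kept = shared[0].single
```

In this sketch, making elements shared (the `SharedSequence` variant) is what lets attribute assignment through an index persist, which mirrors the "make post proc sequence elements mutable" step in the commit body.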
0ff2ab0f64
Fixing the stream by removing the read_index altogether. ( #1716 )
...
* Fixing the stream by removing the read_index altogether.
* Moving the test location because.. Windows.
* Ok whatever.
* Rust 1.84
* Fmt.
2025-01-09 17:41:15 +01:00
bdfc38b78d
Fix typos ( #1715 )
...
* Fix typos
Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com >
* Update docs/source/quicktour.rst
* Update docs/source-doc-builder/quicktour.mdx
---------
Signed-off-by: tinyboxvk <13696594+tinyboxvk@users.noreply.github.com >
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2025-01-09 11:53:20 +01:00
6945933829
update Split pretokenizer docstrings ( #1701 )
2025-01-08 12:35:52 +01:00
3a6504d274
Upgrade to PyO3 0.23 ( #1708 )
...
* Upgrade to PyO3 0.23
* Macos-12 deprecated?
* Clippy.
* Clippy auto elision.
2024-12-31 18:36:01 +01:00
24d29f498d
Update dev version and pyproject.toml ( #1693 )
...
* update pyproject.toml
* update py dev version
2024-11-27 16:01:48 +01:00
1bf2a66b80
v0.20.4-dev0
2024-11-27 10:07:49 +01:00
eb4cc86d4e
Bump cross-spawn from 6.0.5 to 6.0.6 in /bindings/node ( #1687 )
...
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn ) from 6.0.5 to 6.0.6.
- [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/v6.0.6/CHANGELOG.md )
- [Commits](https://github.com/moxystudio/node-cross-spawn/compare/v6.0.5...v6.0.6 )
---
updated-dependencies:
- dependency-name: cross-spawn
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-25 10:04:06 +01:00
ac34660e44
Fix encode_batch and encode_batch_fast to accept ndarrays again ( #1679 )
...
* Fix encode_batch and encode_batch_fast to accept ndarrays again
* Fix clippy
---------
Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com >
2024-11-21 11:55:11 +01:00
cc5fb01a2f
Decode stream python ( #1678 )
...
* Python binding for decode stream
Different API because Python cannot handle lifetimes properly.
* Clippy.
2024-11-15 12:06:22 +01:00
f4c9fd7f40
Testing ABI3 wheels to reduce number of wheels ( #1674 )
...
* Testing ABI3 wheels to reduce number of wheels
* No need for py-clone anymore.
* Upgrade python versions.
* Remove those flakes.
* Promoting new CI + Fixing secret.
2024-11-15 06:02:22 +01:00
c6b5c3eab7
More cache options. ( #1675 )
...
* More cache options.
* Fixing error messages.
2024-11-06 11:12:09 +01:00
57884ebaa2
[MINOR:TYPO] Fix docstrings ( #1653 )
...
* [MINOR:TYPO] Update pre_tokenizers.rs
* [MINOR:TYPO] Update __init__.pyi
2024-11-05 16:25:06 +01:00
5e223ceb48
fix pylist ( #1673 )
...
* fix pylist
* add comment about why we use PySequence
* style
* fix encode batch fast as well
* Update bindings/python/src/tokenizer.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* fix with capacity
* stub :)
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-11-05 16:24:23 +01:00
7c36735389
v0.20.2-dev.0 version
2024-11-04 18:36:40 +01:00
6ade8c2d21
PyO3 0.22 ( #1665 )
...
* PyO3 0.22
* Fix python stubs
* Remove name arg from PyModel::save Python signature
---------
Co-authored-by: Dimitris Iliopoulos <diliopoulos@fb.com >
2024-11-01 10:17:23 +01:00
6ea758872d
Unsound call of set_var ( #1664 )
...
* refactor: lift cloning to caller
* refactor: do not elide lifetimes as in Rust 2018
* fix: unsound use of env::set_var, which broke with the stdlib change that makes it unsafe
It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
2024-10-25 15:44:30 +02:00
a8738a95d1
Arg name correction: auth_token -> token ( #1621 )
...
* Arg name correction: auth_token -> token
* Arg name correction in .rs: auth_token -> token
* update from_pretrained.rs file as well
---------
Co-authored-by: Rene Ravenel <rene@Renes-MacBook-Pro.local >
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com >
2024-10-24 16:32:09 +02:00
51826532d4
push new dev version
2024-10-10 12:00:16 +02:00
3d51a1695f
Fix documentation build ( #1642 )
...
* use v4
* fix ruff
* style
2024-10-01 14:48:02 +02:00
81c471cf17
update dev version 0.20.0
2024-08-08 18:11:50 +02:00
bfd9cdeefb
Perf improvement 16% by removing offsets. ( #1587 )
...
* [Breaking Change] Perf improvement 16% by removing offsets.
Offsets are always calculated in Python land.
By no longer calculating them, we gain 16% in runtime.
This is not the full extent of the possible gain, because offsets are
still calculated in bytes.
* Required features.
* Remove clippy error.
* Make it non breaking and still show perf improvement.
* Even faster without offsets.
* Update doc.
* Fmt.
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
* fmt.
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
2024-08-08 14:56:13 +02:00
49dafd707e
Fix strip python type ( #1602 )
...
* update
* the fix
* Revert "update"
This reverts commit 4c2f32f116479b0ec8ccd7c832f86cbc8787d8a9.
* add a test and rebase
* style
* oups
2024-08-07 15:36:28 +02:00
bded212356
Support None to reset pre_tokenizers and normalizers, and index sequences ( #1590 )
...
* initial commit
* support None
* fix clippy
* cleanup
* clean?
* propagate to pre_tokenizer
* fix test
* fix rust tests
* fix node
* propagate to decoder and post processor
* fix calls
* lint
* fmt
* node be happy I am fixing you
* add a small test
* styling
* style merge
* fix merge test
* fmt
* nits
* update test
2024-08-07 12:52:35 +02:00
eea8e1ae6f
Fix doc about split ( #1591 )
...
* update doc
* add example
* Update bindings/python/src/pre_tokenizers.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* stub
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-08-07 12:35:01 +02:00
ab9c7ded8b
Using serde (serde_pyo3) to get __str__ and __repr__ easily. ( #1588 )
...
* Using serde (serde_pyo3) to get __str__ and __repr__ easily.
* Putting it within tokenizers, it needs to be too specific.
* Clippy is our friend.
* Ruff.
* Update the tests.
* Pretty sure this is wrong (#1589 )
* Adding support for ellipsis.
* Fmt.
* Ruff.
* Fixing tokenizer.
---------
Co-authored-by: Eric Buehler <65165915+EricLBuehler@users.noreply.github.com >
2024-08-07 12:08:29 +02:00
a010f6b75c
Revert "Using serde (serde_pyo3) to get __str__ and __repr__ easily."
...
This reverts commit 86138337fc.
2024-08-02 18:42:57 +02:00
86138337fc
Using serde (serde_pyo3) to get __str__ and __repr__ easily.
2024-08-02 18:41:54 +02:00
7415e28536
Enabling the option to use fancy_regex instead of onig.
2024-08-01 15:53:17 +02:00
1df498a186
Fixing benchmark2.
2024-08-01 15:52:39 +02:00
c6f2c0b057
Fixing the benchmark. ( #1583 )
2024-08-01 10:36:53 +02:00
35f338a7b8
Add benchmark vs tiktoken ( #1582 )
...
* Adding a simple tiktoken benchmark.
* Adding 1 large fused document case.
2024-07-31 17:09:23 +02:00
4ea2f235b0
Add bytelevel normalizer to fix decode when adding tokens to BPE ( #1555 )
...
* feature dependent test
* nit about 嗎
* update
* actually fix it
* update the test
add it
fix
* stub
* Update tokenizers/src/pre_tokenizers/byte_level.rs
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com >
* skip failing test
* add normalizer to init
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com >
2024-07-15 12:12:03 +02:00
fdd26ba9a3
Enable dropout = 0.0 as an equivalent to none in BPE ( #1550 )
...
* enable dropout = 0.0
* typo
* lint
* formatter
* enable dropout = 0.0
* formatter
2024-06-24 12:36:11 +02:00
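One way to read #1550 is that a dropout of 0.0 is accepted and treated the same as having no dropout at all. A minimal sketch of that normalization is shown below; `normalize_dropout` is a hypothetical helper written for illustration, and the accepted range and error type are assumptions rather than the library's actual validation.

```python
def normalize_dropout(dropout):
    """Map a user-supplied dropout to None (disabled) or a float in (0, 1]."""
    if dropout is None:
        return None
    value = float(dropout)
    # Assumed valid range; NaN also fails this check and is rejected.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"dropout must be in [0.0, 1.0], got {value}")
    # A dropout of 0.0 never skips a merge, so treat it like "no dropout".
    return None if value == 0.0 else value
```

Collapsing 0.0 to `None` up front means later code only has to distinguish "disabled" from "a positive probability".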
1ff56c0c70
Fix 'dictionnary' typo ( #1511 )
2024-06-11 15:43:47 +02:00
88f51fe7d2
Switch from cached_download to hf_hub_download in tests ( #1547 )
2024-06-11 15:26:58 +02:00
f2ec3b239b
remove enforcement of non special when adding tokens ( #1521 )
...
* remove enforcement of non special when adding tokens
* mut no longer needed
* add a small test
* nit
* style
* audit
* ignore cargo audit's own vulnerability
* update
* revert
* remove CVE
2024-04-30 15:53:47 +02:00
71c2a8d01a
update dev version to 0.19.1
2024-04-17 23:17:12 +02:00
91393ef75e
Fixing doc. ( #1499 )
...
* Fixing doc.
* SentencePieceUnigram and Convert.py still used sentencepiece
* stub
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com >
2024-04-17 09:32:40 +02:00
949d9e3e0e
Bumping all versions 3 times (ty transformers :) ) ( #1498 )
2024-04-16 15:58:36 +02:00
d5a8cc7a49
PyO3 0.21. ( #1494 )
...
* PyO3 0.21.
* Upgraded everything.
* Rustfmt.
2024-04-16 13:49:52 +02:00
914576f7ed
Add more support for tiktoken based tokenizers ( #1493 )
...
* first commit
* update
* clippy
* lint
* clippy and lint
* fmt
* revert print
* 😈
* style
* add a test
* more fmt
* Use ignore_merges
* stub
* fix
* update
* Update tokenizers/src/models/bpe/model.rs
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
* update
* rust lint
* don't repeat yourself
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com >
2024-04-15 17:26:36 +02:00
6e58f838b3
version = "0.16.0-dev.0"
2024-04-02 09:51:14 +02:00