88 Commits

Author SHA1 Message Date
Sylvain Gugger
6616e699f7 Expand documentation of UnigramTrainer (#770)
* Expand documentation of UnigramTrainer

* Put doc at the source

* Add signature

* make style

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-12 10:12:26 -04:00
SaulLu
da4c7b10e4 Add a way to specify the unknown token in SentencePieceUnigramTokenizer Python implementation (#762)
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`

* add test that verifies that an exception is raised for the missing unknown token

* style

* add test tokens
2021-08-12 09:42:44 -04:00
Anthony MOI
3a002c1aa8 Python - prepare for release 0.10.3 2021-05-24 16:59:10 -04:00
Nicolas Patry
2e2e7558f7 Add CTC Decoder for Wav2Vec models (#693)
* Rust - add a CTCDecoder as a separate mod

* Adding bindings to Node + Python.

* Clippy update.

* Stub.

* Fixing roberta.json URLs.

* Moving test files to hf.co.

* Update cargo check and clippy to 1.52.

* Inner ':' is actually used for domains in Sphinx.

Making `domain` work correctly was just too much work so I went the easy
way and have global roles for the custom rust extension.

* Update struct naming and docs

* Update changelog

Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:30:09 -04:00
Anthony MOI
32b3b7a0f2 Python - Prepare for release 0.10.2 2021-04-05 16:47:55 -04:00
Anthony MOI
bc8bbf637a Prepare for python v0.10.1 (#625) 2021-02-08 11:45:56 -05:00
Anthony MOI
d96442cbe8 Python - Prepare for release 0.10.1rc1 (#622) 2021-02-04 10:37:00 -05:00
Anthony MOI
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
Anthony MOI
719bea76b9 Python - Prepare for release 0.10.0 2021-01-12 16:34:04 -05:00
devfon
b9c6bea75e Add fuse_unk option to SentencePieceBPETokenizer (#574)
* Add fuse_unk option to SentencePieceBPETokenizer

* Fix style

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-01-12 16:07:59 -05:00
Anthony MOI
cca5d43038 Python - Fix breaking change in Model.save 2021-01-11 16:09:19 -05:00
Anthony MOI
49d11b1f69 Python - Add components getter/setters to BaseTokenizer 2021-01-11 16:08:38 -05:00
Anthony MOI
d94fa220b6 Python - Add train_from_iterator to implementations 2021-01-07 09:02:20 -05:00
Anthony MOI
0c6cc39eee Python - Update CHANGELOG and bump for release 2020-12-08 13:29:35 -05:00
Tal Perry
8916b6bb27 Add a visualization utility to render tokens and annotations in a notebook (#508)
* Draft functionality of visualization

* Added comments to make code more intelligible

* polish the styles

* Ensure colors are stable and comment the css

* Code clean up

* Made visualizer importable and added some docs

* Fix styling

* implement comments from PR

* Fixed the regex for UNK tokens and examples in notebook

* Converted docs to google format

* Added a notebook showing multiple languages and tokenizers

* Added visual indication of chars that are tokenized with >1 token

* Reorganize things a bit and fix import

* Update docs

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-12-04 10:25:56 -05:00
Anthony MOI
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
Anthony MOI
6e364cb685 Python - Update CHANGELOG and stub files 2020-11-27 17:35:34 -05:00
Patrick von Platen
dd399d2ad0 Split Pre-Tokenizer (#542)
* start playing around

* make a first version

* refactor

* apply make format

* add python bindings

* add some python binding tests

* correct pre-tokenizers

* update auto-generated bindings

* lint python bindings

* add code node

* add split to docs

* refactor python binding a bit

* cargo fmt

* clippy and fmt in node

* quick updates and fixes

* Oops

* Update node typings

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-27 17:07:03 -05:00
Anthony MOI
58e1d8de67 Python - Improve documentation for trainers 2020-11-23 11:52:51 -05:00
Anthony MOI
64441b54b1 Python - Improve documentation for post-processors 2020-11-23 11:52:51 -05:00
Anthony MOI
933a2a9c99 Python - Improve pre-tokenizers docs 2020-11-23 11:52:51 -05:00
Anthony MOI
5842b3db73 Python - Improve normalizers docs 2020-11-23 11:52:51 -05:00
Anthony MOI
c01c301743 Python - Improve documentation for decoders and remove useless kwargs 2020-11-23 11:52:51 -05:00
Anthony MOI
a50d4b7d25 Python - Improve documentation for models 2020-11-23 11:52:51 -05:00
Nick
dc60d4fc0c Fix BaseTokenizer enable_truncation docstring 2020-11-23 11:28:26 -05:00
Anthony MOI
387b8a1033 Generate pyi, fix tests and clippy warnings 2020-11-20 13:30:44 -05:00
Anthony MOI
224862fe0c Python - Make the trainer optional on Tokenizer.train 2020-11-20 13:30:44 -05:00
Anthony MOI
059d43b265 Add WordLevel trainer 2020-11-20 13:30:44 -05:00
Anthony MOI
58b618f98e Python - Update __init__.pyi 2020-11-17 15:28:41 -05:00
Nicolas Patry
352c92ad33 Automatically stubbing the pyi files while keeping inspecting ability (#509)
* First pass on automatic stubbing our python files.

* And now modifying all rust docs to be visible in Pyi files.

* Better assert fail message.

* Fixing github workflow.

* Removing types not exported anymore.

* Fixing `Tokenizer` signature.

* Disabling auto __init__.py.

* Re-enabling some types.

* Don't overwrite non automated __init__.py

* Automated most __init__.py

* Restubbing after rebase.

* Fixing env for tests.

* Install black in the env.

* Use PY35 target in stub.py

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-11-17 15:13:00 -05:00
Anthony MOI
75b41dab0f Python - Update CHANGELOG and bump version for 0.9.4 2020-11-09 16:36:04 -05:00
Anthony MOI
57d162b269 Add an Encoding.sequences to allow masking 2020-11-06 10:41:56 -05:00
Anthony MOI
385d25720a Simplify the API for Encoding.token_to_XXX 2020-11-06 10:41:56 -05:00
Anthony MOI
dce218ca28 Python - Encoding mappings handle sequence_id 2020-11-06 10:41:56 -05:00
Mohamed Al Salti
20c7045ba1 Update sentencepiece_unigram.py
Update the URL to `sentencepiece_model_pb2.py` in the error message.
2020-11-04 08:08:11 -05:00
Anthony MOI
d2fc0e4836 Doc - Update API Reference for Encoding 2020-11-02 17:07:27 -05:00
Anthony MOI
a86d49634c Doc - API Reference for most Tokenizer methods/attributes 2020-11-02 17:07:27 -05:00
Anthony MOI
79f02bb7f0 Doc - Updated API Reference for encode/encode_batch 2020-11-02 17:07:27 -05:00
taufique74
4929809af0 makes from_file() method static 2020-11-01 13:15:15 -05:00
Anthony MOI
2364d376f7 Python - Update CHANGELOG and bump to 0.9.3 for release 2020-10-26 16:40:24 -04:00
Anthony MOI
1a6f4b5204 Allow initial_alphabet on UnigramTrainer 2020-10-26 10:57:29 -04:00
Timur Ganiev
f7c61c267a Fixed BPE.read_files -> BPE.read_file in SentencePieceBPETokenizer 2020-10-26 10:57:14 -04:00
Anthony MOI
a2289d49b4 Finish exposing the UnicodeScripts PreTokenizer 2020-10-21 11:01:54 -04:00
Anthony MOI
91f602f744 Python - Update CHANGELOG and bump to 0.9.2 for release 2020-10-15 10:14:58 -04:00
Anthony MOI
f94a274702 Python - Update CHANGELOG and bump version for release 2020-10-13 14:45:21 -04:00
Anthony MOI
4f4ba4a11a Python - Bump version for 0.9.0 release 2020-10-09 13:00:19 -04:00
Anthony MOI
aebf510c5a Python - Update CHANGELOG and bump to 0.9.0.rc1 2020-09-29 10:24:24 -04:00
Nicolas Patry
6c25bb729b Update __init__.pyi 2020-09-29 10:09:10 -04:00
Anthony MOI
1070eb471e Python - Update bindings for TemplateProcessing 2020-09-29 10:09:10 -04:00
Anthony MOI
171a042ee0 Python - Bump version for dev4 release 2020-09-24 10:16:18 -04:00