612 Commits

Author SHA1 Message Date
kingyiusuen
c1100dcbe3 Fix typo in documentation (#743)
* Doc - Fix typo (And instance of -> An instance of)

* Add missing text_signature for WordLevel.from_file

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-13 08:08:23 -04:00
Sylvain Gugger
6616e699f7 Expand documentation of UnigramTrainer (#770)
* Expand documentation of UnigramTrainer

* Put doc at the source

* Add signature

* make style

Co-authored-by: Anthony Moi <m.anthony.moi@gmail.com>
2021-08-12 10:12:26 -04:00
SaulLu
da4c7b10e4 Add a way to specify the unknown token in SentencePieceUnigramTokenizer python implem (#762)
* add a way to specify the unknown token in `SentencePieceUnigramTokenizer`

* add test that verifies that an exception is raised for the missing unknown token

* style

* add test tokens
2021-08-12 09:42:44 -04:00
Nicolas Patry
256a71c1f2 Clippy 1.54. (#773) 2021-08-11 14:43:49 +02:00
Nicolas Patry
d83772d62c Fixing tokenizers with 1.53 (updated some dependencies + clippy) (#764) 2021-07-21 09:58:38 +02:00
Anthony MOI
755e5f5c1e Remove support for Python 3.5 (#714)
* Python - remove support for python 3.5

* revert ci

* revert build-wheels.sh

* Update CHANGELOG.md
2021-05-24 17:31:01 -04:00
Anthony MOI
3a002c1aa8 Python - prepare for release 0.10.3 2021-05-24 16:59:10 -04:00
Nicolas Patry
c046da7679 Fix stripping strings containing Unicode characters (#707)
* Strip seems to have been broken for a while on unicode strings.

- Includes a failing test + the fix.
- This function could maybe be optimized; we're scanning the string 3 times now,
  and once fully for chars.

* Update CHANGELOG.md

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-24 16:49:59 -04:00
Anthony MOI
4b7f8c2d7c Fix CHANGELOG.md 2021-05-24 16:16:40 -04:00
Lysandre Debut
4b0dc6b947 Fix SPM conversions (#686)
* Fix SPM conversions

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:55:55 -04:00
Nicolas Patry
2e2e7558f7 Add CTC Decoder for Wav2Vec models (#693)
* Rust - add a CTCDecoder as a separate mod

* Adding bindings to Node + Python.

* Clippy update.

* Stub.

* Fixing roberta.json URLs.

* Moving test files to hf.co.

* Update cargo check and clippy to 1.52.

* Inner ':' actually is used for domains in sphinx.

Making `domain` work correctly was just too much work so I went the easy
way and have global roles for the custom rust extension.

* Update struct naming and docs

* Update changelog

Co-authored-by: Thomaub <github.thomaub@gmail.com>
Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-20 09:30:09 -04:00
Lysandre
e999a7b5f9 Revert "Fix SPM conversions"
This reverts commit e1ffe39764.
2021-04-21 18:09:58 -04:00
Lysandre
e1ffe39764 Fix SPM conversions 2021-04-21 18:09:49 -04:00
Anthony MOI
32b3b7a0f2 Python - Prepare for release 0.10.2 2021-04-05 16:47:55 -04:00
Anthony MOI
e1627654b4 Fix Clippy warnings for Rust 1.51 2021-04-05 16:05:48 -04:00
Anthony MOI
659a835d04 Python - Accept kwargs in Metaspace constructor
This is mainly for backward compatibility with Metaspace objects that used to contain a `str_rep` field
2021-04-05 16:05:48 -04:00
Anthony MOI
0fe9214f44 Fix BPE continuing_subword_prefix 2021-03-18 14:39:52 -04:00
Anthony MOI
f5e9bb89b7 Fix offsets for Precompiled corner case 2021-03-16 15:04:42 -04:00
Anthony MOI
56a9196030 Fix clippy warnings 2021-03-16 12:32:06 -04:00
Anthony MOI
bc8bbf637a Prepare for python v0.10.1 (#625) 2021-02-08 11:45:56 -05:00
Anthony MOI
d96442cbe8 Python - Prepare for release 0.10.1rc1 (#622) 2021-02-04 10:37:00 -05:00
Anthony MOI
57200144ca Python - Fix ByteLevel instantiation from state (#621) 2021-02-04 10:16:05 -05:00
Anthony MOI
a8f756494e Improve Model serialization/deserialization (#620) 2021-02-04 09:59:18 -05:00
Anthony MOI
6a29dbc070 Doc - Hotfix training from iterators tutorial 2021-02-03 15:50:09 -05:00
Anthony MOI
db22cb6315 Python - Fix Normalizer.normalize with PyNormalizedStringRefMut 2021-02-03 15:48:53 -05:00
Anthony MOI
355315e8d3 Rust - Fix offsets produced by Precompiled Normalizer 2021-02-03 15:46:45 -05:00
Anthony MOI
96b9972842 Fix SentencePiece tokenizers conversion 2021-02-03 12:44:46 -05:00
Anthony MOI
719bea76b9 Python - Prepare for release 0.10.0 2021-01-12 16:34:04 -05:00
devfon
b9c6bea75e Add fuse_unk option to SentencePieceBPETokenizer (#574)
* Add fuse_unk option to SentencePieceBPETokenizer

* Fix style

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-01-12 16:07:59 -05:00
Anthony MOI
91dae1de15 Doc - Add documentation for training from iterators 2021-01-12 15:51:38 -05:00
Anthony MOI
cca5d43038 Python - Fix breaking change in Model.save 2021-01-11 16:09:19 -05:00
Anthony MOI
49d11b1f69 Python - Add components getter/setters to BaseTokenizer 2021-01-11 16:08:38 -05:00
Anthony MOI
d94fa220b6 Python - Add train_from_iterator to implementations 2021-01-07 09:02:20 -05:00
Anthony MOI
817c5ad317 Fix clippy warnings for rust 1.49 2021-01-06 15:03:33 -05:00
Anthony MOI
5938a12b3f Python - Improve training with iterators 2021-01-06 11:38:43 -05:00
Anthony MOI
0c6cc39eee Python - Update CHANGELOG and bump for release 2020-12-08 13:29:35 -05:00
Tal Perry
8916b6bb27 Add a visualization utility to render tokens and annotations in a notebook (#508)
* Draft functionality of visualization

* Added comments to make code more intelligible

* polish the styles

* Ensure colors are stable and comment the css

* Code clean up

* Made visualizer importable and added some docs

* Fix styling

* implement comments from PR

* Fixed the regex for UNK tokens and examples in notebook

* Converted docs to google format

* Added a notebook showing multiple languages and tokenizers

* Added visual indication of chars that are tokenized with >1 token

* Reorganize things a bit and fix import

* Update docs

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2020-12-04 10:25:56 -05:00
Anthony MOI
5549fc4837 Python - Update CHANGELOG 2020-11-28 12:42:37 -05:00
Anthony MOI
3a8627ce4d Improve docs and fix tests around training 2020-11-28 12:29:35 -05:00
Anthony MOI
999067454d Make sure we first try to extract a string 2020-11-28 12:29:35 -05:00
Anthony MOI
ed9baeabb7 Add example for training with datasets 2020-11-28 12:29:35 -05:00
Anthony MOI
c36ac0bfdf Improve progress tracking while training 2020-11-28 12:29:35 -05:00
Anthony MOI
75deaecdd0 Also accept iterators of batches in train_from_iterator 2020-11-28 12:29:35 -05:00
Anthony MOI
e0a70f1fb2 Add ability to train from Iterator 2020-11-28 12:29:35 -05:00
Anthony MOI
6e364cb685 Python - Update CHANGELOG and stub files 2020-11-27 17:35:34 -05:00
Anthony MOI
a351d1c604 Python - Trainers can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
3eb7ef6d0a Python - PreTokenizers can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
5c35fafc44 Python - Decoders can get/set their attributes 2020-11-27 17:35:34 -05:00
Anthony MOI
091287dcf5 Python - Use macro for getter/setter in models 2020-11-27 17:35:34 -05:00
Anthony MOI
2feccdbbfa Python - PyStrip can get/set its attributes 2020-11-27 17:35:34 -05:00