489 Commits

Author SHA1 Message Date
Nicolas Patry
6c25bb729b Update __init__.pyi 2020-09-29 10:09:10 -04:00
Anthony MOI
1070eb471e Python - Update bindings for TemplateProcessing 2020-09-29 10:09:10 -04:00
Dagmawi Moges
7f8b357b92 Fixed Dead Link: Build your own #435 (#436)
* Fixed Dead Link: Build your own #435

* Update bindings/python/README.md

Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
2020-09-25 09:41:31 -04:00
Anthony MOI
a0a163fd62 Remove unwanted file 2020-09-24 14:05:47 -04:00
Anthony MOI
171a042ee0 Python - Bump version for dev4 release 2020-09-24 10:16:18 -04:00
Nicolas Patry
a410903051 Upgrading to black 20.8b1 2020-09-24 09:27:30 -04:00
Anthony MOI
8308508577 Python - Update bindings for Replace Normalizer 2020-09-24 08:05:57 -04:00
Nicolas Patry
598ce61229 Removed now wrong code in convert.py, fixed strange black magic. 2020-09-24 08:57:02 +02:00
Nicolas Patry
95cc8c47ad Changed rust api for merges, that is now Vec<(String, String)> 2020-09-24 08:57:02 +02:00
Nicolas Patry
36832bfa12 from_files -> from_file everywhere
- read_files -> read_file
- from_file pure rust impl in python bindings
- Fix some typing in python binding
- Added {BPE,WordLevel,WordPiece}.from_file tests.
2020-09-24 08:57:02 +02:00
Nicolas Patry
9672995a56 We use 19.10b0 not 20 here... 2020-09-24 08:57:02 +02:00
Nicolas Patry
35ee1968c0 Black *Version* check. 2020-09-24 08:57:02 +02:00
Nicolas Patry
9b1ef9d895 Black pre-commit after rebase. 2020-09-24 08:57:02 +02:00
Nicolas Patry
acd4a7599f Black. 2020-09-24 08:57:02 +02:00
Nicolas Patry
8f8156fd2c Addressing first pass of comments. 2020-09-24 08:57:02 +02:00
Nicolas Patry
1cd4824273 Black on pyi file. 2020-09-24 08:57:02 +02:00
Nicolas Patry
60c1e25910 New version. Staticmethods need to return an IntoPy<PyObject>,
which is non-trivial for PyClassInitializer. Instead I added a lower-level
staticmethod that returns raw objects, and the `from_file(s)` methods
are implemented directly in Python.
2020-09-24 08:57:02 +02:00
Nicolas Patry
98a30eead1 Temp work to make the APIs uniform (build from memory by default). 2020-09-24 08:57:02 +02:00
Anthony MOI
b24a2fc178 Some suggestions from @narsil 2020-09-23 15:50:01 -04:00
Anthony MOI
31b81f109b Python - Fix for PySlice on Windows 2020-09-23 15:50:01 -04:00
Anthony MOI
b9a051f464 Python - Update some missing typings 2020-09-23 15:50:01 -04:00
Anthony MOI
7492a1d698 Python - Update typings for NormalizedString 2020-09-23 15:50:01 -04:00
Anthony MOI
0b448f46d4 Python - Update typings for PreTokenizedString 2020-09-23 15:50:01 -04:00
Anthony MOI
b1097a988f Python - Improved example with custom components 2020-09-23 15:50:01 -04:00
Anthony MOI
0a930ef1d8 Python - Update bindings for PreTokenizer 2020-09-23 15:50:01 -04:00
Anthony MOI
53aad4eca0 Python - Update support for custom Decoder 2020-09-23 15:50:01 -04:00
Anthony MOI
08a3128515 Python - Add bindings for some Model methods 2020-09-23 15:50:01 -04:00
Anthony MOI
5276238b1b Python - Add bindings for PostProcessor.process 2020-09-23 15:50:01 -04:00
Anthony MOI
b6e7a6e2f7 Python - Update PyNormalizer interface 2020-09-23 15:50:01 -04:00
Anthony MOI
bd8f25ee2c Python - Add support for custom PreTokenizer 2020-09-23 15:50:01 -04:00
Anthony MOI
8d04b22278 Python - Add support for custom Normalizer 2020-09-23 15:50:01 -04:00
Anthony MOI
9245471dcd Python - Add bindings for PreTokenizedString 2020-09-23 15:50:01 -04:00
Anthony MOI
003d2ac6fb Python - Update PyToken bindings 2020-09-23 15:50:01 -04:00
Anthony MOI
fce6998dcf Python - Add bindings for NormalizedString 2020-09-23 15:50:01 -04:00
Anthony MOI
e4b10e0fd9 Python - Add RefMutGuard to safely share &mut 2020-09-23 15:50:01 -04:00
Anthony MOI
a42e13a644 Setup black format in pyproject.toml 2020-09-23 11:58:35 -04:00
Nicolas Patry
9d3a93db5b Going back to not using fuse_unk by default for BPE, but adding a flag to
enable it.
2020-09-22 16:27:09 -04:00
Anthony MOI
940f8bd8fa Update PyO3 (#426) 2020-09-22 12:00:20 -04:00
Nicolas Patry
c536b4992b Move to dev3 build. 2020-09-22 08:21:38 +02:00
Nicolas Patry
07197e8e35 Move to spm_precompiled 0.1.2 for smaller binary string. 2020-09-22 08:21:38 +02:00
Nicolas Patry
033b98ce59 Updating convert scripts with Replace normalizer. 2020-09-22 08:21:38 +02:00
Nicolas Patry
c59b216baa Fixing convert/check scripts. 2020-09-22 08:21:38 +02:00
Nicolas Patry
c0b9229833 Fixed Vietnamese bug, now we have a Thai bug. 2020-09-22 08:21:38 +02:00
Nicolas Patry
b16406c900 Moving StripAccents within normalizer for Albert + XLNet, but now it crashes
in Precompiled. Offsets are wrong?
2020-09-22 08:21:38 +02:00
Nicolas Patry
275ee6d4c4 Making convert script machine agnostic. 2020-09-22 08:21:38 +02:00
Nicolas Patry
2fd1d9cf06 Adding a new convert script, that will convert all python Tokenizer code
into a proper Rust Tokenizer format and check it on a file.

- Also fuse_unks by default in `tokenizers`'s BPE.
2020-09-22 08:21:38 +02:00
Nicolas Patry
aea22a4004 Adding node bindings.
- simplify normalizer.
- simplify python bindings.
2020-09-18 12:24:39 +02:00
Nicolas Patry
792d618006 Adding a new "Replace" normalizer that takes a string and replaces it
with another String (for now).
2020-09-18 12:24:39 +02:00
Nicolas Patry
75464734df Adding a new normalizer that strips accents by removing combining (#416)
* Adding a new normalizer that strips accents by removing combining
characters in unicode strings.

* Adding Node bindings

+ better normalizer impl.

* Doc comment -> Regular comment.
2020-09-17 09:49:41 +02:00
Nicolas Patry
330876ae02 Improvements on spm parity: (#401)
* Removing all pre_tokenizer logic from Unigram algorithm.

* Improving *a lot* the parity check.

- We can now detect a lot more errors
- Special cases have been added temporarily.

* Adding 2 new normalizers that mimic spm's default behavior.

* Adding `encoding_optimized` version of the `encode` algorithm.

- Removes Lattice allocation.
- Changes trie `common_prefix_search` to return an iterator to avoid
  allocation of the full results.

* Trie<char> -> Trie<u8> Another improvement on speed.

* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.

* Adding a new `Precompiled` Normalizer that is replacing `SpmNmtNfkc`.

- It will be used for direct compatibility with `Spm` and replace all
their custom rules by using directly the normalizer spec embedded
within spm files, removing all need for any rules for us.
- We need `nom` dependency to parse the binary format of `spm`.
- We need to add `sentencepiece_model_pb2.py` file to be able to read
  the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.

* Fixing a bug with Precompiled normalizer.

* Fixing some edge cases (now in tests) with this weird precompiled
normalizer.

It seems a very handy crafted trie does not prevent from shooting
oneself in the foot. Sorry future reader.

* Keep API stable for this PR (change of the API should come later #409).

- Removed sentencepiece_model_pb2 from binding and add instructions to
make `from_spm` work.

* Adding model check in `from_spm`.

* Addressing @n1t0's comments.

* Adding a check to make sure alignments stay correct.

Also added a bit more documentation on how Precompiled works.

* Extracting `Precompiled` into its own `spm_precompiled` crate.

* Using ranges in `do_nmt`.
2020-09-15 22:21:02 +02:00
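
Commit 75464734df above describes a normalizer that strips accents by removing combining characters in unicode strings. The idea can be sketched in a few lines of Python using only the standard library. This is an illustration of the general technique, not the Rust implementation from the commit: the function name `strip_accents` is hypothetical, and the NFD decomposition step is an assumption added so that precomposed characters such as "é" expose their combining marks.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks from a string.

    Assumption: the input is first decomposed with NFD so that a
    precomposed character (e.g. U+00E9) becomes a base letter followed
    by one or more combining marks, which can then be filtered out.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero canonical combining
    # class for combining marks and 0 for every other character.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    print(strip_accents("café"))
    print(strip_accents("Hà Nội"))
```

Characters that carry no combining marks pass through unchanged, so the function is safe to apply to plain ASCII input.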