Improvements on spm parity (#401)

* Removing all pre_tokenizer logic from the Unigram algorithm.

* Significantly improving the parity check.

- We can now detect many more errors.
- Special cases have been added temporarily.

* Adding 2 new normalizers that mimic spm's default behavior.

* Adding an `encoding_optimized` version of the `encode` algorithm.

- Removes the Lattice allocation.
- Changes the trie's `common_prefix_search` to return an iterator, avoiding
  allocation of the full result set (see the sketch below).

* `Trie<char>` -> `Trie<u8>`: another speed improvement.
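To make the two trie changes above concrete, here is a minimal, hypothetical sketch (the PR's real trie is more involved): it works on `u8` rather than `char`, and its `common_prefix_search` yields match lengths lazily, so the hot `encode` loop never collects results into a `Vec`.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    is_leaf: bool,
    children: HashMap<u8, Node>,
}

#[derive(Default)]
struct Trie {
    root: Node,
}

impl Trie {
    fn push(&mut self, word: &[u8]) {
        let mut node = &mut self.root;
        for &b in word {
            node = node.children.entry(b).or_default();
        }
        node.is_leaf = true;
    }

    /// Walks the trie lazily over `text`, yielding the byte length of every
    /// vocabulary entry that prefixes it; no result `Vec` is ever built.
    fn common_prefix_search<'a>(&'a self, text: &'a [u8]) -> impl Iterator<Item = usize> + 'a {
        let mut node = Some(&self.root);
        text.iter().enumerate().filter_map(move |(i, b)| {
            node = node.and_then(|n| n.children.get(b));
            match node {
                Some(n) if n.is_leaf => Some(i + 1),
                _ => None,
            }
        })
    }
}

fn main() {
    let mut trie = Trie::default();
    for token in ["a", "ab", "abc"] {
        trie.push(token.as_bytes());
    }
    // Prints 1, 2, 3: the lengths of "a", "ab" and "abc", produced on demand.
    for len in trie.common_prefix_search("abcd".as_bytes()) {
        println!("{}", len);
    }
}
```

Returning `impl Iterator` also lets callers stop early, something a pre-collected `Vec` cannot offer.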

* [WIP] Attempt to create a Precompiled Normalizer from SPM to be 100%
compliant with arbitrary models.

* Adding a new `Precompiled` normalizer that replaces `SpmNmtNfkc`.

- It provides direct compatibility with `Spm`, replacing all of our custom
rules by directly using the normalizer spec embedded within spm files, so
we no longer need to maintain any rules ourselves.
- We need the `nom` dependency to parse `spm`'s binary format (see the sketch after this list).
- We need to add the `sentencepiece_model_pb2.py` file to be able to read
  the proto file.
- We reimplemented their `Darts::DoubleArray` compact trie format.
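As a rough illustration of the `nom` point above, here is an assumption-laden sketch of splitting the `precompiled_charsmap` blob. The layout assumed here (a little-endian `u32` trie size, then the serialized `Darts::DoubleArray`, then the normalized-strings blob) is our reading of what sentencepiece's builder writes, and all names below are hypothetical:

```rust
use nom::bytes::complete::take;
use nom::number::complete::le_u32;
use nom::IResult;

/// The two sections we assume the blob contains (layout hedged above).
struct PrecompiledCharsmap<'a> {
    trie_blob: &'a [u8],       // serialized `Darts::DoubleArray` units
    normalized_blob: &'a [u8], // concatenated replacement strings
}

fn parse_precompiled(input: &[u8]) -> IResult<&[u8], PrecompiledCharsmap<'_>> {
    // First 4 bytes: little-endian size of the double-array trie.
    let (rest, trie_size) = le_u32(input)?;
    // Next `trie_size` bytes are the trie; everything after is the
    // normalized-strings blob.
    let (normalized_blob, trie_blob) = take(trie_size as usize)(rest)?;
    let remaining = &normalized_blob[normalized_blob.len()..];
    Ok((remaining, PrecompiledCharsmap { trie_blob, normalized_blob }))
}
```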

* Fixing a bug with the `Precompiled` normalizer.

* Fixing some edge cases (now covered by tests) in this weird precompiled
normalizer.

It seems even a handily crafted trie does not prevent one from shooting
oneself in the foot. Sorry, future reader.

* Keep the API stable for this PR (API changes should come later, see #409).

- Removed `sentencepiece_model_pb2` from the bindings and added instructions
to make `from_spm` work.

* Adding a model check in `from_spm`.

* Addressing @n1t0's comments.

* Adding a check to make sure alignments stay correct (a sketch follows below).

Also added a bit more documentation on how Precompiled works.
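For illustration only, here is a hedged sketch of the kind of invariant such an alignment check can enforce; the function and exact rules below are hypothetical, not the PR's code:

```rust
/// Hypothetical invariant: every normalized byte maps back to a valid,
/// monotonically non-decreasing span of the original string.
fn alignments_are_sane(alignments: &[(usize, usize)], original_len: usize) -> bool {
    let mut last_start = 0;
    for &(start, end) in alignments {
        // Spans must be well-formed, in bounds, and never move backwards.
        if start > end || end > original_len || start < last_start {
            return false;
        }
        last_start = start;
    }
    true
}
```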

* Extracting `Precompiled` into its own `spm_precompiled` crate.

* Using ranges in `do_nmt`.
Commit 330876ae02 (parent 62c3d40f11)
Authored by Nicolas Patry on 2020-09-15 22:21:02 +02:00; committed by GitHub.
22 changed files with 897 additions and 207 deletions.


@@ -7,7 +7,9 @@ use pyo3::types::*;
 use crate::error::ToPyResult;
 use serde::ser::SerializeStruct;
 use serde::{Deserialize, Serialize, Serializer};
-use tk::normalizers::{BertNormalizer, Lowercase, NormalizerWrapper, Strip, NFC, NFD, NFKC, NFKD};
+use tk::normalizers::{
+    BertNormalizer, Lowercase, Nmt, NormalizerWrapper, Precompiled, Strip, NFC, NFD, NFKC, NFKD,
+};
 use tk::{NormalizedString, Normalizer};
 use tokenizers as tk;
@@ -45,6 +47,10 @@ impl PyNormalizer {
             NormalizerWrapper::Lowercase(_) => {
                 Py::new(py, (PyLowercase {}, base)).map(Into::into)
             }
+            NormalizerWrapper::Precompiled(_) => {
+                Py::new(py, (PyPrecompiled {}, base)).map(Into::into)
+            }
+            NormalizerWrapper::Nmt(_) => Py::new(py, (PyNmt {}, base)).map(Into::into),
         },
     }
 }
@@ -273,6 +279,37 @@ impl Normalizer for PyNormalizerWrapper {
     }
 }
 
+#[pyclass(extends=PyNormalizer, module = "tokenizers.normalizers", name=Nmt)]
+pub struct PyNmt {}
+#[pymethods]
+impl PyNmt {
+    #[new]
+    fn new() -> PyResult<(Self, PyNormalizer)> {
+        Ok((PyNmt {}, Nmt.into()))
+    }
+}
+
+#[pyclass(extends=PyNormalizer, module = "tokenizers.normalizers", name=Precompiled)]
+pub struct PyPrecompiled {}
+#[pymethods]
+impl PyPrecompiled {
+    #[new]
+    fn new(py_precompiled_charsmap: &PyBytes) -> PyResult<(Self, PyNormalizer)> {
+        let precompiled_charsmap: &[u8] = FromPyObject::extract(py_precompiled_charsmap)?;
+        Ok((
+            PyPrecompiled {},
+            Precompiled::from(precompiled_charsmap)
+                .map_err(|e| {
+                    exceptions::Exception::py_err(format!(
+                        "Error while attempting to build Precompiled normalizer: {}",
+                        e.to_string()
+                    ))
+                })?
+                .into(),
+        ))
+    }
+}
+
 #[cfg(test)]
 mod test {
     use pyo3::{AsPyRef, Python};
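For reference, here is a minimal sketch of driving the Rust API that the `PyPrecompiled` binding above wraps. It assumes `Precompiled::from(&[u8])` as seen in this diff plus the crate's `Normalizer`/`NormalizedString` API; the helper name is ours, and `precompiled_charsmap` is the normalizer spec extracted from an spm model file, as described in this PR.

```rust
use tokenizers::normalizers::Precompiled;
use tokenizers::{NormalizedString, Normalizer, Result};

// Hypothetical helper, not part of the library.
fn normalize_like_spm(precompiled_charsmap: &[u8], text: &str) -> Result<String> {
    // Build the normalizer from the spec embedded in the spm file.
    let normalizer = Precompiled::from(precompiled_charsmap)
        .map_err(|e| format!("invalid precompiled charsmap: {}", e))?;
    // Normalize in place; `NormalizedString` tracks alignments to the original.
    let mut normalized = NormalizedString::from(text);
    normalizer.normalize(&mut normalized)?;
    Ok(normalized.get().to_string())
}
```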