implement a simple max_sentencepiece_length into BPE (#1228)

* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way.

If this code were to be actually integrated some works to be done

Documentation describing the behavior and how it should be set.
Set default==0 so it doesnt act unless set
provide ways in the python binding for the user to set max token length

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus.

* implement a simple max_sentencepiece_length into BPE

Add a way for the BPE trainer to behave like the unigram trainer where tokens longer than a certain lenght(default 16 in SPM) to be skipped. this is implemented in unigram trainer but in a different way.

If this code were to be actually integrated some works to be done

Documentation describing the behavior and how it should be set.
Set default==0 so it doesnt act unless set
provide ways in the python binding for the user to set max token length

I was trying to find a way to implement max_sentencepiece_length through pretokenizer split rules and to be honest, its very difficult and regexes can be real slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs

clarify with type usize propagate max_length option

* change max_length into more descriptive name

in the documentation
https://huggingface.co/docs/tokenizers/api/trainers
unigramtrainer uses max_piece_length for similar function.
since BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs

change max_merge_length into max_token_length

* Update trainer.rs

add several max_token_length declaration that were missing.
impl BpeTrainerBuilder
struct BpeTrainer

Add explanation for variable shadowing.

* Update trainer.rs

Move default definition of max_token_length to proper location. adjust downstream variable initializations accordingly.

* add max_token_length test

* Add bpe direct assert test

* Update trainer.rs

clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This commit is contained in:
Chris Ha
2023-05-16 17:08:19 +09:00
committed by GitHub
parent daf3fcc976
commit cefc41e8ec
6 changed files with 6799 additions and 11 deletions

View File

@ -162,6 +162,12 @@ macro_rules! setter {
///
/// end_of_word_suffix (:obj:`str`, `optional`):
/// A suffix to be used for every subword that is a end-of-word.
///
/// max_token_length (:obj:`int`, `optional`):
/// Prevents creating tokens longer than the specified size.
/// This can help with reducing polluting your vocabulary with
/// highly repetitive tokens like `======` for wikipedia
///
#[pyclass(extends=PyTrainer, module = "tokenizers.trainers", name = "BpeTrainer")]
pub struct PyBpeTrainer {}
#[pymethods]
@ -243,6 +249,16 @@ impl PyBpeTrainer {
setter!(self_, BpeTrainer, limit_alphabet, limit);
}
#[getter]
fn get_max_token_length(self_: PyRef<Self>) -> Option<usize> {
getter!(self_, BpeTrainer, max_token_length)
}
#[setter]
fn set_max_token_length(self_: PyRef<Self>, limit: Option<usize>) {
setter!(self_, BpeTrainer, max_token_length, limit);
}
#[getter]
fn get_initial_alphabet(self_: PyRef<Self>) -> Vec<String> {
getter!(
@ -315,6 +331,7 @@ impl PyBpeTrainer {
);
}
"limit_alphabet" => builder = builder.limit_alphabet(val.extract()?),
"max_token_length" => builder = builder.max_token_length(val.extract()?),
"initial_alphabet" => {
let alphabet: Vec<String> = val.extract()?;
builder = builder.initial_alphabet(