implement a simple max_sentencepiece_length into BPE (#1228)
* implement a simple max_sentencepiece_length into BPE

  Add a way for the BPE trainer to behave like the Unigram trainer, where tokens longer than a certain length (default 16 in SPM) are skipped. This is implemented in the Unigram trainer, but in a different way.

  If this code were to be actually integrated, some work remains to be done:
  - documentation describing the behavior and how it should be set;
  - set the default to 0 so it doesn't act unless set;
  - provide ways in the Python bindings for the user to set the max token length.

  I tried to find a way to implement max_sentencepiece_length through pre-tokenizer split rules and, to be honest, it is very difficult, and regexes can be really slow when operating on the whole training corpus.

* utilize Option<u16> for safer code.

* Other version.

* Update trainer.rs: clarify with type usize; propagate the max_length option.

* change max_length into a more descriptive name. In the documentation (https://huggingface.co/docs/tokenizers/api/trainers), UnigramTrainer uses max_piece_length for a similar function. Since in BPE the underlying concept is merges, using max_merge_length as the variable name could prove more descriptive.

* change variable name in trainer.rs: change max_merge_length into max_token_length.

* Update trainer.rs: add several max_token_length declarations that were missing (impl BpeTrainerBuilder, struct BpeTrainer); add an explanation for the variable shadowing.

* Update trainer.rs: move the default definition of max_token_length to the proper location; adjust downstream variable initializations accordingly.

* add max_token_length test

* Add BPE direct assert test

* Update trainer.rs: clarified test documentation

* Creating the bindings.

* Fix the default.

* Re-adding missing package-lock which I accidentally removed.

* ..

* Fixing trainer test.

* Fix.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
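For context on the mechanism the first bullet describes: the trainer skips any candidate merge whose resulting token would exceed the configured length. A minimal sketch of that guard, assuming a free-standing helper (the name `keep_merge` and the character-count check are illustrative, not the exact trainer code):

// Skip candidate merges whose resulting token would exceed
// `max_token_length`. `None` (the default) disables the check, so the
// trainer behaves exactly as before when the option is unset.
fn keep_merge(left: &str, right: &str, max_token_length: Option<usize>) -> bool {
    match max_token_length {
        Some(max_len) => left.chars().count() + right.chars().count() <= max_len,
        None => true,
    }
}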
@@ -162,6 +162,12 @@ macro_rules! setter {
 ///
 ///     end_of_word_suffix (:obj:`str`, `optional`):
 ///         A suffix to be used for every subword that is a end-of-word.
+///
+///     max_token_length (:obj:`int`, `optional`):
+///         Prevents creating tokens longer than the specified size.
+///         This can help with reducing polluting your vocabulary with
+///         highly repetitive tokens like `======` for wikipedia
+///
 #[pyclass(extends=PyTrainer, module = "tokenizers.trainers", name = "BpeTrainer")]
 pub struct PyBpeTrainer {}
 #[pymethods]
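On the Rust side, the same option is exposed through the trainer builder (the `max_token_length` method wired up in the last hunk below). A minimal usage sketch, assuming the builder method accepts an `Option<usize>`, consistent with the getter/setter in the next hunk:

use tokenizers::models::bpe::BpeTrainer;

fn main() {
    // Refuse to create tokens longer than 16 characters, mirroring
    // SentencePiece's default piece-length limit cited in the commit message.
    let _trainer: BpeTrainer = BpeTrainer::builder()
        .max_token_length(Some(16))
        .build();
}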
@@ -243,6 +249,16 @@ impl PyBpeTrainer {
         setter!(self_, BpeTrainer, limit_alphabet, limit);
     }
 
+    #[getter]
+    fn get_max_token_length(self_: PyRef<Self>) -> Option<usize> {
+        getter!(self_, BpeTrainer, max_token_length)
+    }
+
+    #[setter]
+    fn set_max_token_length(self_: PyRef<Self>, limit: Option<usize>) {
+        setter!(self_, BpeTrainer, max_token_length, limit);
+    }
+
     #[getter]
     fn get_initial_alphabet(self_: PyRef<Self>) -> Vec<String> {
         getter!(
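For readers unfamiliar with the `getter!`/`setter!` macros used in this hunk (defined near the `macro_rules! setter` context of the first hunk), a rough sketch of what the generated accessors boil down to; the real macros also handle the locking and enum wrapping around the underlying trainer:

use tokenizers::models::bpe::BpeTrainer;

// Simplified: the accessors ultimately read or write the trainer's
// public `max_token_length` field.
fn get_max_token_length(trainer: &BpeTrainer) -> Option<usize> {
    trainer.max_token_length
}

fn set_max_token_length(trainer: &mut BpeTrainer, limit: Option<usize>) {
    trainer.max_token_length = limit;
}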
@@ -315,6 +331,7 @@ impl PyBpeTrainer {
                     );
                 }
                 "limit_alphabet" => builder = builder.limit_alphabet(val.extract()?),
+                "max_token_length" => builder = builder.max_token_length(val.extract()?),
                 "initial_alphabet" => {
                     let alphabet: Vec<String> = val.extract()?;
                     builder = builder.initial_alphabet(
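The new match arm follows the constructor's keyword-dispatch pattern: each recognized Python kwarg is forwarded to the matching builder method, with `val.extract()?` converting the Python value into the parameter type. A simplified, standalone sketch with the Python value mocked as an already-extracted `Option<usize>` (the `apply_kwarg` helper is hypothetical):

use tokenizers::models::bpe::BpeTrainerBuilder;

// Trimmed dispatch: the real constructor loops over every supported
// kwarg and relies on pyo3's `extract` for the conversion.
fn apply_kwarg(builder: BpeTrainerBuilder, key: &str, val: Option<usize>) -> BpeTrainerBuilder {
    match key {
        "max_token_length" => builder.max_token_length(val),
        _ => builder, // other kwargs are handled in the same style
    }
}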