mirror of
https://github.com/mii443/tokenizers.git
synced 2025-08-22 16:25:30 +00:00
Split Pre-Tokenizer (#542)
* start playing around
* make a first version
* refactor
* apply make format
* add python bindings
* add some python binding tests
* correct pre-tokenizers
* update auto-generated bindings
* lint python bindings
* add code node
* add split to docs
* refactor python binding a bit
* cargo fmt
* clippy and fmt in node
* quick updates and fixes
* Oops
* Update node typings
* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
committed by GitHub
parent 58e1d8de67
commit dd399d2ad0
@@ -21,6 +21,16 @@ to customize its behavior. This page lists most provided components.
        ``Sequence([NFKC(), Lowercase()])``
    PreTokenizer.Sequence
        ``Sequence([Punctuation(), WhitespaceSplit()])``
    SplitDelimiterBehavior.removed
        :obj:`removed`
    SplitDelimiterBehavior.isolated
        :obj:`isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`merged_with_previous`
    SplitDelimiterBehavior.merged_with_next
        :obj:`merged_with_next`
    SplitDelimiterBehavior.contiguous
        :obj:`contiguous`

.. entities:: rust

@@ -36,6 +46,16 @@ to customize its behavior. This page lists most provided components.
        ``Sequence::new(vec![NFKC, Lowercase])``
    PreTokenizer.Sequence
        ``Sequence::new(vec![Punctuation, WhitespaceSplit])``
    SplitDelimiterBehavior.removed
        :obj:`Removed`
    SplitDelimiterBehavior.isolated
        :obj:`Isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`MergedWithPrevious`
    SplitDelimiterBehavior.merged_with_next
        :obj:`MergedWithNext`
    SplitDelimiterBehavior.contiguous
        :obj:`Contiguous`

.. entities:: node

@@ -51,6 +71,16 @@ to customize its behavior. This page lists most provided components.
        ..
    PreTokenizer.Sequence
        ..
    SplitDelimiterBehavior.removed
        :obj:`removed`
    SplitDelimiterBehavior.isolated
        :obj:`isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`mergedWithPrevious`
    SplitDelimiterBehavior.merged_with_next
        :obj:`mergedWithNext`
    SplitDelimiterBehavior.contiguous
        :obj:`contiguous`

Normalizers
----------------------------------------------------------------------------------------------------
@@ -203,6 +233,27 @@ the ByteLevel)

        Output: ``"Hello", "123", "there"``

    * - Split
      - Versatile pre-tokenizer that splits on the provided pattern, according to the provided behavior.
        The pattern can be inverted if necessary.

        - ``pattern`` should be either a custom string or a regexp.
        - ``behavior`` should be one of:

            * :entity:`SplitDelimiterBehavior.removed`
            * :entity:`SplitDelimiterBehavior.isolated`
            * :entity:`SplitDelimiterBehavior.merged_with_previous`
            * :entity:`SplitDelimiterBehavior.merged_with_next`
            * :entity:`SplitDelimiterBehavior.contiguous`

        - ``invert`` should be a boolean flag.

      - Example with `pattern` = :obj:`" "`, `behavior` = :obj:`"isolated"`, `invert` = :obj:`False`:

        Input: ``"Hello, how are you?"``

        Output: ``"Hello,", " ", "how", " ", "are", " ", "you?"``

    * - Sequence
      - Lets you compose multiple ``PreTokenizer`` instances that will be run in the given order
      - :entity:`PreTokenizer.Sequence`
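
To make the five behaviors listed for ``Split`` more concrete, here is a minimal pure-Python sketch. It is an illustration only, not the library's implementation: the function name ``split_with_behavior`` and the helper ``_merge`` are hypothetical, the pattern is always treated as a regular expression (use ``re.escape`` for a literal string), and edge cases may differ from what the real pre-tokenizer does.

```python
import re


def _merge(segments, prepend):
    """Attach each delimiter to the piece collected just before it.

    Hypothetical helper: a delimiter is only merged when the piece before
    it was not itself a delimiter; otherwise it becomes its own piece.
    """
    out = []
    prev_delim = False
    for piece, is_delim in segments:
        if is_delim and not prev_delim and out:
            out[-1] = piece + out[-1] if prepend else out[-1] + piece
        else:
            out.append(piece)
        prev_delim = is_delim
    return out


def split_with_behavior(text, pattern, behavior="removed", invert=False):
    """Illustrative re-creation of the documented split behaviors."""
    # Cut the text into alternating (piece, is_delimiter) segments.
    segments = []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches
        if m.start() > pos:
            segments.append((text[pos:m.start()], False))
        segments.append((m.group(0), True))
        pos = m.end()
    if pos < len(text):
        segments.append((text[pos:], False))

    # Assumption: inverting swaps the roles of matches and non-matches.
    if invert:
        segments = [(piece, not is_delim) for piece, is_delim in segments]

    if behavior == "removed":
        return [piece for piece, is_delim in segments if not is_delim]
    if behavior == "isolated":
        return [piece for piece, _ in segments]
    if behavior == "merged_with_previous":
        return _merge(segments, prepend=False)
    if behavior == "merged_with_next":
        # Same fold as above, run right to left, then restored to order.
        return _merge(reversed(segments), prepend=True)[::-1]
    if behavior == "contiguous":
        out = []
        prev_delim = False
        for piece, is_delim in segments:
            if is_delim and prev_delim:
                out[-1] += piece  # group consecutive delimiters together
            else:
                out.append(piece)
            prev_delim = is_delim
        return out
    raise ValueError(f"unknown behavior: {behavior!r}")


print(split_with_behavior("Hello, how are you?", " ", "isolated"))
```

With ``pattern`` = ``" "`` and ``behavior`` = ``"isolated"``, the sketch reproduces the output shown in the table row above. Switching to ``"merged_with_previous"`` keeps each space attached to the word before it, and ``"contiguous"`` only differs from ``"isolated"`` when several delimiters appear in a row.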