Split Pre-Tokenizer (#542)

* start playing around

* make a first version

* refactor

* apply make format

* add python bindings

* add some python binding tests

* correct pre-tokenizers

* update auto-generated bindings

* lint python bindings

* add code node

* add split to docs

* refactor python binding a bit

* cargo fmt

* clippy and fmt in node

* quick updates and fixes

* Oops

* Update node typings

* Update changelog

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
Patrick von Platen
2020-11-27 23:07:03 +01:00
committed by GitHub
parent 58e1d8de67
commit dd399d2ad0
17 changed files with 494 additions and 2 deletions

@ -21,6 +21,16 @@ to customize its behavior. This page lists most provided components.
``Sequence([NFKC(), Lowercase()])``
PreTokenizer.Sequence
``Sequence([Punctuation(), WhitespaceSplit()])``
SplitDelimiterBehavior.removed
:obj:`removed`
SplitDelimiterBehavior.isolated
:obj:`isolated`
SplitDelimiterBehavior.merged_with_previous
:obj:`merged_with_previous`
SplitDelimiterBehavior.merged_with_next
:obj:`merged_with_next`
SplitDelimiterBehavior.contiguous
:obj:`contiguous`
.. entities:: rust
@ -36,6 +46,16 @@ to customize its behavior. This page lists most provided components.
``Sequence::new(vec![NFKC, Lowercase])``
PreTokenizer.Sequence
``Sequence::new(vec![Punctuation, WhitespaceSplit])``
SplitDelimiterBehavior.removed
:obj:`Removed`
SplitDelimiterBehavior.isolated
:obj:`Isolated`
SplitDelimiterBehavior.merged_with_previous
:obj:`MergedWithPrevious`
SplitDelimiterBehavior.merged_with_next
:obj:`MergedWithNext`
SplitDelimiterBehavior.contiguous
:obj:`Contiguous`
.. entities:: node
@ -51,6 +71,16 @@ to customize its behavior. This page lists most provided components.
..
PreTokenizer.Sequence
..
SplitDelimiterBehavior.removed
:obj:`removed`
SplitDelimiterBehavior.isolated
:obj:`isolated`
SplitDelimiterBehavior.merged_with_previous
:obj:`mergedWithPrevious`
SplitDelimiterBehavior.merged_with_next
:obj:`mergedWithNext`
SplitDelimiterBehavior.contiguous
:obj:`contiguous`
Normalizers
----------------------------------------------------------------------------------------------------
@ -203,6 +233,27 @@ the ByteLevel)
Output: ``"Hello", "123", "there"``
* - Split
- Versatile pre-tokenizer that splits on the provided pattern, according to the provided behavior.
The pattern can be inverted if necessary.
- pattern should be either a custom string or a regular expression.
- behavior should be one of:
* :entity:`SplitDelimiterBehavior.removed`
* :entity:`SplitDelimiterBehavior.isolated`
* :entity:`SplitDelimiterBehavior.merged_with_previous`
* :entity:`SplitDelimiterBehavior.merged_with_next`
* :entity:`SplitDelimiterBehavior.contiguous`
- invert should be a boolean flag.
- Example with `pattern` = :obj:`" "`, `behavior` = :obj:`"isolated"`, `invert` = :obj:`False`:
Input: ``"Hello, how are you?"``
Output: ``"Hello,", " ", "how", " ", "are", " ", "you?"``
* - Sequence
- Lets you compose multiple ``PreTokenizer`` that will be run in the given order
- :entity:`PreTokenizer.Sequence`
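
To make the five ``SplitDelimiterBehavior`` values listed for ``Split`` more concrete, here is a minimal pure-Python sketch of how each one might treat the delimiter. It is an illustration under stated assumptions, not the library's implementation: the helper name ``split_with_behavior`` is hypothetical, it only handles a literal string delimiter, and it does not model ``invert`` or regular-expression patterns. Edge cases in the real pre-tokenizer may differ.

```python
import re


def split_with_behavior(text, delimiter, behavior):
    """Hypothetical helper: illustrative sketch, not the library's code."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")

    # Cut the input into (piece, is_delimiter) pairs covering every character.
    pieces = []
    last = 0
    for match in re.finditer(re.escape(delimiter), text):
        if match.start() > last:
            pieces.append((text[last:match.start()], False))
        pieces.append((match.group(), True))
        last = match.end()
    if last < len(text):
        pieces.append((text[last:], False))

    out = []
    if behavior == "removed":
        # Drop every delimiter occurrence.
        out = [piece for piece, is_delim in pieces if not is_delim]
    elif behavior == "isolated":
        # Keep every delimiter occurrence as its own piece.
        out = [piece for piece, _ in pieces]
    elif behavior == "contiguous":
        # Like "isolated", but consecutive delimiters are glued together.
        prev_is_delim = False
        for piece, is_delim in pieces:
            if is_delim and prev_is_delim:
                out[-1] += piece
            else:
                out.append(piece)
            prev_is_delim = is_delim
    elif behavior == "merged_with_previous":
        # Attach a delimiter to the text piece right before it, if any.
        prev_is_text = False
        for piece, is_delim in pieces:
            if is_delim and prev_is_text:
                out[-1] += piece
            else:
                out.append(piece)
            prev_is_text = not is_delim
    elif behavior == "merged_with_next":
        # Hold a delimiter back and prepend it to the text piece that follows.
        pending = ""
        for piece, is_delim in pieces:
            if is_delim:
                if pending:
                    out.append(pending)
                pending = piece
            else:
                out.append(pending + piece)
                pending = ""
        if pending:
            out.append(pending)
    else:
        raise ValueError(f"unknown behavior: {behavior!r}")
    return out
```

With this sketch, ``split_with_behavior("Hello, how are you?", " ", "isolated")`` returns ``["Hello,", " ", "how", " ", "are", " ", "you?"]``, matching the example in the table. On ``"the-final--countdown"`` with ``"-"`` as the delimiter, ``"contiguous"`` yields ``["the", "-", "final", "--", "countdown"]``, while ``"merged_with_next"`` yields ``["the", "-final", "-", "-countdown"]``.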