mirror of https://github.com/mii443/tokenizers.git (synced 2025-08-22 16:25:30 +00:00)

Doc - Update components page
@@ -6,6 +6,52 @@ to customize its behavior. This page lists most provided components.
 
 .. _normalizers:
 
+.. entities:: python
+
+    BertNormalizer.clean_text
+        clean_text
+    BertNormalizer.handle_chinese_chars
+        handle_chinese_chars
+    BertNormalizer.strip_accents
+        strip_accents
+    BertNormalizer.lowercase
+        lowercase
+    Normalizer.Sequence
+        ``Sequence([NFKC(), Lowercase()])``
+    PreTokenizer.Sequence
+        ``Sequence([Punctuation(), WhitespaceSplit()])``
+
+.. entities:: rust
+
+    BertNormalizer.clean_text
+        clean_text
+    BertNormalizer.handle_chinese_chars
+        handle_chinese_chars
+    BertNormalizer.strip_accents
+        strip_accents
+    BertNormalizer.lowercase
+        lowercase
+    Normalizer.Sequence
+        ``Sequence::new(vec![NFKC, Lowercase])``
+    PreTokenizer.Sequence
+        ``Sequence::new(vec![Punctuation, WhitespaceSplit])``
+
+.. entities:: node
+
+    BertNormalizer.clean_text
+        cleanText
+    BertNormalizer.handle_chinese_chars
+        handleChineseChars
+    BertNormalizer.strip_accents
+        stripAccents
+    BertNormalizer.lowercase
+        lowercase
+    Normalizer.Sequence
+        ..
+    PreTokenizer.Sequence
+        ..
+
 Normalizers
 ----------------------------------------------------------------------------------------------------
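The ``Normalizer.Sequence`` and ``PreTokenizer.Sequence`` entities defined above correspond to composing components in code. A minimal sketch, assuming the ``tokenizers`` Python package is installed and using its ``normalize_str`` inspection helper:

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFKC, Lowercase

# Normalizers in a Sequence run in the order they are listed:
# Unicode NFKC normalization first, then lowercasing.
norm = normalizers.Sequence([NFKC(), Lowercase()])
result = norm.normalize_str("Héllo There")
print(result)  # "héllo there"
```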
@@ -65,11 +111,20 @@ The ``Normalizer`` is optional.
 
        Input: ``"banana"``
 
        Output: ``"benene"``
 
+   * - BertNormalizer
+     - Provides an implementation of the Normalizer used in the original BERT. Options
+       that can be set are:
+
+       - :entity:`BertNormalizer.clean_text`
+       - :entity:`BertNormalizer.handle_chinese_chars`
+       - :entity:`BertNormalizer.strip_accents`
+       - :entity:`BertNormalizer.lowercase`
+
+     -
    * - Sequence
      - Composes multiple normalizers that will run in the provided order
-     - Example::
-
-         Sequence([Nmt(), NFKC()])
+     - :entity:`Normalizer.Sequence`
 
 .. _pre-tokenizers:
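The BertNormalizer row added in this hunk can be exercised directly. A sketch assuming the ``tokenizers`` Python package; the four keyword arguments are the options listed in the table:

```python
from tokenizers.normalizers import BertNormalizer

# The four options from the table, set explicitly.
norm = BertNormalizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
)
cleaned = norm.normalize_str("Héllo")
print(cleaned)  # "hello": accent stripped, then lowercased
```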
@@ -142,9 +197,15 @@ the ByteLevel)
 
        Output: ``"Hello", "there"``
 
+   * - Digits
+     - Splits the numbers from any other characters.
+     - Input: ``"Hello123there"``
+
+       Output: ``"Hello", "123", "there"``
+
    * - Sequence
      - Lets you compose multiple ``PreTokenizer`` that will be run in the given order
-     - ``Sequence([Punctuation(), WhitespaceSplit()])``
+     - :entity:`PreTokenizer.Sequence`
 
 .. _models:
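The Digits and Sequence pre-tokenizers added in this hunk can be checked with the ``pre_tokenize_str`` helper. A sketch assuming the ``tokenizers`` Python package:

```python
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits, Punctuation, WhitespaceSplit

# Digits splits numbers from any other characters, keeping character offsets.
digits = Digits(individual_digits=False)
splits = digits.pre_tokenize_str("Hello123there")
print(splits)  # [('Hello', (0, 5)), ('123', (5, 8)), ('there', (8, 13))]

# Sequence applies each pre-tokenizer in the given order:
# punctuation is isolated first, then the result is split on whitespace.
seq = pre_tokenizers.Sequence([Punctuation(), WhitespaceSplit()])
print(seq.pre_tokenize_str("Hello, world!"))
```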
@@ -214,7 +275,7 @@ is the component doing just that.
    * - TemplateProcessing
      - Lets you easily template the post processing, adding special tokens, and specifying
        the ``type_id`` for each sequence/special token. The template is given two strings
        representing the single sequence and the pair of sequences, as well as a set of
        special tokens to use.
      - Example, when specifying a template with these values:
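For TemplateProcessing, a sketch assuming the ``tokenizers`` Python package; the ``[CLS]``/``[SEP]`` token ids used here are illustrative and would normally be looked up in the tokenizer's vocabulary:

```python
from tokenizers.processors import TemplateProcessing

# Templates describe a single sequence and a pair; the ":1" suffix sets the
# type_id for that token or sequence.
processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # (token, id); ids are assumed
)

# The single-sequence template adds two special tokens, the pair template three.
print(processor.num_special_tokens_to_add(False))  # 2
print(processor.num_special_tokens_to_add(True))   # 3
```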