mirror of
https://github.com/mii443/tokenizers.git
synced 2025-08-22 16:25:30 +00:00
Doc - Update components page
@@ -6,6 +6,52 @@ to customize its behavior. This page lists most provided components.
 
 .. _normalizers:
 
+
+.. entities:: python
+
+    BertNormalizer.clean_text
+        clean_text
+    BertNormalizer.handle_chinese_chars
+        handle_chinese_chars
+    BertNormalizer.strip_accents
+        strip_accents
+    BertNormalizer.lowercase
+        lowercase
+    Normalizer.Sequence
+        ``Sequence([NFKC(), Lowercase()])``
+    PreTokenizer.Sequence
+        ``Sequence([Punctuation(), WhitespaceSplit()])``
+
+.. entities:: rust
+
+    BertNormalizer.clean_text
+        clean_text
+    BertNormalizer.handle_chinese_chars
+        handle_chinese_chars
+    BertNormalizer.strip_accents
+        strip_accents
+    BertNormalizer.lowercase
+        lowercase
+    Normalizer.Sequence
+        ``Sequence::new(vec![NFKC, Lowercase])``
+    PreTokenizer.Sequence
+        ``Sequence::new(vec![Punctuation, WhitespaceSplit])``
+
+.. entities:: node
+
+    BertNormalizer.clean_text
+        cleanText
+    BertNormalizer.handle_chinese_chars
+        handleChineseChars
+    BertNormalizer.strip_accents
+        stripAccents
+    BertNormalizer.lowercase
+        lowercase
+    Normalizer.Sequence
+        ..
+    PreTokenizer.Sequence
+        ..
+
 Normalizers
 ----------------------------------------------------------------------------------------------------
 
@@ -65,11 +111,20 @@ The ``Normalizer`` is optional.
          Input: ``"banana"``
          Output: ``"benene"``
 
+   * - BertNormalizer
+     - Provides an implementation of the Normalizer used in the original BERT. Options
+       that can be set are:
+
+         - :entity:`BertNormalizer.clean_text`
+         - :entity:`BertNormalizer.handle_chinese_chars`
+         - :entity:`BertNormalizer.strip_accents`
+         - :entity:`BertNormalizer.lowercase`
+
+     -
+
    * - Sequence
      - Composes multiple normalizers that will run in the provided order
-     - Example::
-
-         Sequence([Nmt(), NFKC()])
+     - :entity:`Normalizer.Sequence`
 
 
 .. _pre-tokenizers:
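As a rough illustration of the ``Sequence`` normalizer row in the hunk above, the sketch below composes two normalization steps in the provided order using only the Python standard library. It is not the library's implementation: ``run_sequence``, ``nfkc`` and ``lowercase`` are hypothetical helper names that only mimic the documented behaviour of ``Sequence([NFKC(), Lowercase()])``.

```python
import unicodedata
from typing import Callable, Iterable


# Hypothetical stand-ins for the NFKC and Lowercase normalizers.
def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


def lowercase(text: str) -> str:
    return text.lower()


def run_sequence(steps: Iterable[Callable[[str], str]], text: str) -> str:
    """Apply each step to the output of the previous one, in order."""
    for step in steps:
        text = step(text)
    return text


# "\uFF28" is FULLWIDTH LATIN CAPITAL LETTER H: NFKC folds it to "H",
# and the lowercase step then turns the result into "hello".
normalized = run_sequence([nfkc, lowercase], "\uFF28ello")
print(normalized)
```

Each step receives the output of the one before it, so the order of the list decides the order in which the normalizations are applied.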
@@ -142,9 +197,15 @@ the ByteLevel)
 
        Output: ``"Hello", "there"``
 
+   * - Digits
+     - Splits the numbers from any other characters.
+     - Input: ``"Hello123there"``
+
+       Output: ``"Hello", "123", "there"``
+
    * - Sequence
      - Lets you compose multiple ``PreTokenizer`` instances that will be run in the given order
-     - ``Sequence([Punctuation(), WhitespaceSplit()])``
+     - :entity:`PreTokenizer.Sequence`
 
 
 .. _models:
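The ``Digits`` and ``Sequence`` pre-tokenizer rows in the hunk above can be pictured with a small standard-library sketch. This is an assumption-laden illustration, not the library's code: ``digits``, ``whitespace_split`` and ``run_sequence`` are hypothetical helpers, and unlike real pre-tokenizers they return plain strings without tracking offsets.

```python
import re
from typing import Callable, List


# Hypothetical stand-in for Digits: separate digit runs from everything else.
def digits(piece: str) -> List[str]:
    return re.findall(r"\d+|\D+", piece)


# Hypothetical stand-in for WhitespaceSplit: split on whitespace only.
def whitespace_split(piece: str) -> List[str]:
    return piece.split()


def run_sequence(splitters: List[Callable[[str], List[str]]], text: str) -> List[str]:
    """Run each splitter, in the given order, on every piece produced so far."""
    pieces = [text]
    for split in splitters:
        pieces = [part for piece in pieces for part in split(piece)]
    return pieces


# Matches the Digits row: "Hello123there" -> "Hello", "123", "there"
print(digits("Hello123there"))

# Whitespace splitting runs first, then each piece is split on digits.
print(run_sequence([whitespace_split, digits], "Hello123 there"))
```

Later splitters only ever see the pieces produced by earlier ones, which is what running "in the given order" amounts to in this sketch.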