Input sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These types represent all the different kinds of sequence that can be used as input to a Tokenizer.
In general, a sequence is either a string or a list of strings, depending on the operating
mode of the tokenizer: ``raw text`` or ``pre-tokenized``.
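
As a minimal sketch of the two operating modes, the example below builds a toy word-level
tokenizer and encodes the same content first as raw text, then as a pre-tokenized list of strings.
The three-entry vocabulary is an assumption made purely for illustration; it is not shipped with
the library.

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import Whitespace

    # Toy vocabulary, assumed for illustration only.
    vocab = {"hello": 0, "world": 1, "[UNK]": 2}
    tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Raw text mode: the input sequence is a plain string.
    raw = tokenizer.encode("hello world")

    # Pre-tokenized mode: the input sequence is a list of strings.
    pre = tokenizer.encode(["hello", "world"], is_pretokenized=True)

    print(raw.tokens)
    print(pre.tokens)

With this toy setup both calls should yield the same tokens; the only difference is who splits the
text into words.
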

.. autodata:: tokenizers.TextInputSequence

.. autodata:: tokenizers.PreTokenizedInputSequence

.. autodata:: tokenizers.InputSequence


Encode inputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These types represent all the different kinds of input that a :class:`~tokenizers.Tokenizer` accepts
when using :meth:`~tokenizers.Tokenizer.encode_batch`.
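
The following sketch shows what a batch passed to :meth:`~tokenizers.Tokenizer.encode_batch` might
look like in each mode. Each batch item is either a single sequence or a pair of sequences. The toy
vocabulary and the variable names are assumptions made for illustration only.

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import Whitespace

    # Toy vocabulary, assumed for illustration only.
    vocab = {"hello": 0, "world": 1, "[UNK]": 2}
    tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Raw text batch: one single sequence and one pair of sequences.
    text_batch = tokenizer.encode_batch(["hello world", ("hello", "world")])

    # Pre-tokenized batch: the same two items, with each sequence given as a list of strings.
    pretok_batch = tokenizer.encode_batch(
        [["hello", "world"], (["hello"], ["world"])],
        is_pretokenized=True,
    )

    for encoding in text_batch + pretok_batch:
        print(encoding.tokens)

No post-processor is configured here, so a pair is simply encoded as its two sequences joined
together, with no special tokens added.
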

.. autodata:: tokenizers.TextEncodeInput

.. autodata:: tokenizers.PreTokenizedEncodeInput

.. autodata:: tokenizers.EncodeInput


Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: tokenizers.Tokenizer
    :members:


Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: tokenizers.Encoding
    :members:


Added Tokens
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: tokenizers.AddedToken
    :members:


Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: tokenizers.models
    :members:

Normalizers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: tokenizers.normalizers
    :members:


Pre-tokenizers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: tokenizers.pre_tokenizers
    :members:


Post-processor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: tokenizers.processors
    :members:


Trainers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: tokenizers.trainers
    :members:


Visualizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: tokenizers.tools.Annotation
    :members:

.. autoclass:: tokenizers.tools.EncodingVisualizer
    :members: __call__