Add a visualization utility to render tokens and annotations in a notebook (#508)

* Draft functionality of visualization

* Added comments to make code more intelligble

* polish the styles

* Ensure colors are stable and comment the css

* Code clean up

* Made visualizer importable and added some docs

* Fix styling

* implement comments from PR

* Fixed the regex for UNK tokens and examples in notebook

* Converted docs to google format

* Added a notebook showing multiple languages and tokenizers

* Added visual indication of chars that are tokenized with >1 token

* Reorganize things a bit and fix import

* Update docs

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
This commit is contained in:
Tal Perry
2020-12-04 16:25:56 +01:00
committed by GitHub
parent 5549fc4837
commit 8916b6bb27
8 changed files with 1651 additions and 1 deletions

View File

@ -41,6 +41,7 @@ setup(
"tokenizers.processors",
"tokenizers.trainers",
"tokenizers.implementations",
"tokenizers.tools",
],
package_data={
"tokenizers": ["py.typed", "__init__.pyi"],
@ -51,6 +52,7 @@ setup(
"tokenizers.processors": ["py.typed", "__init__.pyi"],
"tokenizers.trainers": ["py.typed", "__init__.pyi"],
"tokenizers.implementations": ["py.typed"],
"tokenizers.tools": ["py.typed", "visualizer-styles.css"],
},
zip_safe=False,
)