Actually adding docs.

commit 81bb4f6da3
parent 655809c718
committed by Anthony MOI
.gitignore (vendored) | 3 changed lines
@@ -10,7 +10,8 @@ Cargo.lock
 /data
 tokenizers/data
 bindings/python/tests/data
-/docs
+docs/build/
+docs/make.bat
 
 __pycache__
 pip-wheel-metadata
docs/Makefile (new file) | 20 lines
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docs/source/conf.py (new file) | 56 lines
@@ -0,0 +1,56 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))


# -- Project information -----------------------------------------------------

project = "tokenizers"
copyright = "2020, huggingface"
author = "huggingface"

# The full version, including alpha/beta/rc tags
release = "0.9.0"


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx_tabs.tabs",
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
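Editor's note (not part of the committed files): a minimal sketch of how the Makefile and conf.py above are typically exercised. It assumes Sphinx plus the two third-party packages the configuration relies on (sphinx_rtd_theme and sphinx_tabs, commonly installed via pip as "sphinx-rtd-theme" and "sphinx-tabs") are available, and that it is run from the repository root; the paths mirror SOURCEDIR and BUILDDIR from the Makefile.

    # Hedged sketch: build the HTML docs programmatically, mirroring `make html`.
    # Assumes sphinx, sphinx-rtd-theme and sphinx-tabs are installed (pip names assumed).
    from sphinx.cmd.build import main as sphinx_build

    # "-M html" is Sphinx's make-mode entry point, the same one the Makefile uses;
    # "docs/source" and "docs/build" correspond to SOURCEDIR and BUILDDIR.
    exit_code = sphinx_build(["-M", "html", "docs/source", "docs/build"])
    raise SystemExit(exit_code)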
docs/source/index.rst (new file) | 79 lines
@@ -0,0 +1,79 @@
.. tokenizers documentation master file, created by
   sphinx-quickstart on Fri Sep 25 14:32:54 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to tokenizers' documentation!
======================================

.. toctree::

   tokenizer_blocks

Getting started
==================

Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.

Main features:
--------------

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
  less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the
  original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
- Bindings to Rust, Python and Node.

Load an existing tokenizer:
---------------------------

.. tabs::

   .. group-tab:: Rust

      .. literalinclude:: ../../tokenizers/examples/load.rs
         :language: rust
         :emphasize-lines: 4

   .. group-tab:: Python

      .. literalinclude:: ../../bindings/python/tests/examples/test_load.py
         :language: python
         :emphasize-lines: 4

   .. group-tab:: Node

      .. literalinclude:: ../../bindings/node/examples/load.test.ts
         :language: typescript
         :emphasize-lines: 11
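Editor's note (not part of the committed file): the Python tab pulls its code from bindings/python/tests/examples/test_load.py, which this diff does not include. A hedged sketch of what loading looks like with the Python binding; the file name and sample sentence below are illustrative assumptions.

    # Editor's sketch, not the literalinclude'd test file: load a serialized
    # tokenizer and encode a sentence. "tokenizer.json" is an assumed path.
    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("tokenizer.json")
    encoding = tokenizer.encode("Hello, y'all! How are you?")
    print(encoding.tokens)   # the produced tokens
    print(encoding.offsets)  # (start, end) offsets into the original string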
Train a tokenizer:
------------------

See the small guide on :ref:`the building blocks used to create a Tokenizer <tokenizer_blocks>`.

.. tabs::

   .. group-tab:: Rust

      .. literalinclude:: ../../tokenizers/examples/train.rs
         :language: rust

   .. group-tab:: Python

      .. literalinclude:: ../../bindings/python/tests/examples/test_train.py
         :language: python

   .. group-tab:: Node

      .. literalinclude:: ../../bindings/node/examples/train.test.ts
         :language: typescript
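Editor's note (not part of the committed file): likewise, test_train.py is referenced but not shown here. A hedged sketch of a typical training run with the Python binding; the corpus path, vocabulary size and special tokens are assumptions, and exact argument names or ordering can differ between versions of the library.

    # Editor's sketch of training, not the referenced test file.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))   # start from an empty BPE model
    tokenizer.pre_tokenizer = Whitespace()          # split on word boundaries first

    trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
    tokenizer.train(files=["data/corpus.txt"], trainer=trainer)  # signature may vary by version

    tokenizer.save("tokenizer.json")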
docs/source/tokenizer_blocks.rst (new file) | 135 lines
@@ -0,0 +1,135 @@
Models
======

.. _tokenizer_blocks:

Models are the core algorithms that a tokenizer relies on.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
   * - BPE
     - Works by looking at the most frequent pairs in a dataset, and iteratively merging them into new tokens
   * - Unigram
     - Works by building a suffix array and using an EM algorithm to find the most suitable tokens
   * - WordPiece
     - ...
   * - WordLevel
     - ...
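Editor's note (not part of the committed file): a hedged Python sketch of wiring one of the models above into a Tokenizer; the "[UNK]" token is an assumed choice, and an untrained BPE model starts with an empty vocabulary.

    # Editor's sketch: a Tokenizer is built around one of the models listed above.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # Empty BPE model; its vocabulary and merges come from training or from files.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))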
Normalizers
===========

A normalizer takes a unicode string as input and modifies it to make it more uniform for the
underlying algorithm, usually fixing some unicode quirks. The specificity of ``tokenizers`` is
that we keep track of all offsets to know how a string was normalized, which is especially
useful to debug a tokenizer.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example
   * - NFD
     - NFD unicode normalization
     -
   * - NFKD
     - NFKD unicode normalization
     -
   * - NFC
     - NFC unicode normalization
     -
   * - NFKC
     - NFKC unicode normalization
     -
   * - Lowercase
     - Replaces all uppercase characters with their lowercase equivalent
     - "HELLO ὈΔΥΣΣΕΎΣ" -> "hello ὀδυσσεύς"
   * - Strip
     - Removes all whitespace characters from both sides of the input
     - " hi " -> "hi"
   * - StripAccents
     - Removes all accent symbols in unicode (to be used with NFD for consistency)
     - "é" -> "e"
   * - Nmt
     - Removes some control characters and zero-width characters
     - "\u200d" -> ""
   * - Replace
     - Replaces a custom string or regexp with the given content
     - Replace("a", "e")("banana") -> "benene"
   * - Sequence
     - Composes multiple normalizers
     - Sequence([Nmt(), NFKC()])
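Editor's note (not part of the committed file): a hedged Python sketch composing normalizers from the table above. normalize_str is the inspection helper exposed by the Python binding (availability may depend on the version); the sample string is the editor's.

    # Editor's sketch: compose several normalizers and inspect the result.
    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents, Lowercase

    normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
    # Decompose, drop the combining accent marks, then lowercase.
    print(normalizer.normalize_str("Héllo HOW are ü?"))  # -> "hello how are u?"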
Pre tokenizers
==============

A pre tokenizer splits the input string *before* it reaches the model; it's often used for efficiency.
It can also replace some characters.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example
   * - ByteLevel
     - Splits on spaces but remaps all bytes into a visible range (used in GPT-2)
     - "Hello my friend, how are you?" -> "Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"
   * - Whitespace
     - Splits on word boundaries
     - "Hello there!" -> "Hello", "there", "!"
   * - WhitespaceSplit
     - Splits on spaces
     - "Hello there!" -> "Hello", "there!"
   * - Punctuation
     - Isolates all punctuation characters
     - "Hello?" -> "Hello", "?"
   * - Metaspace
     - Splits on spaces and replaces them with a special character
     - Metaspace("_", false)("Hello there") -> "Hello", "_there"
   * - CharDelimiterSplit
     - Splits on a given character
     - CharDelimiterSplit("x")("Helloxthere") -> "Hello", "there"
   * - Sequence
     - Composes multiple pre_tokenizers
     - Sequence([Punctuation(), WhitespaceSplit()])
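Editor's note (not part of the committed file): a hedged Python sketch matching the Whitespace row above; pre_tokenize_str is the inspection helper of the Python binding (availability may depend on the version).

    # Editor's sketch: run a pre-tokenizer on its own and look at the pieces and offsets.
    from tokenizers.pre_tokenizers import Whitespace

    pre_tokenizer = Whitespace()
    print(pre_tokenizer.pre_tokenize_str("Hello there!"))
    # -> [("Hello", (0, 5)), ("there", (6, 11)), ("!", (11, 12))]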
Decoders
========

As some normalizers and pre_tokenizers change some characters, we want to revert those changes
to get back readable strings.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
   * - ByteLevel
     - Reverts the ByteLevel pre_tokenizer
   * - Metaspace
     - Reverts the Metaspace pre_tokenizer
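Editor's note (not part of the committed file): a hedged Python sketch of the ByteLevel decoder undoing the byte-level remapping shown earlier; the token list is the editor's example.

    # Editor's sketch: turn byte-level tokens (with the "Ġ" space marker) back into text.
    from tokenizers.decoders import ByteLevel

    decoder = ByteLevel()
    print(decoder.decode(["Hello", "Ġmy", "Ġfriend"]))  # -> "Hello my friend"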
PostProcessor
=============

After the whole pipeline, we sometimes want to insert some specific markers before feeding
the tokenized string into a model, like "`<cls>` My horse is amazing `<eos>`".

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example
   * - TemplateProcessing
     - It should cover most needs. `seq_a` is the list of outputs for a single sentence; `seq_b` is used when encoding a pair of sentences
     - TemplateProcessing(seq_a = ["<cls>", "$0", "<eos>"], seq_b = ["$1", "<eos>"]) ("I like this", "but not this") -> "<cls> I like this <eos> but not this <eos>"
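Editor's note (not part of the committed file): a hedged Python sketch of a template post-processor. The released Python binding exposes this through single/pair/special_tokens parameters with $A/$B placeholders rather than the seq_a/seq_b form sketched in the table above; the token ids below are made-up assumptions.

    # Editor's sketch: attach special markers around one or two encoded sequences.
    # Parameter names follow the released Python API; the ids 1 and 2 are assumed.
    from tokenizers.processors import TemplateProcessing

    post_processor = TemplateProcessing(
        single="<cls> $A <eos>",
        pair="<cls> $A <eos> $B <eos>",
        special_tokens=[("<cls>", 1), ("<eos>", 2)],
    )
    # Typically assigned with: tokenizer.post_processor = post_processor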