Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. (#1331)

* Move to maturin mimicking move for `safetensors`.
* Tmp.
* Fix sdist.
* Wat?
* Clippy 1.72
* Remove if.
* Conda sed.
* Fix doc check workflow.
* Moving to maturin AND removing http + openssl mess (smoothing transition moving to `huggingface_hub`)
* Fix dep
* Black.
* New node bindings.
* Fix docs + node cache ?
* Yarn.
* Working dir.
* Extension module.
* Put back interpreter.
* Remove cache.
* New attempt
* Multi python.
* Remove FromPretrained.
* Remove traces of `fromPretrained`.
* Drop 3.12 for windows?
* Typo.
* Put back the default feature for ignoring links during simple test.
* Fix ?
* x86_64 -> x64.
* Remove warning for windows bindings.
* Exclude aarch.
* Include/exclude.
* Put back workflows in correct states.
@@ -105,7 +105,7 @@ one with a BPE model:
         :language: javascript
         :start-after: START init_tokenizer
         :end-before: END init_tokenizer
-        :dedent: 8
+        :dedent: 4
 
 To train our tokenizer on the wikitext files, we will need to instantiate a `trainer`, in this case
 a :entity:`BpeTrainer`
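
The node snippet this hunk re-indents lives in the bindings' example file and is not shown in the diff. For orientation only, the same step in the library's Python quicktour (initializing a tokenizer with a BPE model) looks like this::

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # A BPE model with an explicit unknown token for out-of-vocabulary characters
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
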
@@ -132,7 +132,7 @@ a :entity:`BpeTrainer`
         :language: javascript
         :start-after: START init_trainer
         :end-before: END init_trainer
-        :dedent: 8
+        :dedent: 4
 
 We can set the training arguments like :entity:`vocab_size` or :entity:`min_frequency` (here left at
 their default values of 30,000 and 0) but the most important part is to give the
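
For reference (Python API, not the node code under change), the trainer from this step is built with the special tokens listed in the order we want them to receive IDs::

    from tokenizers.trainers import BpeTrainer

    # Special tokens are registered first: [UNK]=0, [CLS]=1, [SEP]=2, [PAD]=3, [MASK]=4
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
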
@@ -173,7 +173,7 @@ on whitespace.
         :language: javascript
         :start-after: START init_pretok
         :end-before: END init_pretok
-        :dedent: 8
+        :dedent: 4
 
 Now, we can just call the :entity:`Tokenizer.train` method with any list of files we want
 to use:
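
The pre-tokenizer set up by this snippet splits inputs on whitespace and punctuation so that no token spans more than one word; the Python equivalent, for reference::

    from tokenizers.pre_tokenizers import Whitespace

    # Split on word boundaries before the BPE model runs
    tokenizer.pre_tokenizer = Whitespace()
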
@@ -200,7 +200,7 @@ to use:
         :language: javascript
         :start-after: START train
         :end-before: END train
-        :dedent: 8
+        :dedent: 4
 
 This should only take a few seconds to train our tokenizer on the full wikitext dataset!
 To save the tokenizer in one file that contains all its configuration and vocabulary, just use the
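
For reference, the training call in the Python quicktour passes the list of files together with the trainer; the paths below assume the wikitext-103-raw download used throughout the quicktour::

    files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
    tokenizer.train(files, trainer)
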
@@ -228,7 +228,7 @@ To save the tokenizer in one file that contains all its configuration and vocabu
         :language: javascript
         :start-after: START save
         :end-before: END save
-        :dedent: 8
+        :dedent: 4
 
 and you can reload your tokenizer from that file with the :entity:`Tokenizer.from_file`
 :entity:`classmethod`:
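
Saving serializes everything (model, learned vocabulary, pre-tokenizer, and so on) into a single JSON file; in Python, with an illustrative file name::

    # One self-contained file holds the full tokenizer configuration
    tokenizer.save("tokenizer-wiki.json")
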
@@ -255,7 +255,7 @@ and you can reload your tokenizer from that file with the :entity:`Tokenizer.fro
         :language: javascript
         :start-after: START reload_tokenizer
         :end-before: END reload_tokenizer
-        :dedent: 8
+        :dedent: 4
 
 Using the tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
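
Reloading mirrors the save call; in the Python API::

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("tokenizer-wiki.json")
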
@@ -285,7 +285,7 @@ Now that we have trained a tokenizer, we can use it on any text we want with the
         :language: javascript
         :start-after: START encode
         :end-before: END encode
-        :dedent: 8
+        :dedent: 4
 
 This applied the full pipeline of the tokenizer on the text, returning an
 :entity:`Encoding` object. To learn more about this pipeline, and how to apply (or
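
The sentence used throughout the quicktour, encoded via the Python API for reference::

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
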
@@ -317,7 +317,7 @@ tokens:
         :language: javascript
         :start-after: START print_tokens
         :end-before: END print_tokens
-        :dedent: 8
+        :dedent: 4
 
 Similarly, the :obj:`ids` attribute will contain the index of each of those tokens in the
 tokenizer's vocabulary:
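
With the tokenizer trained above, the :obj:`tokens` attribute splits the sentence as the quicktour documents (the emoji, unseen in wikitext, falls back to the unknown token)::

    print(output.tokens)
    # ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
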
@@ -344,7 +344,7 @@ tokenizer's vocabulary:
         :language: javascript
         :start-after: START print_ids
         :end-before: END print_ids
-        :dedent: 8
+        :dedent: 4
 
 An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking,
 meaning you can always get the part of your original sentence that corresponds to a given token.
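
And the matching vocabulary indices, in Python::

    print(output.ids)
    # exact values depend on the learned vocabulary; [UNK] maps to 0 here
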
@@ -374,7 +374,7 @@ which is the token at index 9 in the list, we can just ask for the offset at the
         :language: javascript
         :start-after: START print_offsets
         :end-before: END print_offsets
-        :dedent: 8
+        :dedent: 4
 
 and those are the indices that correspond to the emoji in the original sentence:
 
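
The offset lookup and the slice back into the original sentence, in Python for reference (index 9 is the [UNK] token produced by the emoji)::

    print(output.offsets[9])
    # (26, 27)
    sentence = "Hello, y'all! How are you 😁 ?"
    print(sentence[26:27])
    # "😁"
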
@@ -400,7 +400,7 @@ and those are the indices that correspond to the emoji in the original sentence:
         :language: javascript
         :start-after: START use_offsets
         :end-before: END use_offsets
-        :dedent: 8
+        :dedent: 4
 
 Post-processing
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -436,7 +436,7 @@ list of special tokens, so this should be their IDs. To double-check, we can use
         :language: javascript
         :start-after: START check_sep
         :end-before: END check_sep
-        :dedent: 8
+        :dedent: 4
 
 Here is how we can set the post-processing to give us the traditional BERT inputs:
 
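
The double-check mentioned above uses :entity:`Tokenizer.token_to_id`; in Python::

    # [CLS] and [SEP] were the second and third entries of the trainer's special-token list
    print(tokenizer.token_to_id("[SEP]"))
    # 2
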
@@ -462,7 +462,7 @@ Here is how we can set the post-processing to give us the traditional BERT input
         :language: javascript
         :start-after: START init_template_processing
         :end-before: END init_template_processing
-        :dedent: 8
+        :dedent: 4
 
 Let's go over this snippet of code in more detail. First we specify the template for single
 sentences: those should have the form :obj:`"[CLS] $A [SEP]"` where :obj:`$A` represents our
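
The template processor configured by this snippet, written out in the Python API for reference::

    from tokenizers.processors import TemplateProcessing

    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        # IDs as returned by token_to_id above
        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
    )
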
@@ -500,7 +500,7 @@ To check that this worked properly, let's try to encode the same sentence as befo
         :language: javascript
         :start-after: START print_special_tokens
         :end-before: END print_special_tokens
-        :dedent: 8
+        :dedent: 4
 
 To check the results on a pair of sentences, we just pass the two sentences to
 :entity:`Tokenizer.encode`:
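
With the post-processor attached, encoding the earlier sentence now yields the BERT-style framing (Python, for reference)::

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.tokens)
    # ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]
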
@@ -527,7 +527,7 @@ To check the results on a pair of sentences, we just pass the two sentences to
         :language: javascript
         :start-after: START print_special_tokens_pair
         :end-before: END print_special_tokens_pair
-        :dedent: 8
+        :dedent: 4
 
 You can then check that the type IDs attributed to each token are correct with
 
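
In Python, the pair goes straight into :entity:`Tokenizer.encode`, and :obj:`type_ids` shows which segment each token belongs to::

    output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
    print(output.tokens)
    # ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
    print(output.type_ids)
    # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
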
@@ -553,7 +553,7 @@ You can then check that the type IDs attributed to each token are correct with
         :language: javascript
         :start-after: START print_type_ids
         :end-before: END print_type_ids
-        :dedent: 8
+        :dedent: 4
 
 If you save your tokenizer with :entity:`Tokenizer.save`, the post-processor will be saved along with it.
 
@@ -585,7 +585,7 @@ using the :entity:`Tokenizer.encode_batch` method:
         :language: javascript
         :start-after: START encode_batch
         :end-before: END encode_batch
-        :dedent: 8
+        :dedent: 4
 
 The output is then a list of :entity:`Encoding` objects like the ones we saw before. You
 can process together as many texts as you like, as long as they fit in memory.
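
Batch encoding in the Python API, for reference::

    output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
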
@@ -616,7 +616,7 @@ B:
         :language: javascript
         :start-after: START encode_batch_pair
         :end-before: END encode_batch_pair
-        :dedent: 8
+        :dedent: 4
 
 When encoding multiple sentences, you can automatically pad the outputs to the longest sentence
 present by using :entity:`Tokenizer.enable_padding`, with the :entity:`pad_token` and its ID
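
To process pairs in a batch, each element is itself a two-item list (Python, for reference)::

    output = tokenizer.encode_batch(
        [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
    )
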
@@ -645,7 +645,7 @@ present by using :entity:`Tokenizer.enable_padding`, with the :entity:`pad_token
         :language: javascript
         :start-after: START enable_padding
         :end-before: END enable_padding
-        :dedent: 8
+        :dedent: 4
 
 We can set the :obj:`direction` of the padding (defaults to the right) or a given :obj:`length` if
 we want to pad every sample to that specific number (here we leave it unset to pad to the size of
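
Enabling padding in Python; [PAD] received ID 3 from the trainer's special-token list above::

    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
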
@@ -673,7 +673,7 @@ the longest text).
         :language: javascript
         :start-after: START print_batch_tokens
         :end-before: END print_batch_tokens
-        :dedent: 8
+        :dedent: 4
 
 In this case, the `attention mask` generated by the tokenizer takes the padding into account:
 
@@ -699,7 +699,7 @@ In this case, the `attention mask` generated by the tokenizer takes the padding
         :language: javascript
         :start-after: START print_attention_mask
         :end-before: END print_attention_mask
-        :dedent: 8
+        :dedent: 4
 
 .. _pretrained:
 
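
With padding enabled, re-running the batch pads the shorter encoding to the longest one, and the attention mask marks the padded position with a 0 (Python, for reference)::

    output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
    print(output[1].tokens)
    # ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
    print(output[1].attention_mask)
    # [1, 1, 1, 1, 1, 1, 1, 0]
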