README.md
@@ -59,7 +59,7 @@ Then training your tokenizer on a set of files just takes two lines of code:
 from tokenizers.trainers import BpeTrainer
 
 trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
-tokenizer.train(trainer, ["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"])
+tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
 ```
 
 Once your tokenizer is trained, encode any text with just one line:
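This hunk swaps the old positional call `tokenizer.train(trainer, files)` for the keyword-argument form. A minimal end-to-end sketch of the pipeline under the new signature; the `Tokenizer`/`BPE`/`Whitespace` setup lines are assumed from the library's quicktour and are not part of this diff:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer (assumed setup; the diff shows only the training lines).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Reserve the special tokens so they get stable ids in the trained vocabulary.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# New signature: files and trainer are passed as keyword arguments.
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```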
@@ -70,9 +70,5 @@ print(output.tokens)
 ```
 
 Check the [python documentation](https://huggingface.co/docs/tokenizers/python/latest) or the
-[python quicktour](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) to learn
-more!
-
-## Contributors
-
-[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/0)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/0)[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/1)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/1)[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/2)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/2)[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/3)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/3)[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/4)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/4)[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/5)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/5)[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/6)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/6)[![](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/images/7)](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/7)
+[python quicktour](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) to learn more!
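The hunk's anchor line, `print(output.tokens)`, is the tail of the README's encoding example. A minimal sketch of that step, assuming the trained `tokenizer` from the sketch above; the input sentence is illustrative:

```python
# Encode one sentence; encode() returns an Encoding object.
output = tokenizer.encode("Hello, y'all! How are you?")

# Subword tokens produced by the trained BPE model.
print(output.tokens)

# The matching vocabulary ids, ready to feed to a model.
print(output.ids)
```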