README.md
@@ -59,7 +59,7 @@ Then training your tokenizer on a set of files just takes two lines of codes:
 from tokenizers.trainers import BpeTrainer
 
 trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
-tokenizer.train(trainer, ["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"])
+tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
 ```
 
 Once your tokenizer is trained, encode any text with just one line:
@@ -69,10 +69,6 @@ print(output.tokens)
 # ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
 ```
 
 Check the [python documentation](https://huggingface.co/docs/tokenizers/python/latest) or the
-[python quicktour](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) to learn more!
-
-## Contributors
-
-[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/0)[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/1)[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/2)[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/3)[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/4)[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/5)[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/6)[](https://sourcerer.io/fame/clmnt/huggingface/tokenizers/links/7)
-
+[python quicktour](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) to learn
+more!
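The substantive change in this diff is the `Tokenizer.train` call: the positional form `train(trainer, files)` becomes the keyword form `train(files=..., trainer=...)`. As context, here is a minimal end-to-end sketch of the surrounding README example using the updated signature; the BPE model, the `Whitespace` pre-tokenizer, and the example sentence are assumed from the library's quicktour and are not part of this diff.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Assumed setup (from the quicktour, not from this diff): a BPE tokenizer
# with an unknown-token placeholder and whitespace pre-tokenization.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train with the keyword-argument form used by the new README snippet.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)

# Encode a sentence; anything not covered by the learned vocabulary maps to [UNK].
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```

Passing `files` and `trainer` by keyword makes it unambiguous which argument is the training data and which is the trainer configuration, which is the point of the one-line change above.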