mirror of https://github.com/mii443/tokenizers.git
Update README.md
README.md (14 changed lines)
@@ -42,8 +42,8 @@ Start using in a matter of seconds:
 # Tokenizers can be easily instantiated from standard files
 >>> tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
 Tokenizer(vocabulary_size=30522, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK],
-sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True,
-strip_accents=True, lowercase=True, wordpieces_prefix=##)
+sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True,
+strip_accents=True, lowercase=True, wordpieces_prefix=##)

 # Tokenizers provide exhaustive outputs: tokens, mapping to original string, attention/special token masks.
 # They also handle model's max input lengths as well as padding (to directly encode in padded batches)
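The comments in the hunk above describe the outputs returned by `encode`. A minimal sketch of reading them with the Python `tokenizers` package follows; the vocab file, example sentence, and the padding/truncation settings are illustrative assumptions, not part of the diff:

```python
from tokenizers import BertWordPieceTokenizer

# Assumes the same local vocab file used in the snippet above
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

# Handle the model's max input length and pad so batches encode to equal lengths
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_token="[PAD]")

output = tokenizer.encode("Hello, y'all! How are you?")
print(output.tokens)               # wordpiece tokens, e.g. ['[CLS]', 'hello', ',', ...]
print(output.offsets)              # (start, end) character spans into the original string
print(output.attention_mask)       # 1 for real tokens, 0 for padding
print(output.special_tokens_mask)  # 1 where special tokens like [CLS]/[SEP] were inserted
```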
@@ -63,11 +63,11 @@ And training a new vocabulary is just as easy:

 ```python
 # You can also train a BPE/Byte-levelBPE/WordPiece vocabulary on your own files
-tokenizer = ByteLevelBPETokenizer()
-tokenizer.train(["wiki.test.raw"], vocab_size=20000)
->>> [00:00:00] Tokenize words ████████████████████████████████████████ 20993/20993
->>> [00:00:00] Count pairs ████████████████████████████████████████ 20993/20993
->>> [00:00:03] Compute merges ████████████████████████████████████████ 19375/19375
+>>> tokenizer = ByteLevelBPETokenizer()
+>>> tokenizer.train(["wiki.test.raw"], vocab_size=20000)
+[00:00:00] Tokenize words ████████████████████████████████████████ 20993/20993
+[00:00:00] Count pairs ████████████████████████████████████████ 20993/20993
+[00:00:03] Compute merges ████████████████████████████████████████ 19375/19375
 ```

 ## Bindings
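For the training snippet in the hunk above, a minimal sketch of persisting and reloading the learned vocabulary, assuming a recent version of the Python `tokenizers` package; the output directory, sample sentence, and `vocab.json`/`merges.txt` file names (the usual Byte-level BPE artifacts) are assumptions, not part of the diff:

```python
from tokenizers import ByteLevelBPETokenizer

# Train on a local text file, as in the README snippet above
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(["wiki.test.raw"], vocab_size=20000, min_frequency=2)

# Persist the learned vocabulary and merge rules (writes vocab.json / merges.txt)
tokenizer.save_model(".")

# Reload them later and encode as usual
reloaded = ByteLevelBPETokenizer("vocab.json", "merges.txt")
print(reloaded.encode("Training a byte-level BPE vocabulary is fast.").tokens)
```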