SentencePiece
SentencePiece | |
---|---|
English Name | SentencePiece |
Abbreviation | |
A type of tokenizer often used in multilingual models, unlike WordPiece.
The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters. Another special feature of SentencePiece is that whitespace is assigned the Unicode symbol U+2581, or the ▁ character, also called the lower one quarter block character. (Natural Language Processing with Transformers: Building Language Applications with Hugging Face)
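The U+2581 whitespace marker makes detokenization lossless: tokens can be concatenated and the marker mapped back to a space. A minimal sketch of this convention follows; the subword split shown is a hypothetical example, since the real segmentation is learned from data by the Unigram model, and this does not use the actual SentencePiece library.

```python
# The Unicode "lower one quarter block" character that SentencePiece
# uses to mark word-initial whitespace.
WS = "\u2581"  # "▁"

# A hypothetical subword segmentation of "Hello world!" (illustrative;
# a trained Unigram model determines the actual split).
tokens = [WS + "Hello", WS + "wor", "ld", "!"]

# Detokenization is trivially reversible: join the pieces, map the
# marker back to an ordinary space, and strip the leading one.
decoded = "".join(tokens).replace(WS, " ").lstrip()
print(decoded)  # Hello world!
```

Because the marker is an ordinary Unicode character rather than a special token, the same round-trip works for languages without whitespace, where no token carries the marker at all.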