SentencePiece: differences between revisions

From Wiki AI.

Revision as of 20:05, 7 July 2024

SentencePiece
English name: SentencePiece
Acronym:

A type of tokenizer often used in multilingual models, in contrast to WordPiece.

The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters. Another special feature of SentencePiece is that whitespace is assigned the Unicode symbol U+2581, or the ▁ character, also called the lower one quarter block character. (Natural Language Processing with Transformers: Building Language Applications with Hugging Face)
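
The whitespace handling described in the quotation can be illustrated with a minimal sketch in plain Python. It does not use the SentencePiece library and does not implement Unigram segmentation: it only shows how replacing spaces with U+2581 makes the character sequence reversible, even for text without spaces. The function names `escape_whitespace`, `to_char_pieces`, and `detokenize` are hypothetical, and the leading ▁ added at the start of the text is an assumption that mirrors a common tokenizer setting. The sketch also checks the official Unicode name of U+2581, which the standard library reports as LOWER ONE EIGHTH BLOCK rather than "lower one quarter block".

```python
import unicodedata
from typing import List

# The meta symbol that stands in for whitespace (U+2581, "▁").
WS = "\u2581"


def escape_whitespace(text: str) -> str:
    """Replace every space with ▁ and prepend one ▁ (assumed dummy prefix)."""
    return WS + text.replace(" ", WS)


def to_char_pieces(text: str) -> List[str]:
    """Split the escaped text into single Unicode characters.

    A real tokenizer would merge these characters into subword pieces;
    this sketch stops at the character level.
    """
    return list(escape_whitespace(text))


def detokenize(pieces: List[str]) -> str:
    """Invert to_char_pieces: join, drop the one leading ▁, restore spaces.

    Simplification: text that already contains a literal ▁ will not
    round-trip correctly.
    """
    joined = "".join(pieces)
    if joined.startswith(WS):
        joined = joined[1:]
    return joined.replace(WS, " ")


if __name__ == "__main__":
    for sample in ["Hello world", "こんにちは世界"]:
        pieces = to_char_pieces(sample)
        print(pieces, "->", detokenize(pieces))
    print(unicodedata.name(WS))
```

Because the space is kept as an ordinary symbol inside the sequence, no language-specific rule is needed to rebuild the original text, which is the property the quotation highlights for languages such as Japanese.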