SentencePiece: differences between revisions

Current revision as of 13:47, 17 Aug 2024

SentencePiece
English name	SentencePiece
Abbreviation	SP

A type of tokenizer often used in multilingual models, unlike WordPiece.

The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters. Another special feature of SentencePiece is that whitespace is assigned the Unicode symbol U+2581, or the ▁ character, also called the lower one quarter block character [the official Unicode name of U+2581 is LOWER ONE EIGHTH BLOCK]. (Natural Language Processing with Transformers: Building Language Applications with Hugging Face)
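
The whitespace handling described in the quoted passage can be illustrated with a small sketch. It is not the SentencePiece library's own code: the function names `escape_whitespace` and `restore_whitespace` are hypothetical, and the piece list in the usage example is an invented segmentation, not the output of a trained model. The sketch assumes the library's default behaviour of prefixing the text with one marker before segmentation.

```python
# Illustrative sketch only; names are hypothetical, not SentencePiece API.
WS = "\u2581"  # U+2581, the marker that stands in for a space


def escape_whitespace(text: str) -> str:
    """Replace spaces with U+2581 and prepend one marker.

    Assumption: this mirrors the default pre-processing, in which the
    text is treated as if it started with a space.
    """
    return WS + text.replace(" ", WS)


def restore_whitespace(pieces: list[str]) -> str:
    """Join subword pieces and turn the markers back into spaces."""
    text = "".join(pieces).replace(WS, " ")
    # Drop the single leading space introduced by the prepended marker.
    return text[1:] if text.startswith(" ") else text


if __name__ == "__main__":
    escaped = escape_whitespace("Hello world.")
    print(escaped)
    # Hypothetical segmentation of the escaped text into subword pieces.
    pieces = [WS + "Hello", WS + "wor", "ld", "."]
    print(restore_whitespace(pieces))
```

Because the space is kept as an ordinary symbol inside the pieces, detokenization is a plain string join followed by one replacement, and it works the same way for text that contains no spaces at all, such as Japanese.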

@@ Line 1: / Line 1: @@
 {{Template concetto
 |NomeInglese=SentencePiece
+|Sigla=SP
 }}
-A type of [[tokenizer]] often used in multilingual models, unlike [[Wordpiece]].
+A type of [[tokenizer]] often used in multilingual models, unlike [[Wordpiece]].<blockquote>The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters. Another special feature of SentencePiece is that whitespace is assigned the Unicode symbol U+2581, or the ▁ character, also called the lower one quarter block character [the official Unicode name of U+2581 is LOWER ONE EIGHTH BLOCK]. (Natural Language Processing with Transformers: Building Language Applications with Hugging Face)</blockquote>
+{{#seo:
+            |title=SentencePiece
+            |title_mode=append
+            |keywords=tokenizer, multilingual model, Wordpiece, subword, Unigram, Unicode, Japanese, whitespace
+            |description=SentencePiece is a type of tokenizer based on Unigram subword segmentation, used mainly in multilingual models. It encodes each input text as a sequence of Unicode characters, which makes it agnostic to accents, punctuation, and the absence of whitespace in languages such as Japanese.
+            }}