Paper Title


Sequence to sequence pretraining for a less-resourced Slovenian language

Paper Authors

Matej Ulčar, Marko Robnik-Šikonja

Abstract


Large pretrained language models have recently conquered the area of natural language processing. As an alternative to the predominant masked language modelling introduced in BERT, the T5 model introduced a more general training objective, namely sequence-to-sequence transformation, which subsumes masked language modelling but more naturally fits text generation tasks such as machine translation, summarization, question answering, text simplification, dialogue systems, etc. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. In contrast, we trained two different sized T5-type sequence-to-sequence models for the morphologically rich Slovene language, which has far fewer resources, and analyzed their behavior on 11 tasks. Concerning classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model, but they are useful for generative tasks.
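The abstract notes that T5's sequence-to-sequence objective subsumes masked language modelling: masked spans in the input are regenerated as output text. The minimal sketch below illustrates this text-to-text usage with the HuggingFace transformers library; the model identifier "cjvt/t5-sl-small" and the Slovene example sentence are assumptions for illustration and may not match the released SloT5 checkpoints.

```python
# A minimal sketch of T5-style text-to-text inference, assuming a Slovene T5
# checkpoint is available on the HuggingFace Hub under the hypothetical
# identifier "cjvt/t5-sl-small".
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "cjvt/t5-sl-small"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Span-corruption style input: the sentinel token <extra_id_0> marks a masked
# span, showing how masked language modelling becomes a special case of
# sequence-to-sequence transformation (input text -> output text).
input_text = "Ljubljana je <extra_id_0> mesto Slovenije."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

The same interface covers the generative tasks mentioned in the abstract (summarization, simplification, question answering): only the input and target texts change, not the model architecture.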
