Paper Title
Multilingual Sequence-to-Sequence Models for Hebrew NLP
Paper Authors
Paper Abstract
Recent work attributes progress in NLP to large language models (LMs) with increased model size and large quantities of pretraining data. Despite this, current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages. Additionally, previous work on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only architecture is beneficial for classification tasks, it does not cater well to sub-word prediction tasks, such as Named Entity Recognition, when considering the morphologically rich nature of Hebrew. In this paper, we argue that sequence-to-sequence generative architectures are more suitable for LLMs in the case of morphologically rich languages (MRLs) such as Hebrew. We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5, eliminating the need for a specialized, morpheme-based, separately fine-tuned decoder. Using this approach, our experiments show substantial improvements over previously published results on existing Hebrew NLP benchmarks. These results suggest that multilingual sequence-to-sequence models present a promising building block for NLP for MRLs.
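
To make the text-to-text casting concrete, below is a minimal sketch of the idea, assuming the Hugging Face transformers library and the public google/mt5-small checkpoint; the "ner:" task prefix and the span-plus-tag output format here are illustrative assumptions, not the exact scheme used in the paper.

# Minimal sketch (not the authors' exact setup): cast Hebrew NER as a
# text-to-text task and fine-tune a pretrained multilingual seq2seq model,
# so no specialized morpheme-based decoder is needed.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_name = "google/mt5-small"  # smallest mT5 checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Cast NER as generation: the model reads a sentence and is trained to emit
# entity spans with their types as plain text (format is a made-up example).
source = "ner: דוד בן גוריון נאם בתל אביב"        # hypothetical task prefix
target = "דוד בן גוריון <PER> | תל אביב <LOC>"    # hypothetical label format

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One fine-tuning step: standard seq2seq cross-entropy loss on the target text.
loss = model(**inputs, labels=labels).loss
loss.backward()

# At inference time, predictions are decoded as free text in the same format.
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

Because both the input and the output are plain strings, the same pretrained encoder-decoder can serve every task in the pipeline; only the task prefix and the target serialization change per task.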