Paper Title

EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning

Paper Authors

Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi

Paper Abstract

Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, help significantly improve cross-lingual downstream tasks. However, the use of a large amount of data or inefficient model architectures results in heavy computation to train a new model according to our preferred languages and domains. To resolve this issue, we introduce efficient and effective massively multilingual sentence embedding (EMS), using cross-lingual token-level reconstruction (XTR) and sentence-level contrastive learning as training objectives. Compared with related studies, the proposed model can be efficiently trained using significantly fewer parallel sentences and GPU computation resources. Empirical results showed that the proposed model yields significantly better or comparable results with regard to cross-lingual sentence retrieval, zero-shot cross-lingual genre classification, and sentiment classification. Ablation analyses demonstrated the efficiency and effectiveness of each component of the proposed model. We release the code for model training and the EMS pre-trained sentence embedding model, which supports 62 languages (https://github.com/Mao-KU/EMS).
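The training setup combines a cross-lingual token-level reconstruction (XTR) objective with sentence-level contrastive learning over parallel sentences. As a rough illustration of the latter, the sketch below implements a generic in-batch contrastive loss for a dual-encoder over aligned sentence pairs; the function name, temperature value, and symmetric formulation are assumptions for illustration only and are not necessarily EMS's exact objective (see the released code at the URL above for the actual implementation).

```python
import torch
import torch.nn.functional as F

def sentence_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """Generic in-batch sentence-level contrastive loss (illustrative, not EMS's exact loss).

    src_emb, tgt_emb: (batch, dim) embeddings of aligned source/target sentences.
    Each source sentence treats its aligned target as the positive and the
    other targets in the batch as negatives.
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature          # (batch, batch) cosine-similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: source-to-target and target-to-source retrieval.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```

With this kind of objective, parallel sentences from many languages can share one encoder, which is consistent with the abstract's claim that training needs fewer parallel sentences and less GPU computation than prior massively multilingual models.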
