Paper Title
Do Explicit Alignments Robustly Improve Multilingual Encoders?
Paper Authors
Paper Abstract
Multilingual BERT (mBERT), XLM-RoBERTa (XLMR), and other unsupervised multilingual encoders can effectively learn cross-lingual representations. Explicit alignment objectives based on bitexts such as Europarl or MultiUN have been shown to further improve these representations. However, word-level alignments are often suboptimal, and such bitexts are unavailable for many languages. In this paper, we propose a new contrastive alignment objective that can better utilize such signals, and examine whether these previous alignment methods can be adapted to noisier sources of aligned data: a randomly sampled 1-million-pair subset of the OPUS collection. Additionally, rather than reporting results on a single dataset with a single model run, we report the mean and standard deviation of multiple runs with different seeds, on four datasets and tasks. Our more extensive analysis finds that, while our new objective outperforms previous work, overall these methods do not improve performance under a more robust evaluation framework. Furthermore, the gains from using a better underlying model eclipse any benefits from alignment training. These negative results call for more care in evaluating these methods and suggest limitations in applying explicit alignment objectives.
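The abstract does not spell out the form of the contrastive alignment objective. As an illustration only, a generic InfoNCE-style contrastive loss over word-aligned embedding pairs (with in-batch negatives, a cosine-similarity scoring function, and a temperature hyperparameter, all of which are assumptions here rather than the authors' exact formulation) might look like the following minimal sketch:

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(src_vecs: torch.Tensor,
                               tgt_vecs: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Illustrative InfoNCE-style loss over aligned word pairs.

    src_vecs, tgt_vecs: (n, d) contextual embeddings where row i of
    src_vecs is word-aligned to row i of tgt_vecs (e.g., pairs produced
    by an automatic word aligner on bitext). All other rows in the batch
    serve as negatives. This is a sketch, not the paper's objective.
    """
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    # (n, n) matrix of scaled cosine similarities between all pairs
    logits = src @ tgt.t() / temperature
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: pull aligned pairs together in both
    # source-to-target and target-to-source directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Under this kind of objective, each aligned word pair is treated as a positive and the remaining pairs in the batch as negatives, which is one common way to exploit noisy word-level alignment signal without requiring it to be exact.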