Paper Title
Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation
Paper Authors
Paper Abstract
Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision. Previous MMT systems mainly focus on better access and use of visual information, and tend to validate their methods on image-related datasets. These studies face two challenges. First, they can only utilize triplet data (bilingual text paired with images), which is scarce; second, current benchmarks are relatively restricted and do not correspond to realistic scenarios. Therefore, this paper establishes both new methods and a new dataset for MMT. First, we propose a framework, 2/3-Triplet, with two new approaches to enhance MMT by utilizing large-scale non-triplet data: monolingual image-text data and parallel text-only data. Second, we construct an English-Chinese e-commercial multimodal translation dataset (including training and test sets), named EMMT, whose test set is carefully curated to contain ambiguous words that would be mistranslated without the help of images. Experiments show that our method is better suited to real-world scenarios and can significantly improve translation performance by using more non-triplet data. In addition, our model rivals various SOTA models on conventional multimodal translation benchmarks.
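To make the data setting concrete, below is a minimal, hypothetical Python sketch of the three data pools the abstract distinguishes: scarce triplet data (source text, target text, image) versus the two abundant non-triplet pools (monolingual image-text pairs and parallel text-only pairs), mixed at batch-sampling time. All names and the sampling strategy here are illustrative assumptions for exposition, not the authors' actual 2/3-Triplet implementation.

```python
# Hypothetical sketch: mixing triplet and non-triplet data pools for MMT training.
# Field names, example data, and the uniform pool-sampling rule are assumptions.
from dataclasses import dataclass
from typing import Optional
import random

@dataclass
class Example:
    src_text: Optional[str] = None   # source-language sentence
    tgt_text: Optional[str] = None   # target-language sentence
    image: Optional[bytes] = None    # image payload (placeholder bytes here)

# Triplet data: bilingual text with an aligned image -- scarce in practice.
triplets = [Example("red dress", "红色连衣裙", b"<img>")]

# Non-triplet data a framework like 2/3-Triplet could additionally exploit:
mono_image_text = [Example(src_text="red dress", image=b"<img>")]      # monolingual image-text
parallel_text = [Example(src_text="red dress", tgt_text="红色连衣裙")]  # text-only parallel

def sample_example() -> Example:
    """Draw from one of the three pools so training is not limited to triplets."""
    pool = random.choice([triplets, mono_image_text, parallel_text])
    return random.choice(pool)

if __name__ == "__main__":
    print(sample_example())
```

In a real system the mixing ratio would be tuned rather than uniform, and each example type would drive a different loss term (e.g., translation loss when both texts are present, image-text alignment loss otherwise); the sketch only illustrates why non-triplet pools enlarge the usable training data.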