Paper Title

Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages

Paper Authors

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Paper Abstract

Back-translation is widely known for its effectiveness in neural machine translation when there is little to no parallel data. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The target-to-source model generates noisy sources, while the source-to-target model is trained to reconstruct the targets and vice versa. Recent developments of multilingual pre-trained sequence-to-sequence models for programming languages have been very effective for a broad spectrum of downstream software engineering tasks. Hence, training them to build programming language translation systems via back-translation is compelling. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate natural language (NL) summaries given code snippets. In code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as a target-to-NL-to-source generation. We show that our proposed approach performs competitively with state-of-the-art methods. We have made the code publicly available.
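To make the proposed training scheme concrete, below is a minimal Python sketch of one back-translation step performed via summarization and generation, as described in the abstract. The `Seq2SeqModel` stub and its `generate`/`train_step` methods are hypothetical stand-ins for a multilingual pre-trained encoder-decoder (e.g., a PLBART-style model), not the authors' released code.

```python
"""Minimal sketch of back-translation via code summarization and generation (SG).

Seq2SeqModel is a hypothetical stand-in for a pre-trained sequence-to-sequence
model; the method names are illustrative, not the authors' released API.
"""

from typing import List


class Seq2SeqModel:
    """Stub standing in for a multilingual pre-trained encoder-decoder."""

    def generate(self, inputs: List[str], target_lang: str) -> List[str]:
        # A real model would decode sequences in `target_lang`; the stub
        # just tags its inputs so the loop below executes end to end.
        return [f"<{target_lang}> {x}" for x in inputs]

    def train_step(self, sources: List[str], targets: List[str]) -> float:
        # A real model would run one gradient update on (source, target)
        # pairs and return the training loss; the stub returns 0.0.
        return 0.0


def sg_back_translation_step(
    summarizer: Seq2SeqModel,   # code -> NL summary
    generator: Seq2SeqModel,    # NL summary -> code
    translator: Seq2SeqModel,   # the translation model being trained
    target_code: List[str],     # monolingual snippets in the target language
    source_lang: str,
) -> float:
    # Target-to-source generation, decomposed as target -> NL -> source:
    summaries = summarizer.generate(target_code, target_lang="en")
    noisy_source = generator.generate(summaries, target_lang=source_lang)

    # Standard back-translation objective: train the translator to
    # reconstruct the original target from the noisy synthetic source.
    return translator.train_step(noisy_source, target_code)


if __name__ == "__main__":
    model = Seq2SeqModel()
    java_snippets = ["public int add(int a, int b) { return a + b; }"]
    loss = sg_back_translation_step(model, model, model, java_snippets, "python")
    print(f"reconstruction loss: {loss:.4f}")
```

In practice this step alternates over monolingual corpora of both languages ("and vice versa" in the abstract), so the translator is trained in both directions from the synthetic parallel pairs.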
