Paper Title

Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation

Authors

Wibowo, Haryo Akbarianto; Prawiro, Tatag Aziz; Ihsan, Muhammad; Aji, Alham Fikri; Prasojo, Radityo Eko; Mahendra, Rahmad; Fitriany, Suci

Abstract

In daily use, the Indonesian language is riddled with informality, that is, deviations from the standard in vocabulary, spelling, and word order. Currently available Indonesian NLP models, on the other hand, are typically developed with standard Indonesian in mind. In this work, we address style transfer from informal to formal Indonesian as a low-resource machine translation problem. We build a new dataset of parallel sentences in informal Indonesian and their formal counterparts. We benchmark several strategies for performing style transfer from informal to formal Indonesian. We also explore augmenting the training set with artificial forward-translated data. Because we are dealing with an extremely low-resource setting, we find that a phrase-based machine translation approach outperforms the Transformer-based approach. Alternatively, a pre-trained GPT-2 fine-tuned on this task performs equally well but at a higher computational cost. Our findings show a promising step towards leveraging machine translation models for style transfer. Our code and data are available at https://github.com/haryoa/stif-indonesia
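The augmentation idea mentioned in the abstract, training on a small parallel set and then adding forward-translated (synthetic) pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the word-substitution "model", the function names `train_word_map` and `iterative_forward_translation`, and the example sentences are all hypothetical stand-ins chosen to keep the loop runnable without an MT toolkit.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (informal sentence, formal sentence)
Translator = Callable[[str], str]


def train_word_map(pairs: List[Pair]) -> Translator:
    """Toy stand-in for an MT model (an assumption for illustration):
    learns word-for-word substitutions from pairs of equal token length."""
    table: Dict[str, str] = {}
    for informal, formal in pairs:
        src, tgt = informal.split(), formal.split()
        if len(src) == len(tgt):
            for s, t in zip(src, tgt):
                table.setdefault(s, t)  # keep the first mapping seen

    def translate(sentence: str) -> str:
        # Unknown words are copied through unchanged.
        return " ".join(table.get(w, w) for w in sentence.split())

    return translate


def iterative_forward_translation(
    parallel: List[Pair],
    monolingual: List[str],
    train: Callable[[List[Pair]], Translator],
    rounds: int = 2,
) -> Translator:
    """Train on real pairs, translate unlabeled informal text with the
    current model, then retrain on real plus synthetic pairs; repeat."""
    model = train(parallel)
    for _ in range(rounds):
        synthetic = [(s, model(s)) for s in monolingual]
        model = train(parallel + synthetic)
    return model


# Hypothetical data: one real pair and one unlabeled informal sentence.
parallel = [("gue nggak tau", "saya tidak tahu")]
monolingual = ["gue nggak suka"]

model = iterative_forward_translation(parallel, monolingual, train_word_map)
print(model("gue nggak suka"))
```

With a toy substitution table the synthetic pairs add little; the loop structure is the point. In a realistic setting, `train` would be replaced by a routine that fits a phrase-based or neural translation system.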
