Paper Title
Addressing Exposure Bias With Document Minimum Risk Training: Cambridge at the WMT20 Biomedical Translation Task
Paper Authors
Paper Abstract
The 2020 WMT Biomedical translation task evaluated Medline abstract translations. This is a small-domain translation task, meaning limited relevant training data with very distinct style and vocabulary. Models trained on such data are susceptible to exposure bias effects, particularly when training sentence pairs are imperfect translations of each other. This can result in poor behaviour during inference if the model learns to neglect the source sentence. The UNICAM entry addresses this problem during fine-tuning using a robust variant of Minimum Risk Training. We contrast this approach with data filtering to remove 'problem' training examples. Under MRT fine-tuning we obtain good results for both directions of English-German and English-Spanish biomedical translation. In particular, we achieve the best English-to-Spanish translation result and second-best Spanish-to-English result, despite using only single models with no ensembling.
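For intuition, the Minimum Risk Training objective mentioned in the abstract minimizes the expected risk (e.g. 1 minus sentence-level BLEU) over translations sampled from the model, rather than the likelihood of a single reference. The sketch below illustrates the standard MRT loss for one source sentence; it is not the authors' code, and the names (`mrt_loss`, `risks`, the smoothing factor `alpha`) and the toy values are illustrative assumptions only.

```python
import torch

def mrt_loss(log_probs: torch.Tensor, risks: torch.Tensor,
             alpha: float = 0.005) -> torch.Tensor:
    """Minimum Risk Training loss over N sampled translations of one source.

    log_probs: (N,) model log-probabilities of each sampled translation.
    risks:     (N,) risk per sample, e.g. 1 - sentence-BLEU vs. the reference.
    alpha:     sharpness of the renormalized sample distribution Q.
    """
    # Q(y_i) is proportional to p(y_i | x)^alpha, renormalized over the N samples.
    q = torch.softmax(alpha * log_probs, dim=0)
    # Expected risk under Q; minimizing it shifts probability mass away
    # from high-risk (low-BLEU) samples, using the metric as a training signal.
    return (q * risks).sum()

# Toy usage: three samples, the first closest to the reference.
log_probs = torch.tensor([-2.0, -3.5, -4.0], requires_grad=True)
risks = torch.tensor([0.1, 0.6, 0.9])  # 1 - sentBLEU; lower is better
loss = mrt_loss(log_probs, risks)
loss.backward()
print(loss.item(), log_probs.grad)
```

Because the risk is computed against the reference but the hypotheses are the model's own samples, this kind of objective gives no reward for hallucinated output that ignores the source, which is how MRT fine-tuning can mitigate the exposure-bias failures the abstract describes.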