Paper Title
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis
Paper Authors
Paper Abstract
State-of-the-art language model-based automatic metrics, e.g., BARTScore, benefit from large-scale contextualized pre-training and have been successfully applied to a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text generation. Recent studies show that considering both major errors (e.g., mistranslated tokens) and minor errors (e.g., imperfections in fluency) can produce high-quality human judgments. This inspires us to approach the ultimate goal of evaluation metrics, human-like evaluation, through automatic error analysis. To this end, we augment BARTScore by incorporating human-like error analysis strategies, yielding BARTScore++, whose final score combines the evaluations of both major and minor errors. Experimental results show that BARTScore++ consistently improves the performance of vanilla BARTScore and outperforms existing top-scoring metrics in 20 out of 25 test settings. We hope our technique can also be extended to other pre-trained model-based metrics. We will release our code and scripts to benefit the community.
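The abstract states that the final BARTScore++ score combines the evaluations of major and minor errors but does not specify how. Below is a minimal sketch of one plausible combination, assuming a simple weighted sum of two pre-computed scores; the function name, the weighting scheme, and the default weight are illustrative assumptions, not the paper's actual formulation.

# Hypothetical sketch: blending a major-error evaluation with a minor-error
# evaluation into a single quality score. Not the official BARTScore++ code.
def combined_score(major_error_score: float,
                   minor_error_score: float,
                   major_weight: float = 0.7) -> float:
    """Weighted combination of two evaluations.

    Both inputs are assumed to be log-likelihood-style scores
    (higher means better generation quality); major_weight is an
    assumed hyperparameter controlling their relative importance.
    """
    return major_weight * major_error_score + (1.0 - major_weight) * minor_error_score

# Example usage with made-up scores for a single hypothesis sentence.
print(combined_score(major_error_score=-1.2, minor_error_score=-0.4))

A weighted sum is only one possible design; the actual paper may combine the two evaluations differently (e.g., by re-scoring error spans directly within the BARTScore likelihood), so this sketch should be read as an illustration of the stated idea rather than an implementation of the method.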