Paper Title
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis
Paper Authors
Paper Abstract
State-of-the-art language model-based automatic metrics, e.g., BARTScore, benefit from large-scale contextualized pre-training and have been successfully applied to a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text generation. Recent studies show that considering both major errors (e.g., mistranslated tokens) and minor errors (e.g., imperfections in fluency) can produce high-quality human judgments. This inspires us to approach the ultimate goal of evaluation metrics, human-like evaluation, through automatic error analysis. To this end, we augment BARTScore by incorporating human-like error analysis strategies, yielding BARTScore++, whose final score combines the evaluations of both major and minor errors. Experimental results show that BARTScore++ consistently improves the performance of vanilla BARTScore and outperforms existing top-scoring metrics in 20 out of 25 test settings. We hope our technique can also be extended to other pre-trained model-based metrics. We will release our code and scripts to benefit the community.
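The abstract states that the final BARTScore++ score combines the evaluations of major and minor errors but does not specify how. Below is a minimal sketch of one plausible combination, assuming a simple weighted sum of two pre-computed scores; the function name, the weighting scheme, and the default weight are illustrative assumptions, not the paper's actual formulation.

# Hypothetical sketch: blending a major-error evaluation with a minor-error
# evaluation into a single quality score. Not the official BARTScore++ code.
def combined_score(major_error_score: float,
                   minor_error_score: float,
                   major_weight: float = 0.7) -> float:
    """Weighted combination of two evaluations.

    Both inputs are assumed to be log-likelihood-style scores
    (higher means better generation quality); major_weight is an
    assumed hyperparameter controlling their relative importance.
    """
    return major_weight * major_error_score + (1.0 - major_weight) * minor_error_score

# Example usage with made-up scores for a single hypothesis sentence.
print(combined_score(major_error_score=-1.2, minor_error_score=-0.4))

A weighted sum is only one possible design; the actual paper may combine the two evaluations differently (e.g., by re-scoring error spans directly within the BARTScore likelihood), so this sketch should be read as an illustration of the stated idea rather than an implementation of the method.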