Paper Title
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Paper Authors
Paper Abstract
In this work, we explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics: stress tests with synthetic data. Basically, we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language models, for the tasks of open-ended generation, translation, and summarization. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics. For example, we find that BERTScore is confused by truncation errors in summarization, and MAUVE (built on top of GPT-2) is insensitive to errors at the beginning or middle of generations. Further, we investigate the reasons behind these blind spots and suggest practical workarounds for a more reliable evaluation of text generation. We have released our code and data at https://github.com/cloudygoose/blindspot_nlg.
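To make the stress-test idea concrete, the following is a minimal, illustrative sketch, not the paper's released test suite (that is available at the repository linked above). It synthesizes a truncation error for a candidate summary and checks whether BERTScore's score drops commensurately. It assumes the `bert_score` Python package is installed; the example texts and the `truncate_last_sentence` helper are hypothetical choices made for illustration.

# Minimal sketch of the synthetic-error stress test described in the abstract.
# Assumptions: the `bert_score` package is installed; the texts and the
# truncation perturbation below are illustrative, not the paper's test suite.
from bert_score import score


def truncate_last_sentence(text: str) -> str:
    """Synthesize a truncation error by dropping the final sentence."""
    sentences = [s for s in text.split(". ") if s]
    return ". ".join(sentences[:-1]) + "." if len(sentences) > 1 else text


reference = (
    "The city council approved the new budget on Monday. "
    "The plan increases funding for public transit. "
    "Officials expect the changes to take effect next year."
)
clean_candidate = reference                               # an error-free "generation"
truncated_candidate = truncate_last_sentence(reference)   # synthetic truncation error

# Score both candidates against the same reference with BERTScore.
_, _, f1 = score(
    [clean_candidate, truncated_candidate],
    [reference, reference],
    lang="en",
    verbose=False,
)
clean_f1, truncated_f1 = f1.tolist()

# A robust metric should penalize the truncated candidate noticeably;
# a negligible drop would point to a blind spot of the kind studied in the paper.
print(f"BERTScore F1 (clean):     {clean_f1:.4f}")
print(f"BERTScore F1 (truncated): {truncated_f1:.4f}")
print(f"Score drop:               {clean_f1 - truncated_f1:.4f}")

The same pattern generalizes to other perturbations (e.g., corrupting the beginning or middle of a generation) and other metrics such as MAUVE: apply the synthetic error, rescore, and compare the drop against the severity of the injected error.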