Paper Title

Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Authors

Ozan Caglayan, Pranava Madhyastha, Lucia Specia

Abstract

Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatically evaluate their models by demonstrating important failure cases on multiple datasets, language pairs and tasks. Our experiments show that metrics (i) usually prefer system outputs to human-authored texts, (ii) can be insensitive to correct translations of rare words, (iii) can yield surprisingly high scores when given a single sentence as system output for the entire test set.
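To make failure case (iii) concrete, below is a minimal sketch, not taken from the paper, of how one could score a degenerate "system" that returns the same fixed sentence for every test segment. It assumes the sacrebleu package; the reference sentences and the fixed output are hypothetical toy examples.

```python
# A minimal sketch (not from the paper) of failure case (iii):
# scoring a "system" that outputs one fixed sentence for every test segment.
# Assumes the `sacrebleu` package; `refs` is a hypothetical toy reference set.
import sacrebleu

refs = [
    "The committee approved the proposal yesterday.",
    "She said the results would be published next week.",
    "Prices rose sharply in the first quarter.",
]

# Degenerate system output: the same sentence repeated for every segment.
fixed_sentence = "The results are shown in the table below."
hyps = [fixed_sentence] * len(refs)

# corpus_bleu takes a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"Corpus BLEU of the single-sentence system: {bleu.score:.2f}")
```

On this toy data the score is of course meaningless; the paper's point is that on real test sets, a carefully chosen single sentence repeated as the entire system output can still receive surprisingly high scores under several popular corpus-level metrics.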
