Paper Title

Quantified Reproducibility Assessment of NLP Results

Authors

Anya Belz, Maja Popović, Simon Mille

Abstract

This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and allows conclusions to be drawn about what changes to system and/or evaluation design might lead to improved reproducibility.
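
The abstract does not give the QRA formula itself, only that a single degree-of-reproducibility score is derived from the scores of, and differences between, reproductions. As a rough illustration under that assumption, the Python sketch below computes a spread-based score (a small-sample-corrected coefficient of variation) over an original result and its reproduction results, so that smaller values indicate closer agreement. The function name reproducibility_score, the correction factor, and the example numbers are illustrative assumptions, not the paper's exact definition.

# Hypothetical sketch: degree of reproducibility as a small-sample-corrected
# coefficient of variation over an original score and its reproduction scores.
# The exact QRA formula is not given in the abstract; this only illustrates the
# idea of a single score derived from the spread of repeated measurements.

from statistics import mean, stdev

def reproducibility_score(scores: list[float]) -> float:
    """Return a percentage coefficient of variation for the given scores.

    Smaller values mean the measurements agree more closely, i.e. a higher
    degree of reproducibility for this system and evaluation measure.
    """
    if len(scores) < 2:
        raise ValueError("Need the original score plus at least one reproduction.")
    m = mean(scores)
    s = stdev(scores)                            # sample standard deviation (n - 1)
    cv = 100.0 * s / abs(m)                      # coefficient of variation, in percent
    correction = 1.0 + 1.0 / (4 * len(scores))   # common small-sample correction (assumption)
    return cv * correction

# Example: an original BLEU score and three reproduction scores (made-up numbers).
print(round(reproducibility_score([27.4, 27.1, 26.8, 27.9]), 2))

Because the score is normalised by the mean, values computed for reproductions of different original studies can be compared on the same scale, which matches the abstract's claim of comparability across studies.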
