Paper Title

Evaluating Models' Local Decision Boundaries via Contrast Sets

Authors

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, Ben Zhou

Abstract

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets (up to 25% in some cases). We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
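
The evaluation protocol described in the abstract is easy to sketch in code. Below is a minimal, hypothetical Python example: `predict` is a deliberately naive stand-in classifier and the IMDb-style instances are invented for illustration, not taken from the released data. It scores accuracy on the original instances, accuracy on their perturbations, and a consistency metric in the spirit of the paper's contrast consistency, which credits a model only when it labels every instance in a contrast set correctly.

```python
# Minimal sketch of contrast-set evaluation (hypothetical code, not the
# paper's release). `predict` is a toy stand-in classifier.

def predict(text: str) -> str:
    """Toy sentiment 'model' that relies on a lexical shortcut."""
    return "positive" if "great" in text.lower() else "negative"

# Each contrast set pairs an original test instance with small manual
# perturbations that flip the gold label.
contrast_sets = [
    {
        "original": ("This movie was great, a must-see.", "positive"),
        "perturbed": [
            ("This movie was supposed to be great, but it wasn't.", "negative"),
        ],
    },
]

def evaluate(contrast_sets) -> None:
    orig_correct = pert_correct = consistent = 0
    n_orig = n_pert = 0
    for cs in contrast_sets:
        text, gold = cs["original"]
        orig_ok = predict(text) == gold
        orig_correct += orig_ok
        n_orig += 1
        set_ok = orig_ok
        for text, gold in cs["perturbed"]:
            ok = predict(text) == gold
            pert_correct += ok
            n_pert += 1
            set_ok = set_ok and ok
        # Consistency credits the model only when every instance in the
        # contrast set (original included) is labeled correctly.
        consistent += set_ok
    print(f"original accuracy: {orig_correct / n_orig:.2f}")   # 1.00 here
    print(f"contrast accuracy: {pert_correct / n_pert:.2f}")   # 0.00 here
    print(f"consistency:       {consistent / len(contrast_sets):.2f}")

evaluate(contrast_sets)
```

On this toy pair, the lexical shortcut that aces the original instance scores zero on its perturbation, which is exactly the kind of gap between test-set accuracy and a model's local decision boundary that contrast sets are designed to expose.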
