Paper Title

On Survivorship Bias in MS MARCO

Authors

Prashansa Gupta, Sean MacAvaney

Abstract

Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the cases that produced negative outcomes. We observe that this bias could be present in the popular MS MARCO dataset: annotators could not find answers to 38–45% of the queries, so these queries are discarded during training and evaluation. Although we find that some discarded queries in MS MARCO are ill-defined or otherwise unanswerable, many are valid questions that could be answered had the collection been annotated more completely (around two thirds, using modern ranking techniques). This survivorship bias distorts the MS MARCO collection in several ways. We find that it affects the natural distribution of queries in terms of the type of information needed. When the collection is used for evaluation, we find that the bias likely yields a significant distortion of the absolute performance scores observed. Finally, given that MS MARCO is frequently used for model training, we train models on subsets of MS MARCO that simulate more survivorship bias. We find that models trained in this setting are up to 9.9% worse when evaluated on versions of the dataset with more complete annotations, and up to 3.5% worse at zero-shot transfer. Our findings are complementary to other recent suggestions for further annotation of MS MARCO, but with a focus on discarded queries. Code and data for reproducing the results of this paper are available in an online appendix.
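To make the selection effect concrete, the following is a minimal sketch of how dropping queries without a labeled answer shrinks a query set. It is not the authors' code: the function names, the `toy_qrels` dictionary, and the passage IDs are all hypothetical, and the toy data is invented purely for illustration.

```python
from typing import Dict, List, Tuple


def split_by_annotation(
    qrels: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Separate queries that have at least one labeled relevant passage
    from queries for which no answer was labeled (hypothetical helper)."""
    kept = {qid: passages for qid, passages in qrels.items() if passages}
    discarded = [qid for qid, passages in qrels.items() if not passages]
    return kept, discarded


def discard_rate(qrels: Dict[str, List[str]]) -> float:
    """Fraction of queries that would be dropped for lacking a label."""
    if not qrels:
        return 0.0
    _, discarded = split_by_annotation(qrels)
    return len(discarded) / len(qrels)


# Invented toy judgments: query ID -> passage IDs labeled as answering it.
toy_qrels = {
    "q1": ["p10"],
    "q2": [],            # annotator found no answer
    "q3": ["p7", "p9"],
    "q4": [],            # annotator found no answer
    "q5": ["p3"],
}

kept, discarded = split_by_annotation(toy_qrels)
rate = discard_rate(toy_qrels)
print(f"kept={sorted(kept)} discarded={discarded} rate={rate:.0%}")
```

Any statistic computed only over `kept` (query-type distribution, ranking metrics, training examples) ignores the discarded queries, which is the kind of distortion the abstract describes.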
