Paper Title
Batch Evaluation Metrics in Information Retrieval: Measures, Scales, and Meaning
Paper Authors
Paper Abstract
A sequence of recent papers has considered the role of measurement scales in information retrieval (IR) experimentation, and presented the argument that (only) uniform-step interval scales should be used, and hence that well-known metrics such as reciprocal rank, expected reciprocal rank, normalized discounted cumulative gain, and average precision, should be either discarded as measurement tools, or adapted so that their metric values lie at uniformly-spaced points on the number line. These papers paint a rather bleak picture of past decades of IR evaluation, at odds with the community's overall emphasis on practical experimentation and measurable improvement. Our purpose in this work is to challenge that position. In particular, we argue that mappings from categorical and ordinal data to sets of points on the number line are valid provided there is an external reason for each target point to have been selected. We first consider the general role of measurement scales, and of categorical, ordinal, interval, ratio, and absolute data collections. In connection with the first two of those categories we also provide examples of the knowledge that is captured and represented by numeric mappings to the real number line. Focusing then on information retrieval, we argue that document rankings are categorical data, and that the role of an effectiveness metric is to provide a single value that represents the usefulness to a user or population of users of any given ranking, with usefulness able to be represented as a continuous variable on a ratio scale. That is, we argue that current IR metrics are well-founded, and, moreover, that those metrics are more meaningful in their current form than in the proposed "intervalized" versions.
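The abstract refers to several effectiveness metrics by name. As background only (these definitions are standard textbook formulations, not taken from the paper itself), a minimal sketch of three of them over a binary-relevance ranking might look like:

```python
import math

def reciprocal_rank(rels):
    """Reciprocal rank: 1 / position of the first relevant document (0 if none)."""
    for i, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / i
    return 0.0

def average_precision(rels):
    """Average precision: mean of precision@k over the relevant positions."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg(rels):
    """Normalized discounted cumulative gain with log2 rank discounting."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = sum(r / math.log2(i + 1)
                for i, r in enumerate(sorted(rels, reverse=True), start=1))
    return dcg / ideal if ideal else 0.0

# Hypothetical relevance judgements for a four-document ranking.
ranking = [0, 1, 0, 1]
print(reciprocal_rank(ranking))    # 0.5
print(average_precision(ranking))  # 0.5
```

Each function maps an entire ranking to a single score, which is the sense in which the abstract describes a metric as summarizing the usefulness of a ranking to a user as one number.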