Paper Title

What is the Value of Data? On Mathematical Methods for Data Quality Estimation

Authors

Netanel Raviv, Siddharth Jain, Jehoshua Bruck

Abstract

Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and has recently found applications in active learning. We focus on Boolean hyperplanes, and utilize a collection of Fourier analytic, algebraic, and probabilistic methods to come up with theoretical guarantees and practical solutions for the computation of the expected diameter. We also study the behaviour of the expected diameter on algebraically structured datasets, conduct experiments that validate this notion of quality, and demonstrate the feasibility of our techniques.
