Paper title
Selecting Sub-tables for Data Exploration
Paper authors
Paper abstract
We present a framework for creating small, informative sub-tables of large data tables to facilitate the first step of data science: data exploration. Given a large data table T, the goal is to create a sub-table of small, fixed dimensions by selecting a subset of T's rows and projecting them over a subset of T's columns. The question is: which rows and columns should be selected to yield an informative sub-table? We formalize the notion of "informativeness" based on two complementary metrics: cell coverage, which measures how well the sub-table captures prominent association rules in T, and diversity. Since computing optimal sub-tables using these metrics is shown to be infeasible, we give an efficient algorithm that indirectly accounts for association rules using table embedding. The resulting framework can be used for visualizing the complete sub-table, as well as for displaying the results of queries over the sub-table, enabling the user to quickly understand the results and determine subsequent queries. Experimental results show that we can efficiently compute high-quality sub-tables, as measured by our metrics and by feedback from user studies.