Paper title
Characterizing Transactional Databases for Frequent Itemset Mining
Paper authors
Abstract
This paper presents a study of the characteristics of transactional databases used in frequent itemset mining. Such characterizations have typically been used to benchmark and understand the data mining algorithms that operate on these databases. The aim of our study is to give a picture of how diverse and representative these benchmarking databases are, both in general and in the context of particular empirical studies found in the literature. Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones. Our study shows that our list of metrics is able to capture much of the datasets' inner complexity and thus provides a good basis for the characterization of transactional datasets. Finally, based on our characterization, we provide a set of representative datasets that may safely be used as a benchmark.