Paper Title
Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering
Paper Authors
Paper Abstract
Matrix decomposition is one of the fundamental tools for discovering knowledge from the big data generated by modern applications. However, processing very big data with such a method on a single machine remains inefficient or infeasible. Moreover, big data are often collected and stored in a distributed manner across different machines, so such data generally bear strong heterogeneous noise. It is therefore essential and useful to develop distributed matrix decomposition for big data analytics. Such a method should scale well, model the heterogeneous noise, and address the communication issue in a distributed system. To this end, we propose a distributed Bayesian matrix decomposition model (DBMD) for big data mining and clustering. Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference. We investigate the theoretical convergence behaviors of these algorithms. To address the heterogeneity of the noise, we propose an optimal plug-in weighted average that reduces the variance of the estimation. Synthetic experiments validate our theoretical results, and real-world experiments show that our algorithms scale well to big data and achieve superior or competitive performance compared with other distributed methods.
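The abstract mentions an optimal plug-in weighted average for combining estimates computed on machines with heterogeneous noise. A minimal sketch of the underlying idea, assuming the classical inverse-variance weighting rule (weights proportional to the reciprocal of each machine's estimated noise variance minimize the variance of the combined estimate); the function name and data shapes here are hypothetical, not the paper's actual API:

```python
import numpy as np

def weighted_average(estimates, noise_vars):
    """Combine per-machine estimates of the same factor matrix.

    estimates:  list of equally shaped arrays, one per worker.
    noise_vars: estimated noise variance on each worker; noisier
                workers receive proportionally smaller weights.
    """
    w = 1.0 / np.asarray(noise_vars, dtype=float)  # inverse-variance weights
    w /= w.sum()                                   # normalize to sum to 1
    return sum(wi * Xi for wi, Xi in zip(w, estimates))

# Three workers estimate the same 2x2 factor; the worker with
# noise variance 4.0 contributes the least to the combined estimate.
ests = [np.array([[1.0, 0.0], [0.0, 1.0]]),
        np.array([[1.2, 0.1], [-0.1, 0.9]]),
        np.array([[0.9, -0.2], [0.2, 1.1]])]
avg = weighted_average(ests, noise_vars=[1.0, 1.0, 4.0])
```

With equal noise variances this reduces to the plain mean; unequal variances shift weight toward the cleaner machines, which is what reduces the variance of the aggregated estimate.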