Title
a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs
Authors
Abstract
Tucker decomposition is one of the most popular models for analyzing and compressing large-scale tensorial data. Existing Tucker decomposition algorithms usually rely on a single solver to compute the factor matrices and the core tensor, and are not flexible enough to adapt to the diversity of input data and hardware. Moreover, to exploit highly efficient GEMM kernels, most Tucker decomposition implementations make use of explicit matricizations, which can introduce extra costs in data conversion and memory usage. In this paper, we present a-Tucker, a new framework for input-adaptive and matricization-free Tucker decomposition of dense tensors. A mode-wise flexible Tucker decomposition algorithm is proposed to enable switching among different solvers for the factor matrices and the core tensor, and a machine-learning-based adaptive solver selector is applied to automatically cope with variations in both the input data and the hardware. To further improve performance and memory efficiency, we implement a-Tucker in a fully matricization-free manner, without any conversion between tensors and matrices. Experiments with a variety of synthetic and real-world tensors show that a-Tucker substantially outperforms existing work on both CPUs and GPUs.
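To illustrate the distinction the abstract draws between matricization-based and matricization-free computation, the following minimal NumPy sketch (not from the paper; all names and shapes are hypothetical) computes the same mode-0 tensor-times-matrix product two ways: once via an explicit unfolding followed by a GEMM, and once as a direct tensor contraction that never reshapes the tensor into a matrix.

```python
import numpy as np

# Hypothetical illustration: a mode-0 tensor-times-matrix (TTM) product,
# the core kernel inside Tucker decomposition, computed two ways.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))   # dense 3-way input tensor
U = rng.standard_normal((3, 4))      # factor matrix for mode 0

# (1) Matricization-based: unfold X along mode 0 (a data-layout
#     conversion), run a GEMM, then fold the result back into a tensor.
X0 = X.reshape(4, -1)                # mode-0 unfolding (row-major order)
Y_mat = (U @ X0).reshape(3, 5, 6)

# (2) Matricization-free: contract U directly against the mode-0 index
#     of X, with no tensor-to-matrix conversion.
Y_free = np.einsum('ia,ajk->ijk', U, X)

# Both routes produce the same tensor.
assert np.allclose(Y_mat, Y_free)
```

The two results agree; the difference lies in the intermediate data movement, which is the overhead a matricization-free implementation avoids.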