Paper Title

Analysis and Optimization of GNN-Based Recommender Systems on Persistent Memory

Authors

Yuwei Hu, Jiajie Li, Zhongming Yu, Zhiru Zhang

Abstract

Graph neural networks (GNNs), which have emerged as an effective method for handling machine learning tasks on graphs, bring a new approach to building recommender systems, where the task of recommendation can be formulated as the link prediction problem on user-item bipartite graphs. Training GNN-based recommender systems (GNNRecSys) on large graphs incurs a large memory footprint, easily exceeding the DRAM capacity on a typical server. Existing solutions resort to distributed subgraph training, which is inefficient due to the high cost of dynamically constructing subgraphs and significant redundancy across subgraphs. The emerging persistent memory technologies provide a significantly larger memory capacity than DRAMs at an affordable cost, making single-machine GNNRecSys training feasible, which eliminates the inefficiencies in distributed training. One major concern of using persistent memory devices for GNNRecSys is their relatively low bandwidth compared with DRAMs. This limitation can be particularly detrimental to achieving high performance for GNNRecSys workloads since their dominant compute kernels are sparse and memory access intensive. To understand whether persistent memory is a good fit for GNNRecSys training, we perform an in-depth characterization of GNNRecSys workloads and a comprehensive analysis of their performance on a persistent memory device, namely, Intel Optane. Based on the analysis, we provide guidance on how to configure Optane for GNNRecSys workloads. Furthermore, we present techniques for large-batch training to fully realize the advantages of single-machine GNNRecSys training. Our experiment results show that with the tuned batch size and optimal system configuration, Optane-based single-machine GNNRecSys training outperforms distributed training by a large margin, especially when handling deep GNN models.
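The abstract formulates recommendation as link prediction on a user-item bipartite graph. The sketch below, written in plain PyTorch, illustrates that formulation only; it is not the paper's code, and the two-layer structure, embedding size, and dot-product edge scorer are illustrative assumptions. Note that the sparse matrix multiplications (`torch.sparse.mm`) are exactly the kind of sparse, memory-access-intensive kernels the abstract identifies as dominant in GNNRecSys workloads.

```python
# A minimal, self-contained sketch (not the paper's implementation) of
# recommendation as link prediction on a user-item bipartite graph.
# All layer sizes and the dot-product scorer are illustrative choices.
import torch
import torch.nn as nn


class BipartiteGNNLayer(nn.Module):
    """One round of message passing across the user-item bipartite graph."""

    def __init__(self, dim):
        super().__init__()
        self.lin_u = nn.Linear(dim, dim)  # transform messages arriving at user nodes
        self.lin_i = nn.Linear(dim, dim)  # transform messages arriving at item nodes

    def forward(self, adj, h_user, h_item):
        # adj: sparse (num_users x num_items) interaction matrix.
        # Users aggregate from their interacted items, and vice versa;
        # these sparse SpMM kernels dominate memory traffic.
        new_u = torch.relu(self.lin_u(torch.sparse.mm(adj, h_item)))
        new_i = torch.relu(self.lin_i(torch.sparse.mm(adj.t(), h_user)))
        return new_u, new_i


class GNNRecSys(nn.Module):
    def __init__(self, num_users, num_items, dim=64, num_layers=2):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.layers = nn.ModuleList(
            [BipartiteGNNLayer(dim) for _ in range(num_layers)]
        )

    def forward(self, adj, user_ids, item_ids):
        h_u, h_i = self.user_emb.weight, self.item_emb.weight
        for layer in self.layers:
            h_u, h_i = layer(adj, h_u, h_i)
        # Link prediction: score each candidate (user, item) edge by the
        # dot product of the final user and item representations.
        return (h_u[user_ids] * h_i[item_ids]).sum(dim=-1)


# Example usage with a toy interaction matrix:
# adj = torch.sparse_coo_tensor(indices, torch.ones(nnz), (num_users, num_items))
# scores = GNNRecSys(num_users, num_items)(adj, user_ids, item_ids)
```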
