A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics (Extended Version)

Paper title

A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics (Extended Version)

Authors

Anil Shanbhag, Samuel Madden, Xiangyao Yu

Abstract

There has been a significant amount of excitement and recent work on GPU-based database systems. Previous work has claimed that these systems can perform orders of magnitude better than CPU-based database systems on analytical workloads such as those found in decision support and business intelligence applications. A hardware expert would view these claims with suspicion. Given the general notion that database operators are memory-bandwidth bound, one would expect the maximum gain to be roughly equal to the ratio of the memory bandwidth of the GPU to that of the CPU. In this paper, we adopt a model-based approach to understand when and why the performance gains of running queries on GPUs versus CPUs vary from the bandwidth ratio (which is roughly 16x on modern hardware). We propose Crystal, a library of parallel routines that can be combined to run full SQL queries on a GPU with minimal materialization overhead. We implement individual query operators to show that while the speedups for selection, projection, and sorts are near the bandwidth ratio, joins achieve less speedup due to differences in hardware capabilities. Interestingly, we show on a popular analytical workload that the full-query performance gain from running on a GPU exceeds the bandwidth ratio, even though individual operators have speedups below the bandwidth ratio. This is a result of limitations in vectorizing chained operators on CPUs, and it yields a 25x speedup for GPUs over CPUs on the benchmark.
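The bandwidth-ratio argument in the abstract can be illustrated with a minimal sketch. It assumes an operator whose runtime is determined purely by streaming its input once from memory. The function name `scan_time_seconds` and both bandwidth figures are hypothetical, chosen only so that their ratio matches the roughly 16x quoted above; they are not measurements from the paper.

```python
def scan_time_seconds(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time for a purely bandwidth-bound operator to stream num_bytes once."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical bandwidths (assumptions, not values from the paper),
# picked so that GPU / CPU is 16x.
CPU_BANDWIDTH = 50e9    # bytes per second
GPU_BANDWIDTH = 800e9   # bytes per second

# Example input: one column of one billion 4-byte integers.
column_bytes = 4 * 1_000_000_000

cpu_time = scan_time_seconds(column_bytes, CPU_BANDWIDTH)
gpu_time = scan_time_seconds(column_bytes, GPU_BANDWIDTH)

# Under this model the expected speedup equals the bandwidth ratio,
# independent of the input size.
expected_speedup = cpu_time / gpu_time
print(f"CPU: {cpu_time:.4f} s, GPU: {gpu_time:.4f} s, "
      f"speedup: {expected_speedup:.1f}x")
```

Under this simple model the input size cancels out, so the predicted speedup is just the bandwidth ratio. The abstract's point is that measured results deviate from this baseline in both directions: joins fall below it, while full queries exceed it.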
