论文标题

矢量且性能 - 可携带的QuickSort

Vectorized and performance-portable Quicksort

论文作者

Blacher, Mark, Giesen, Joachim, Sanders, Peter, Wassenberg, Jan

论文摘要

最近的工作表明,使用向量CPU指令的QuickSort实现可以超越广泛使用中的非矢量化算法。但是,这些实现通常是单线程,为特定的指令集实现,并仅限于一小部分关键类型。我们提出了这三个限制:我们提出的“ VQSort”算法集成到最先进的平行分类器“ IPS4O”中,几何平均速度为1.59。在四个平台上,在七个指令集(包括SVE和RISC-V V)上相同的实现作品。它还支持浮点和16-128位整数键。据我们所知,这是CPU上非校准键的最快排序,最大的20倍,是标准库中实现的排序算法的20倍。本文重点介绍了实现速度和便携性的实用工程方面,我们尚未看到这些方面是为了实施QuickSort。此外,我们引入了无需小阵列的注册排序的紧凑和无透骨排序网络,以及对矢量友好的枢轴采样策略,可抵抗对抗性输入。

Recent works showed that implementations of Quicksort using vector CPU instructions can outperform the non-vectorized algorithms in widespread use. However, these implementations are typically single-threaded, implemented for a particular instruction set, and restricted to a small set of key types. We lift these three restrictions: our proposed 'vqsort' algorithm integrates into the state-of-the-art parallel sorter 'ips4o', with a geometric mean speedup of 1.59. The same implementation works on seven instruction sets (including SVE and RISC-V V) across four platforms. It also supports floating-point and 16-128 bit integer keys. To the best of our knowledge, this is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries. This paper focuses on the practical engineering aspects enabling the speed and portability, which we have not yet seen demonstrated for a Quicksort implementation. Furthermore, we introduce compact and transpose-free sorting networks for in-register sorting of small arrays, and a vector-friendly pivot sampling strategy that is robust against adversarial input.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源