Paper Title
Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation
Authors
Abstract
Point cloud analytics is poised to become a key workload on battery-powered embedded and mobile platforms in a wide range of emerging application domains, such as autonomous driving, robotics, and augmented reality, where efficiency is paramount. This paper proposes Mesorasi, an algorithm-architecture co-designed system that simultaneously improves the performance and energy efficiency of point cloud analytics while retaining its accuracy. Our extensive characterizations of state-of-the-art point cloud algorithms show that, while structurally reminiscent of convolutional neural networks (CNNs), point cloud algorithms exhibit inherent compute and memory inefficiencies due to the unique characteristics of point cloud data. We propose delayed-aggregation, a new algorithmic primitive for building efficient point cloud algorithms. Delayed-aggregation hides the performance bottlenecks and reduces the compute and memory redundancies by exploiting the approximately distributive property of key operations in point cloud algorithms. Delayed-aggregation lets point cloud algorithms achieve a 1.6x speedup and a 51.1% energy reduction on a mobile GPU while retaining accuracy (ranging from a 0.9% loss to a 1.2% gain). To maximize the algorithmic benefits, we propose minor extensions to contemporary CNN accelerators, which can be integrated into a mobile System-on-a-Chip (SoC) without modifying other SoC components. With additional hardware support, Mesorasi achieves up to 3.6x speedup.
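The distributive property that delayed-aggregation relies on can be illustrated with a small sketch. The abstract does not name the operators involved, so the following is an assumption made for illustration: aggregation is modeled as subtracting a centroid from each of its neighbors, and the per-point operation is a single linear map `W`. All names (`aggregate_then_transform`, `transform_then_aggregate`, `groups`) are hypothetical. For a purely linear map the two orderings agree exactly; with a nonlinear activation they would agree only approximately, which matches the "approximately distributive" wording above.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [u - v for u, v in zip(a, b)]

def aggregate_then_transform(points, groups, W):
    """Baseline order: subtract the centroid first, then apply W per neighbor."""
    out, ops = {}, 0
    for c, nbrs in groups.items():
        out[c] = []
        for j in nbrs:
            out[c].append(matvec(W, sub(points[j], points[c])))
            ops += 1  # one matrix-vector product per (centroid, neighbor) pair
    return out, ops

def transform_then_aggregate(points, groups, W):
    """Delayed order: apply W once per point, then gather and subtract."""
    feats = [matvec(W, p) for p in points]
    ops = len(points)  # one matrix-vector product per input point
    out = {c: [sub(feats[j], feats[c]) for j in nbrs] for c, nbrs in groups.items()}
    return out, ops

random.seed(0)
points = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(8)]
W = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
# Hypothetical neighborhoods: 4 centroids, each with 4 (overlapping) neighbors.
groups = {c: [(c + k) % 8 for k in range(1, 5)] for c in range(0, 8, 2)}

base, base_ops = aggregate_then_transform(points, groups, W)
delayed, delayed_ops = transform_then_aggregate(points, groups, W)
max_diff = max(
    abs(a - b)
    for c in groups
    for u, v in zip(base[c], delayed[c])
    for a, b in zip(u, v)
)
print(base_ops, delayed_ops, max_diff < 1e-9)
```

Because neighborhoods overlap, the baseline order repeats the transform for every (centroid, neighbor) pair, while the delayed order transforms each point once and reuses the result. This toy example only shows where the redundancy comes from; it says nothing about the speedups reported in the abstract.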