Paper Title

Modifying the Asynchronous Jacobi Method for Data Corruption Resilience

Authors

Vogl, Christopher J.; Atkins, Zachary; Fox, Alyson; Miedlar, Agnieszka; Ponce, Colin

Abstract

Moving scientific computation from high-performance computing (HPC) and cloud computing (CC) environments to devices on the edge, i.e., physically near instruments of interest, has received tremendous interest in recent years. Such edge computing environments can operate on data in situ, offering enticing benefits over data aggregation to HPC and CC facilities, including avoided transmission costs, increased data privacy, and real-time data analysis. Because of the inherent unreliability of edge computing environments, new fault-tolerant approaches must be developed before the benefits of edge computing can be realized. Motivated by algorithm-based fault tolerance, a variant of the asynchronous Jacobi (ASJ) method is developed that achieves resilience to data corruption by rejecting solution approximations from neighbor devices according to a bound derived from convergence theory. Numerical results on a two-dimensional Poisson problem show that the new rejection criterion, along with a novel approximation to the shortest path length on which the criterion depends, restores convergence for the ASJ variant in the presence of certain types of data corruption. Numerical results are also obtained for the case where the singular values in the analytic bound are approximated. A linear system with a denser sparsity pattern is also explored. All results indicate that successful resilience to data corruption depends on whether the bound tightens fast enough to reject corrupted data before the iteration evolution deviates significantly from that predicted by the convergence theory defining the bound. This observation generalizes to future work on algorithm-based fault tolerance for other asynchronous algorithms, including upcoming approaches that leverage Krylov subspaces.
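The idea of rejecting neighbor data that violates a convergence-theory bound can be illustrated with a toy sketch. This is not the criterion from the paper: it assumes a one-dimensional Poisson system (tridiag(-1, 2, -1) with a right-hand side of ones) instead of the two-dimensional problem, synchronous sweeps instead of asynchronous ones, and a simple bound |x_j^k - x_j^(k-1)| <= rho^(k-1) ||x^1 - x^0||_2, where rho = cos(pi/(n+1)) is the 2-norm of the Jacobi iteration matrix for this system. The function names, the accumulated `allowance`, the `slack` factor, and the one-time corruption of 1e6 are all hypothetical choices made for illustration.

```python
import math


def exact_1d_poisson(n):
    """Exact solution of tridiag(-1, 2, -1) x = 1 with n unknowns."""
    return [(i + 1) * (n - i) / 2.0 for i in range(n)]


def l2_error(x, y):
    """Euclidean distance between two equally sized lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jacobi_1d_poisson(n, iters, corrupt=None, reject=True, slack=2.0):
    """Jacobi sweeps on the 1D Poisson system with optional message rejection.

    corrupt: optional (iteration, index) pair; that one message gets 1e6 added.
    slack:   hypothetical safety factor multiplying the theoretical bound.
    Returns (x, number_of_rejected_messages).
    """
    rho = math.cos(math.pi / (n + 1))   # 2-norm of the Jacobi iteration matrix
    step0 = 0.5 * math.sqrt(n)          # ||x^1 - x^0||_2 for x^0 = 0 and b = 1
    x = [0.0] * n
    received = x[:]                     # last accepted value of each component
    allowance = [0.0] * n               # bound accumulated since the last accept
    n_rejected = 0
    for k in range(iters):
        msg = x[:]                      # values each component "sends"
        if corrupt is not None and corrupt[0] == k:
            msg[corrupt[1]] += 1e6      # one-time corruption of a message
        for j in range(n):
            if k > 0:
                # Allowed change grows by the bound on one Jacobi step.
                allowance[j] += slack * step0 * rho ** (k - 1)
            if reject and abs(msg[j] - received[j]) > allowance[j]:
                n_rejected += 1         # keep the stale value instead
                continue
            received[j] = msg[j]
            allowance[j] = 0.0
        # Jacobi update: x_i = (1 + x_{i-1} + x_{i+1}) / 2 with zero boundaries.
        x = [
            (1.0
             + (received[i - 1] if i > 0 else 0.0)
             + (received[i + 1] if i < n - 1 else 0.0)) / 2.0
            for i in range(n)
        ]
    return x, n_rejected


if __name__ == "__main__":
    n = 8
    exact = exact_1d_poisson(n)
    for use_rejection in (True, False):
        x, n_rej = jacobi_1d_poisson(n, 300, corrupt=(5, 3), reject=use_rejection)
        print(use_rejection, n_rej, l2_error(x, exact))
```

In this toy setting the corrupted message is far outside the bound and is discarded, so the iteration continues with a stale value for one sweep, while the run without rejection has to recover from a very large perturbation. Because the bound shrinks geometrically, a corruption that arrives late is tested against a tighter tolerance than one that arrives early, which mirrors the abstract's point that resilience depends on how quickly the bound tightens.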
