Paper Title

Estimating Silent Data Corruption Rates Using a Two-Level Model

Authors

Hari, Siva Kumar Sastry; Rech, Paolo; Tsai, Timothy; Stephenson, Mark; Zulfiqar, Arslan; Sullivan, Michael; Shirvani, Philip; Racunas, Paul; Emer, Joel; Keckler, Stephen W.

Abstract

Architects of high-performance and safety-critical systems must accurately evaluate the application-level silent data corruption (SDC) rates that soft errors cause in processors. Such an evaluation requires tracking error propagation all the way from particle strikes on low-level state up to the program output. Existing approaches that rely on low-level simulations with fault injection cannot evaluate full applications because they are too slow, while application-level fault testing in accelerated particle beams is often impractical. We present a new two-level methodology for application resilience evaluation that overcomes these challenges. The proposed approach decomposes application failure rate estimation into (1) identifying how particle strikes in low-level unprotected state manifest at the architecture level, and (2) measuring how such architecture-level manifestations propagate to the program output. We demonstrate the effectiveness of this approach on GPU architectures. We also show that using just one of the two steps can overestimate SDC rates and produce different trends; the composition of the two is needed for accurate reliability modeling.
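The two-step decomposition in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the manifestation labels, and every number below are hypothetical, and the sketch assumes that each architecture-level manifestation contributes independently, so that the application-level SDC rate is the raw fault rate times the sum over manifestations of (probability a strike produces it) times (probability it reaches the program output).

```python
# Hypothetical sketch of composing the two levels; names and numbers are
# illustrative assumptions, not values or APIs from the paper.

def two_level_sdc_rate(raw_fault_rate, manifestation_probs, propagation_probs):
    """Estimate an application-level SDC rate by composing two levels.

    raw_fault_rate: assumed rate of particle-induced faults in unprotected
        low-level state (any consistent unit, e.g. FIT).
    manifestation_probs: level 1, maps each architecture-level manifestation
        to the probability that a fault produces it.
    propagation_probs: level 2, maps each manifestation to the probability
        that it propagates to the program output as an SDC.
    """
    missing = set(manifestation_probs) - set(propagation_probs)
    if missing:
        raise KeyError(f"no propagation probability for: {sorted(missing)}")
    # Weight each manifestation by how often it reaches the output.
    composed = sum(
        prob * propagation_probs[kind]
        for kind, prob in manifestation_probs.items()
    )
    return raw_fault_rate * composed


# Made-up inputs for illustration only.
level1 = {"wrong_register_value": 0.2, "wrong_memory_address": 0.05}
level2 = {"wrong_register_value": 0.5, "wrong_memory_address": 0.1}

composed_rate = two_level_sdc_rate(100.0, level1, level2)
# Using level 1 alone treats every manifestation as an SDC.
level1_only_rate = 100.0 * sum(level1.values())
print(composed_rate, level1_only_rate)
```

With these made-up inputs the level-1-only figure is larger than the composed one, which mirrors the abstract's point that a single step can overestimate SDC rates. The actual magnitudes depend entirely on measured probabilities that this sketch does not supply.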
