Paper Title
Trua: Efficient Task Replication for Flexible User-defined Availability in Scientific Grids
Paper Authors
Abstract
Failure is inevitable in scientific computing. As scientific applications and facilities have grown in scale over the last decades, finding the root cause of a failure can be very complex or, at times, nearly impossible. Different scientific computing customers have varying availability demands as well as differing willingness to pay for availability. In contrast to existing solutions that try to provide ever higher availability in scientific grids, we propose a model called Task Replication for User-defined Availability (Trua). Trua provides flexible, user-defined availability in scientific grids, allowing customers to express their desired availability to computational providers. Trua differs from existing task replication approaches in two ways. First, it relies on historical failure information collected from the virtual layer of the scientific grids. The reliability model for the failures can be represented with a bimodal Johnson distribution, which differs from any existing distribution. Second, it adopts an anomaly detector to filter out anomalous failures; it additionally adopts novel selection algorithms to mitigate the effects of temporal and spatial correlations among failures without knowing their root cause. We apply Trua to real-world traces collected from the Open Science Grid (OSG). Our results show that Trua can successfully meet user-defined availability demands.
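To make the idea of replicating a task until a user-defined availability target is met more concrete, the following is a minimal sketch and not Trua's actual algorithm. It assumes every replica fails independently with the same probability, which is exactly the simplification that Trua's selection algorithms are designed to avoid by accounting for correlated failures. The function name `replicas_needed` and its parameters are hypothetical and chosen only for illustration.

```python
def replicas_needed(p_fail: float, target_availability: float,
                    max_replicas: int = 1000) -> int:
    """Return the smallest replica count n such that at least one replica
    succeeds with probability >= target_availability.

    Assumption (not from the paper): replicas fail independently, each with
    probability p_fail, so P(all n replicas fail) = p_fail ** n.
    """
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    if not 0.0 <= target_availability < 1.0:
        raise ValueError("target_availability must be in [0, 1)")

    allowed_failure = 1.0 - target_availability
    for n in range(1, max_replicas + 1):
        # The task is unavailable only if every one of the n replicas fails.
        if p_fail ** n <= allowed_failure:
            return n
    raise RuntimeError("target not reachable within max_replicas")


if __name__ == "__main__":
    # A user asking for higher availability is assigned more replicas.
    for target in (0.5, 0.99, 0.999):
        print(target, replicas_needed(0.2, target))
```

Under this simplified model, a stricter availability request translates directly into more replicas, which is the lever a user-defined availability scheme exposes. With correlated failures the independence formula would overestimate the availability actually achieved, which is why replica placement matters in practice.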