Durham Research Online
Doubt and Redundancy Kill Soft Errors---Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software

Samfass, Philipp and Weinzierl, Tobias and Reinarz, Anne and Bader, Michael (2021) 'Doubt and Redundancy Kill Soft Errors---Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software.', Supercomputing 21 - FTXS Workshop St Louis, MO, 14-19 Nov 2021.

Abstract

Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints. Resiliency must not significantly increase the runtime, memory footprint or I/O demands. We propose a task-based soft error detection scheme that relies on error criteria per task outcome. These criteria formalise how “dubious” an outcome is, i.e. how likely it is to contain an error. Our whole simulation is replicated once, forming two teams of MPI ranks that share their task results. Thus, ideally each team handles only around half of the workload. If a task yields large error criteria values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with the lower error likelihood is accepted. We obtain a self-healing, resilient algorithm which can compensate for silent floating-point errors without a significant performance, I/O or memory footprint penalty. Case studies, however, suggest that a careful, domain-specific tailoring of the error criteria remains essential.
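
The per-task decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Outcome`, `max_jump`, `make_outcome`, `agree` and `resolve`, the threshold, and the "largest jump between neighbouring values" criterion are all assumptions made for illustration, and the exchange of task results between the two teams of MPI ranks is reduced to a plain function argument.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Outcome:
    values: List[float]
    doubt: float  # error criterion value: larger means more dubious


def max_jump(values: List[float]) -> float:
    """Hypothetical error criterion: largest jump between neighbouring entries."""
    return max((abs(b - a) for a, b in zip(values, values[1:])), default=0.0)


def make_outcome(values: List[float]) -> Outcome:
    """Attach the (assumed) error criterion to a task result."""
    return Outcome(values, max_jump(values))


def agree(a: Outcome, b: Outcome, tol: float = 1e-12) -> bool:
    """Two outcomes agree if all entries match up to a tolerance."""
    return len(a.values) == len(b.values) and all(
        abs(x - y) <= tol for x, y in zip(a.values, b.values)
    )


def resolve(shared: Outcome,
            recompute: Callable[[], Outcome],
            threshold: float) -> Outcome:
    """Decide which outcome to accept for one task.

    shared    -- outcome received from the other team (assumed already available)
    recompute -- callback that computes the same task redundantly
    threshold -- assumed cut-off above which an outcome counts as dubious
    """
    if shared.doubt <= threshold:
        return shared  # not dubious: accept without redundant work
    local = recompute()  # dubious: compute the task a second time
    if agree(shared, local):
        return shared
    # Outcomes disagree: accept the one with the lower error criterion.
    return min(shared, local, key=lambda o: o.doubt)


# Toy task: halve every input value; one entry of the shared copy is corrupted.
inputs = [1.0, 2.0, 3.0, 4.0]
clean = make_outcome([0.5 * x for x in inputs])
corrupted_values = list(clean.values)
corrupted_values[1] = 1.0e6  # simulated silent data corruption
corrupted = make_outcome(corrupted_values)

healed = resolve(corrupted, lambda: clean, threshold=10.0)
print(healed.values)
```

In this toy setting the corrupted outcome exceeds the threshold, the task is recomputed, and the clean result wins because its criterion value is smaller. A smoothness-style criterion is only one possible choice; which criterion is suitable depends on the application.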

Item Type: Conference item (Paper)
Full text: (AM) Accepted Manuscript
Status: Peer-reviewed
Publisher Web site: https://ieeexplore.ieee.org/xpl/conhome/9909541/proceeding
Publisher statement: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Date accepted: 04 October 2021
Date deposited: 03 November 2022
Date of first online publication: 2021
Date first made open access: 03 November 2022
