Type: Journal article
Title: Fault tolerant computation with the sparse grid combination technique
Author: Harding, B.
Hegland, M.
Larson, J.
Southern, J.
Citation: SIAM Journal on Scientific Computing, 2015; 37(3):C331-C353
Publisher: Society for Industrial and Applied Mathematics
Issue Date: 2015
ISSN: 1064-8275
Statement of Responsibility: Brendan Harding, Markus Hegland, Jay Larson and James Southern
Abstract: This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J. Electron. Suppl., 54 (2013), pp. C394–C411]. This approach to fault tolerance is novel for two reasons: first, the combination technique adds an additional level of parallelism, and second, it provides algorithm-based fault tolerance so that solutions can still be recovered if failures occur during computation. Previous work indicates how the combination technique may be adapted for a low number of faults. In this paper we develop a generalization of the combination technique for which arbitrary collections of coarse approximations may be combined to obtain an accurate approximation. A general fault tolerant combination technique for large numbers of faults is a natural consequence of this work. Using a renewal model for the time between faults on each node of a high performance computer, we also provide bounds on the expected error for interpolation with this algorithm in the presence of faults. Numerical experiments solving the scalar advection PDE demonstrate that the algorithm is resilient to faults on a real application. It is observed that the time to solution is not significantly affected by the presence of (simulated) faults. Additionally, the expected error increases with the number of faults but is relatively small even for high fault rates. A comparison with traditional checkpoint-restart methods applied to the combination technique shows that our approach is highly scalable with respect to the number of faults.
Keywords: Exascale computing; algorithm-based fault tolerance; sparse grid combination technique; parallel algorithms
Rights: © 2015 Society for Industrial and Applied Mathematics
DOI: 10.1137/140964448
Appears in Collections: Aurora harvest 8
Mathematical Sciences publications
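
Illustrative note on the abstract: the idea of combining an arbitrary collection of coarse approximations can be sketched with the standard inclusion-exclusion formula for combination coefficients over a downward-closed set of grid levels. The sketch below is not taken from the paper. The function names `combination_coefficients` and `remove_failed`, the level numbering starting at 0, and the recovery rule (discard a failed grid together with every finer grid above it, then recompute the coefficients) are assumptions made for illustration only, and the paper's own algorithm may choose coefficients differently.

```python
from itertools import product


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed set of level
    multi-indices: c_i = sum over z in {0,1}^d with i+z in the set of (-1)^|z|.
    Only nonzero coefficients are returned."""
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for i in index_set:
        c = 0
        for z in product((0, 1), repeat=dim):
            if tuple(a + b for a, b in zip(i, z)) in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[i] = c
    return coeffs


def remove_failed(index_set, failed):
    """Hypothetical recovery rule: drop each failed grid and every grid that is
    componentwise at or above it, so the remaining set stays downward closed."""
    return {
        i
        for i in index_set
        if not any(all(a >= b for a, b in zip(i, f)) for f in failed)
    }


# Classical 2D index set of level n = 3: all (i, j) with i + j <= n.
n = 3
classical = {(i, j) for i in range(n + 1) for j in range(n + 1) if i + j <= n}
coeffs = combination_coefficients(classical)

# Simulate the loss of the grid at level (2, 1) and recompute the coefficients.
recovered = combination_coefficients(remove_failed(classical, [(2, 1)]))

print(sorted(coeffs.items()))
print(sorted(recovered.items()))
```

In this sketch the coefficients sum to 1 both before and after the simulated fault, and the failed grid no longer appears in the recovered combination, so the remaining coarse solutions can be combined without recomputing the lost one.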