2012 | OriginalPaper | Buchkapitel
A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
verfasst von : Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra
Erschienen in: Euro-Par 2012 Parallel Processing
Verlag: Springer Berlin Heidelberg
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
Most predictions of Exascale machines picture billion way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint based approaches incur a steep overhead on failure free operations and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support of software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high quality implementation, as defined by the current MPI standard, to enable algorithmic based recovery, without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large scale systems, using the QR factorization as an example.