2019 | OriginalPaper | Book chapter

Resilient Optimistic Termination Detection for the Async-Finish Model

Authors: Sara S. Hamouda, Josh Milthorpe

Published in: High Performance Computing

Publisher: Springer International Publishing

Abstract

Driven by increasing core count and decreasing mean-time-to-failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to decompose a computation into nested task groups, each managed by a ‘finish’ that signals the termination of all tasks within the group.
For distributed termination detection, maintaining a consistent view of task state across multiple unreliable processes requires additional book-keeping when creating or completing tasks and finish-scopes. Runtime systems which perform this book-keeping pessimistically, i.e. synchronously with task state changes, add a high communication overhead compared to non-resilient protocols. In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model. By avoiding the communication of certain task and finish events, this protocol allows uncertainty about the global structure of the computation which can be resolved correctly at failure time, thereby reducing the overhead for failure-free execution.
Performance results using micro-benchmarks and the LULESH hydrodynamics proxy application show significant reductions in resilience overhead with optimistic finish compared to pessimistic finish. Our optimistic finish protocol is applicable to any task-based runtime system offering automatic termination detection for dynamic graphs of non-migratable tasks.
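
To make the async-finish structure described in the abstract concrete, the following is a minimal single-process sketch. It is not the protocol proposed in the paper and involves no distribution or failure handling. The `Finish` class, its `spawn` method, and `run_example` are hypothetical names introduced only for illustration; the sketch assumes each task is a Python thread and that a finish simply counts its pending tasks.

```python
import threading


class Finish:
    """Hypothetical helper: tracks a group of tasks until all have terminated."""

    def __init__(self):
        self._pending = 0
        self._cond = threading.Condition()

    def spawn(self, fn, *args):
        # Count the task before it starts so wait() cannot miss it.
        with self._cond:
            self._pending += 1

        def run():
            try:
                fn(*args)
            finally:
                # Signal the finish when the last pending task completes.
                with self._cond:
                    self._pending -= 1
                    if self._pending == 0:
                        self._cond.notify_all()

        threading.Thread(target=run).start()

    def wait(self):
        with self._cond:
            while self._pending > 0:
                self._cond.wait()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the scope blocks until every task in the group is done.
        self.wait()
        return False


def run_example():
    log = []
    lock = threading.Lock()

    def record(item):
        with lock:
            log.append(item)

    def child(i):
        # Nested finish: this task waits for its own two subtasks.
        with Finish() as inner:
            for j in range(2):
                inner.spawn(record, (i, j))
        record((i, "done"))

    with Finish() as outer:
        for i in range(3):
            outer.spawn(child, i)
    return log
```

Calling `run_example()` returns only after the outer finish has observed the termination of all three child tasks, each of which in turn waits on its own nested finish. In a distributed, resilient setting the counter held here in one process's memory is the state that must survive process failures, which is the book-keeping whose cost the abstract discusses.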

Metadata
Title
Resilient Optimistic Termination Detection for the Async-Finish Model
Authors
Sara S. Hamouda
Josh Milthorpe
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-20656-7_15
