2018 | Original Paper | Book Chapter

EXAHD: An Exa-Scalable Two-Level Sparse Grid Approach for Higher-Dimensional Problems in Plasma Physics and Beyond

Written by: Mario Heene, Alfredo Parra Hinojosa, Michael Obersteiner, Hans-Joachim Bungartz, Dirk Pflüger

Published in: High Performance Computing in Science and Engineering '17

Publisher: Springer International Publishing

Abstract

Within the current reporting period (04/2016–04/2017) of our HLRS project, we have developed a scalable implementation of the fault-tolerant combination technique. Fault tolerance is one of the key topics in ongoing research on algorithms for future exascale systems. Our algorithms provide fault tolerance for both hard and soft faults in the efficient, massively parallel solution of high-dimensional PDEs, without the need for checkpointing or process replication. The research project EXAHD is part of the DFG’s priority program “Software for Exascale Computing” (SPPEXA). The project’s target application is the large-scale simulation of plasma turbulence with the code GENE. The report combines parts of three publications.

Metadata
Title
EXAHD: An Exa-Scalable Two-Level Sparse Grid Approach for Higher-Dimensional Problems in Plasma Physics and Beyond
Written by
Mario Heene
Alfredo Parra Hinojosa
Michael Obersteiner
Hans-Joachim Bungartz
Dirk Pflüger
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-68394-2_31