Published in: The Journal of Supercomputing

Publication date: 30-05-2020

Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems

Authors: Daniel J. Magee, Anthony S. Walker, Kyle E. Niemeyer

Published in: The Journal of Supercomputing | Issue 2/2021

Abstract

Applications that exploit the architectural details of high-performance computing (HPC) systems have become increasingly invaluable in academia and industry over the past two decades. The most important hardware development of the last decade in HPC has been the general purpose graphics processing unit (GPGPU), a class of massively parallel devices that now contributes the majority of computational power in the top 500 supercomputers. As these systems grow, small costs such as latency—due to the fixed cost of memory accesses and communication—accumulate in a large simulation and become a significant barrier to performance. The swept time-space decomposition rule is a communication-avoiding technique for time-stepping stencil update formulas that attempts to reduce latency costs. This work extends the swept rule by targeting heterogeneous, CPU/GPU architectures representing current and future HPC systems. We compare our approach to a naive decomposition scheme with two test equations using an MPI+CUDA pattern on 40 processes over two nodes containing one GPU. The swept rule produces a factor of 1.9 to 23 speedup for the heat equation and a factor of 1.1 to 2.0 speedup for the Euler equations, using the same processors and work distribution, and with the best possible configurations. These results show the potential effectiveness of the swept rule for different equations and numerical schemes on massively parallel compute systems that incur substantial latency costs.
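
The communication-avoiding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' MPI+CUDA implementation: it assumes a 1D heat equation with a three-point explicit (FTCS) update, and the names `heat_step`, `up_triangle`, `reference`, and the coefficient `r` are hypothetical. The sketch shows only that a block of n points can advance (n - 1) // 2 time steps from its own data, losing one point per edge per step, before any neighbor values are needed. A scheme that exchanges boundaries every step would communicate at each of those steps.

```python
# Illustrative sketch only; all names and the FTCS update are assumptions,
# not taken from the paper's code.

def heat_step(u, r):
    """One explicit step using only points inside u; result is 2 points shorter."""
    return [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)]


def up_triangle(block, r):
    """Advance a block as far as its own data allows, with no communication.

    levels[k] holds the values at time step k for block indices k .. n-1-k,
    so the valid region shrinks into a triangle in space-time.
    """
    levels = [list(block)]
    while len(levels[-1]) >= 3:
        levels.append(heat_step(levels[-1], r))
    return levels


def reference(u, r, steps):
    """Periodic full-domain solve, standing in for a scheme that
    exchanges boundary values at every time step."""
    n = len(u)
    for _ in range(steps):
        u = [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
             for i in range(n)]
    return u


# Hypothetical setup: 16 grid points split into two blocks of 8.
domain = [float((3 * i) % 7) for i in range(16)]
levels = up_triangle(domain[:8], r=0.25)
print("steps without communication:", len(levels) - 1)
```

The values inside the triangle agree with the full-domain solve at the same time steps, which is why no exchange is needed until the triangle closes. How the remaining space-time region is filled in, and how work is divided between CPU and GPU, is beyond this sketch.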

Alhubail and Wang [3] use the term “straight” decomposition where we use Classic.

Here we use “block” and “domain” interchangeably to represent a domain of dependence; the term comes from the GPU/CUDA construct representing a collection of threads.

1. Alexandrov V (2016) Route to exascale: novel mathematical methods, scalable algorithms and computational science skills. J Comput Sci 14:1–4. https://doi.org/10.1016/j.jocs.2016.04.014

2. Alhubail M, Wang Q (2015) K-S_1D_Swept, git commit e575d73. https://github.com/hubailmm/K-S_1D_Swept

3. Alhubail M, Wang Q (2016) The swept rule for breaking the latency barrier in time advancing PDEs. J Comput Phys 307:110–121. https://doi.org/10.1016/j.jcp.2015.11.026

4. Alhubail M, Wang Q (2017) Improving the strong parallel scalability of CFD schemes via the swept domain decomposition rule. In: 55th AIAA Aerospace Sciences Meeting, Grapevine, Texas, American Institute of Aeronautics and Astronautics, January 2017. https://doi.org/10.2514/6.2017-1218

5. Alhubail MM, Wang Q, Williams J (2016) The swept rule for breaking the latency barrier in time advancing two-dimensional PDEs. arXiv:1602.07558 [cs.NA]

6. Baboulin M, Donfack S, Dongarra J, Grigori L, Rémy A, Tomov S (2012) A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines. Procedia Comput Sci 9:17–26. https://doi.org/10.1016/j.procs.2012.04.003

7. Ballard G, Demmel J, Holtz O, Schwartz O (2011) Minimizing communication in numerical linear algebra. SIAM J Matrix Anal Appl 32(3):866–901

8. Clarke L, Glendinning I, Hempel R (1994) The MPI message passing interface standard. Programming environments for massively parallel distributed systems. Birkhäuser, Basel, pp 213–218. https://doi.org/10.1007/978-3-0348-8534-8_21

9. Datta K, Murphy M, Volkov V, Williams S, Carter J, Oliker L, Patterson D, Shalf J, Yelick K (2008) Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, IEEE Press, Piscataway, NJ, USA, pp 4:1–4:12

10. Demmel J, Hoemmen M, Mohiyuddin M, Yelick K (2008) Avoiding communication in sparse matrix computations. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, IEEE, pp 1–12

11. Emmett M, Minion M (2012) Toward an efficient parallel in time method for partial differential equations. Commun Appl Math Comput Sci 7(1):105–132

12. Falgout RD, Manteuffel TA, O’Neill B, Schroder JB (2017) Multigrid reduction in time for nonlinear parabolic problems: a case study. SIAM J Sci Comput 39(5):S298–S322. https://doi.org/10.1137/16M1082330

13. Falgout RD, Friedhoff S, Kolev TV, MacLachlan SP, Schroder JB (2014) Parallel time integration with multigrid. SIAM J Sci Comput 36(6):C635–C661

14. Gander MJ (2015) 50 years of time parallel time integration. In: Carraro T, Geiger M, Körkel S, Rannacher R (eds) Multiple Shooting and Time Domain Decomposition Methods, vol 9. Contributions in Mathematical and Computational Sciences. Cham, Springer, pp 69–113. https://doi.org/10.1007/978-3-319-23321-5_3

15. Gander MJ, Güttel S (2013) Paraexp: a parallel integrator for linear initial-value problems. SIAM J Sci Comput 35(2):C123–C142

16. Gander MJ, Neumüller M (2016) Analysis of a new space-time parallel multigrid algorithm for parabolic problems. SIAM J Sci Comput 38(4):A2173–A2208

17. Huerta YA, Swartz B, Lilja DJ (2017) Determining work partitioning on closely coupled heterogeneous computing systems using statistical design of experiments. In: 2017 IEEE International Symposium on Workload Characterization (IISWC), October 2017, pp 118–119. https://doi.org/10.1109/IISWC.2017.8167766

18. Jacobsen DA, Senocak I (2013) Multi-level parallelism for incompressible flow computations on GPU clusters. Parallel Comput 39(1):1–20. https://doi.org/10.1016/j.parco.2012.10.002

19. Khabou A, Demmel JW, Grigori L, Gu M (2013) LU factorization with panel rank revealing pivoting and its communication avoiding version. SIAM J Matrix Anal Appl 34(3):1401–1429

20. Lions J-L, Maday Y, Turinici G (2001) Résolution d'EDP par un schéma en temps «pararéel». Proc Acad Sci Ser I Math 332(7):661–668

21. Lu F, Song J, Yin F, Zhu X (2012) Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters. Comput Phys Commun 183(6):1172–1181. https://doi.org/10.1016/j.cpc.2012.01.019

22. Magee DJ, Niemeyer KE (2018) Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time-space decomposition. J Comput Phys 357:338–352. https://doi.org/10.1016/j.jcp.2017.12.028

23. Magee DJ, Niemeyer KE (2018) Niemeyer-Research-Group/hSweep: MS Thesis Official, June 2018. https://doi.org/10.5281/zenodo.1291212

24. Magee DJ, Walker AS, Niemeyer KE (2020) Data, plotting scripts, and figures for “Applying the swept rule for explicit partial differential equation solutions on heterogeneous computing systems”. Zenodo. https://doi.org/10.5281/zenodo.3787144

25. Malas T, Hager G, Ltaief H, Stengel H, Wellein G, Keyes D (2015) Multicore-optimized wavefront diamond blocking for optimizing stencil updates. SIAM J Sci Comput 37(4):C439–C464. https://doi.org/10.1137/140991133

26. Mills RT, Rupp K, Adams M, Brown J, Isaac T, Knepley M, Smith B, Zhang H (2017) Software strategy and experiences with manycore processor support in PETSc. In: SIAM Pacific Northwest Regional Conference, October 2017

27. Minion ML, Speck R, Bolten M, Emmett M, Ruprecht D (2015) Interweaving PFASST and parallel multigrid. SIAM J Sci Comput 37(5):S244–S263

28. Solomonik E, Ballard G, Demmel J, Hoefler T (2017) A communication-avoiding parallel algorithm for the symmetric eigenvalue problem. In: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pp 111–121

29. Wang Q (2017) Decomposition of stencil update formula into atomic stages. arXiv:1606.00721 [math.NA]

30. Wu S-L, Zhou T (2018) Parareal algorithms with local time-integrators for time fractional differential equations. J Comput Phys 358:135–149. https://doi.org/10.1016/j.jcp.2017.12.029

Title: Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems
Authors: Daniel J. Magee, Anthony S. Walker, Kyle E. Niemeyer
Publication date: 30-05-2020
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 2/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-020-03340-9
