Skip to main content
Erschienen in: The Journal of Supercomputing 5/2016

01.05.2016

A novel MPI reduction algorithm resilient to imbalances in process arrival times

verfasst von: P. Marendic, J. Lemeire, D. Vucinic, P. Schelkens

Erschienen in: The Journal of Supercomputing | Ausgabe 5/2016

Einloggen

Aktivieren Sie unsere intelligente Suche um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Reduction algorithms are optimized only under the assumption that all processes commence the reduction simultaneously. Research on process arrival times has shown that this is rarely the case. Thus, all benchmarking methodologies that take into account only balanced arrival times might not portray a true picture of real-world algorithm performance. In this paper, we select a subset of four reduction algorithms frequently used by library implementations and evaluate their performance for both balanced and imbalanced process arrival times. The main contribution of this paper is a novel imbalance robust algorithm that uses pre-knowledge of process arrival times to construct reduction schedules. The performance of selected algorithms was empirically evaluated on a 128 node subset of the Partnership for Advanced Computing in Europe CURIE supercomputer. The reported results show that the new imbalance robust algorithm universally outperforms all the selected algorithms, whenever the reduction schedule is precomputed. We find that when the cost of schedule construction is included in the total runtime, the new algorithm outperforms the selected algorithms for problem sizes greater than 1 MiB.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
Helsim is an electromagnetic explicit 3D in situ-visualized resilient particle-in-cell simulator, developed in the Leuven Intel ExaScale Lab, Belgium. It is a combined multidisciplinary effort integrating astrophysics, linear solvers, runtime environment, in situ visualization and architectural optimization-focused simulations. It was developed to be a proto-app, showing a realistic example of trade-offs between computation and communication on a small, manageable code base with modern implementation techniques. It was implemented in C\(++\)11 utilizing the inlab Shark PGAS library for all distributed data structures and the Cobra library for load balancing and resiliency.
 
2
VSC muk is a tier-1 cluster machine at the Flemish Supercomputing Center. It has 528 compute nodes with two Xeon E5-2670 processors and 64 GiB of RAM memory. The nodes are interconnected with an FDR Infiniband interconnect in a fat tree topology (1:2 oversubscription).
 
Literatur
2.
Zurück zum Zitat Ferreira KB, Bridges P, Brightwell R (2008) Characterizing application sensitivity to OS interference using kernel-level noise injection. In: Proceedings of the 2008 ACM/IEEE conference on supercomputing (SC’08). IEEE Press, Piscataway, pp 19:1–19:12 Ferreira KB, Bridges P, Brightwell R (2008) Characterizing application sensitivity to OS interference using kernel-level noise injection. In: Proceedings of the 2008 ACM/IEEE conference on supercomputing (SC’08). IEEE Press, Piscataway, pp 19:1–19:12
3.
Zurück zum Zitat Faraj A, Patarasuk P, Yuan X (2008) A study of process arrival patterns for MPI collective operations. Int J Parallel Program 36(6):571–591 Faraj A, Patarasuk P, Yuan X (2008) A study of process arrival patterns for MPI collective operations. Int J Parallel Program 36(6):571–591
4.
Zurück zum Zitat Huang C, Lawlor O, Kale LV (2004) Adaptive MPI. In: Languages and compilers for parallel computing. Springer, New York, pp 306–322 Huang C, Lawlor O, Kale LV (2004) Adaptive MPI. In: Languages and compilers for parallel computing. Springer, New York, pp 306–322
5.
Zurück zum Zitat Mamidala A, Liu J, Panda DK (2004) Efficient barrier and allreduce on infiniband clusters using multicast and adaptive algorithms. In: Proceedings of the 2004 IEEE international conference on cluster computing (CLUSTER’04). IEEE Computer Society, Washington, DC, pp 135–144 Mamidala A, Liu J, Panda DK (2004) Efficient barrier and allreduce on infiniband clusters using multicast and adaptive algorithms. In: Proceedings of the 2004 IEEE international conference on cluster computing (CLUSTER’04). IEEE Computer Society, Washington, DC, pp 135–144
7.
Zurück zum Zitat Qian Y (2010) Design and evaluation of efficient collective communications on modern interconnects and multi-core clusters. Ph.D. thesis, Queen’s University, Kingston Qian Y (2010) Design and evaluation of efficient collective communications on modern interconnects and multi-core clusters. Ph.D. thesis, Queen’s University, Kingston
9.
Zurück zum Zitat Karp RM, Sahay A, Santos EE, Schauser KE (1993) Optimal broadcast and summation in the LogP model. In: Proceedings of the fifth annual ACM symposium on parallel algorithms and architectures. ACM, New York, pp 142–153 Karp RM, Sahay A, Santos EE, Schauser KE (1993) Optimal broadcast and summation in the LogP model. In: Proceedings of the fifth annual ACM symposium on parallel algorithms and architectures. ACM, New York, pp 142–153
10.
Zurück zum Zitat Louis-Claude Canon GA (2012) Scheduling associative reductions with homogenous costs when overlapping communications and computations. Tech. Rep. 7898, Inria Louis-Claude Canon GA (2012) Scheduling associative reductions with homogenous costs when overlapping communications and computations. Tech. Rep. 7898, Inria
11.
Zurück zum Zitat Rabenseifner R (2004) Optimization of collective reduction operations. In: Procs. of int. conf. on computational science (ICCS), pp 1–9 Rabenseifner R (2004) Optimization of collective reduction operations. In: Procs. of int. conf. on computational science (ICCS), pp 1–9
12.
Zurück zum Zitat Rabenseifner R, Trff JL (2004) More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems. In: EuroPVM/MPI, pp 36–46 Rabenseifner R, Trff JL (2004) More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems. In: EuroPVM/MPI, pp 36–46
13.
Zurück zum Zitat Patarasuk P, Yuan X (2009) Bandwidth optimal all-reduce algorithms for clusters of workstations. J Parallel Distrib Comput 69(2):117–124CrossRef Patarasuk P, Yuan X (2009) Bandwidth optimal all-reduce algorithms for clusters of workstations. J Parallel Distrib Comput 69(2):117–124CrossRef
14.
Zurück zum Zitat Jain N, Sabharwal Y (2010) Optimal bucket algorithms for large MPI collectives on torus interconnects. In: Proceedings of the 24th ACM international conference on supercomputing. ACM, New York, pp 27–36 Jain N, Sabharwal Y (2010) Optimal bucket algorithms for large MPI collectives on torus interconnects. In: Proceedings of the 24th ACM international conference on supercomputing. ACM, New York, pp 27–36
15.
Zurück zum Zitat Chan E, Heimlich M, Purkayastha A, van de Geijn R (2007) Collective communication: theory, practice, and experience. Concurr Comput Pract Exp 19(13):1749–1783CrossRef Chan E, Heimlich M, Purkayastha A, van de Geijn R (2007) Collective communication: theory, practice, and experience. Concurr Comput Pract Exp 19(13):1749–1783CrossRef
16.
Zurück zum Zitat Peterka T, Goodell D, Ross R, Shen H-W, Thakur R (2009) A configurable algorithm for parallel image-compositing applications. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’09). ACM, New York, pp 4:1–4:10. doi:10.1145/1654059.1654064 Peterka T, Goodell D, Ross R, Shen H-W, Thakur R (2009) A configurable algorithm for parallel image-compositing applications. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’09). ACM, New York, pp 4:1–4:10. doi:10.​1145/​1654059.​1654064
17.
Zurück zum Zitat Kendall W, Peterka T, Huang J, Shen H-W, Ross R (2010) Accelerating and benchmarking Radix-\(k\) image compositing at large scale. In: Proceedings of the 10th eurographics conference on parallel graphics and visualization (EG PGV’10). Eurographics Association, Aire-la-Ville, pp 101–110. doi:10.2312/EGPGV/EGPGV10/101-110 Kendall W, Peterka T, Huang J, Shen H-W, Ross R (2010) Accelerating and benchmarking Radix-\(k\) image compositing at large scale. In: Proceedings of the 10th eurographics conference on parallel graphics and visualization (EG PGV’10). Eurographics Association, Aire-la-Ville, pp 101–110. doi:10.​2312/​EGPGV/​EGPGV10/​101-110
18.
Zurück zum Zitat Pjesivac-Grbovic J, Angskun T, Bosilca G, Fagg GE, Dongarra GJJ (2005) Performance analysis of MPI collective operations. In: IEEE international parallel and distributed processing symposium Pjesivac-Grbovic J, Angskun T, Bosilca G, Fagg GE, Dongarra GJJ (2005) Performance analysis of MPI collective operations. In: IEEE international parallel and distributed processing symposium
19.
Zurück zum Zitat Sanders P, Speck J, Trff JL (2009) Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Comput 35(12):581–594. (Selected papers from the 14th European PVM/MPI users group meeting) Sanders P, Speck J, Trff JL (2009) Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Comput 35(12):581–594. (Selected papers from the 14th European PVM/MPI users group meeting)
20.
Zurück zum Zitat Thakur R, Rabenseifner R, Gropp W (2005) Optimization of collective communication operations in MPICH. Int J High Perform Comput Appl 19(1):49–66CrossRef Thakur R, Rabenseifner R, Gropp W (2005) Optimization of collective communication operations in MPICH. Int J High Perform Comput Appl 19(1):49–66CrossRef
21.
Zurück zum Zitat Hoefler T, Moor D (2014) Energy, memory, and runtime tradeoffs for implementing collective communication operations. J Supercomput Front Innov 1(2):58–75 Hoefler T, Moor D (2014) Energy, memory, and runtime tradeoffs for implementing collective communication operations. J Supercomput Front Innov 1(2):58–75
22.
Zurück zum Zitat Fabrizio Petrini SP, Kerbyson Darren J (2003) The case of the missing supercomputer performance: achieving optimal performance on the 8,192 processors of ASCI Q. In: Proceedings of the 2003 ACM/IEEE conference on supercomputing (SC’03), p 55 Fabrizio Petrini SP, Kerbyson Darren J (2003) The case of the missing supercomputer performance: achieving optimal performance on the 8,192 processors of ASCI Q. In: Proceedings of the 2003 ACM/IEEE conference on supercomputing (SC’03), p 55
23.
Zurück zum Zitat Agarwal S, Garg R, Vishnoi NK (2005) The impact of noise on the scaling of collectives: a theoretical approach. In: Proceedings of the 12th international conference on high performance computing (HiPC’05). Springer, Berlin, pp 280–289. doi:10.1007/11602569_31 Agarwal S, Garg R, Vishnoi NK (2005) The impact of noise on the scaling of collectives: a theoretical approach. In: Proceedings of the 12th international conference on high performance computing (HiPC’05). Springer, Berlin, pp 280–289. doi:10.​1007/​11602569_​31
24.
Zurück zum Zitat Hoefler T, Schneider T, Lumsdaine A (2010) Characterizing the influence of system noise on large-scale applications by simulation. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC’10). IEEE Computer Society, Washington, DC, pp 1–11. doi:10.1109/SC.2010.12 Hoefler T, Schneider T, Lumsdaine A (2010) Characterizing the influence of system noise on large-scale applications by simulation. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC’10). IEEE Computer Society, Washington, DC, pp 1–11. doi:10.​1109/​SC.​2010.​12
25.
Zurück zum Zitat Ghysels P, Ashby TJ, Meerbergen K, Vanroose W (2013) Hiding global communication latency in the gmres algorithm on massively parallel machines. SIAM J Sci Comput 35(1):C48–C71 Ghysels P, Ashby TJ, Meerbergen K, Vanroose W (2013) Hiding global communication latency in the gmres algorithm on massively parallel machines. SIAM J Sci Comput 35(1):C48–C71
26.
Zurück zum Zitat Ferreira KB, Bridges PG, Brightwell R, Pedretti KT (2013) The impact of system design parameters on application noise sensitivity. Clust Comput 16(1):117–129CrossRef Ferreira KB, Bridges PG, Brightwell R, Pedretti KT (2013) The impact of system design parameters on application noise sensitivity. Clust Comput 16(1):117–129CrossRef
27.
Zurück zum Zitat Eichenberger AE, Abraham SG (1995) Impact of load imbalance on the design of software barriers. In: Proceedings of the 1995 international conference on parallel processing, pp 63–72 Eichenberger AE, Abraham SG (1995) Impact of load imbalance on the design of software barriers. In: Proceedings of the 1995 international conference on parallel processing, pp 63–72
28.
Zurück zum Zitat Marendic P, Lemeire J, Haber T, Vucinic D, Schelkens P (2012) An investigation into the performance of reduction algorithms under load imbalance. In: Kaklamanis C, Papatheodorou T, Spirakis P (eds) Euro-Par 2012 parallel processing. Lecture notes in computer science, vol 7484. Springer, Berlin, pp 439–450. doi:10.1007/978-3-642-32820-6_44 Marendic P, Lemeire J, Haber T, Vucinic D, Schelkens P (2012) An investigation into the performance of reduction algorithms under load imbalance. In: Kaklamanis C, Papatheodorou T, Spirakis P (eds) Euro-Par 2012 parallel processing. Lecture notes in computer science, vol 7484. Springer, Berlin, pp 439–450. doi:10.​1007/​978-3-642-32820-6_​44
29.
Zurück zum Zitat Chan EW, Heimlich MF, Purkayastha A, Van De Geijn RA (2004) On optimizing collective communication. In: 2004 IEEE international conference on cluster computing. IEEE, pp 145–155 Chan EW, Heimlich MF, Purkayastha A, Van De Geijn RA (2004) On optimizing collective communication. In: 2004 IEEE international conference on cluster computing. IEEE, pp 145–155
30.
Zurück zum Zitat Träff JL, Ripke A (2008) Optimal broadcast for fully connected processor-node networks. J Parallel Distrib Comput 68(7):887–901CrossRefMATH Träff JL, Ripke A (2008) Optimal broadcast for fully connected processor-node networks. J Parallel Distrib Comput 68(7):887–901CrossRefMATH
31.
Zurück zum Zitat Lastovetsky A, Rychkov V, OFlynn M, Mpiblib (2008) Benchmarking MPI communications for parallel computing on homogeneous and heterogeneous clusters. In: Recent advances in parallel virtual machine and message passing interface. Springer, New York, pp 227–238 Lastovetsky A, Rychkov V, OFlynn M, Mpiblib (2008) Benchmarking MPI communications for parallel computing on homogeneous and heterogeneous clusters. In: Recent advances in parallel virtual machine and message passing interface. Springer, New York, pp 227–238
33.
Zurück zum Zitat Fredman ML, Sedgewick R, Sleator DD, Tarjan RE (1986) The pairing heap: a new form of self-adjusting heap. Algorithmica 1(1–4):111–129MathSciNetCrossRefMATH Fredman ML, Sedgewick R, Sleator DD, Tarjan RE (1986) The pairing heap: a new form of self-adjusting heap. Algorithmica 1(1–4):111–129MathSciNetCrossRefMATH
34.
Zurück zum Zitat Pettie S (2005) Towards a final analysis of pairing heaps. In: 46th annual IEEE symposium on foundations of computer science (FOCS’05). IEEE, pp 174–183 Pettie S (2005) Towards a final analysis of pairing heaps. In: 46th annual IEEE symposium on foundations of computer science (FOCS’05). IEEE, pp 174–183
35.
Zurück zum Zitat Ma K-L, Painter JS, Hansen CD, Krogh MF (1994) Parallel volume rendering using binary-swap compositing. Comput Graph Appl IEEE 14(4):59–68CrossRef Ma K-L, Painter JS, Hansen CD, Krogh MF (1994) Parallel volume rendering using binary-swap compositing. Comput Graph Appl IEEE 14(4):59–68CrossRef
36.
Zurück zum Zitat Yang D-L, Yu J-C, Chung Y-C (2001) Efficient compositing methods for the sort-last-sparse parallel volume rendering system on distributed memory multicomputers. J Supercomput 18(2):201–220. doi:10.1023/A:1008165001515 CrossRefMATH Yang D-L, Yu J-C, Chung Y-C (2001) Efficient compositing methods for the sort-last-sparse parallel volume rendering system on distributed memory multicomputers. J Supercomput 18(2):201–220. doi:10.​1023/​A:​1008165001515 CrossRefMATH
37.
Zurück zum Zitat Gropp W, Lusk E (1999) Reproducible measurements of MPI performance characteristics. Springer, New York, pp 11–18 Gropp W, Lusk E (1999) Reproducible measurements of MPI performance characteristics. Springer, New York, pp 11–18
39.
40.
Zurück zum Zitat Hoefler TST, Lumsdaine A (2010) Accurately measuring overhead, communication time and progression of blocking and nonblocking collective operations at massive scale. Int J Parallel Emerg Distrib Syst 25(4):241–258MathSciNetCrossRefMATH Hoefler TST, Lumsdaine A (2010) Accurately measuring overhead, communication time and progression of blocking and nonblocking collective operations at massive scale. Int J Parallel Emerg Distrib Syst 25(4):241–258MathSciNetCrossRefMATH
41.
Zurück zum Zitat Gropp W, Lusk E (1999) Reproducible measurements of mpi performance characteristics. In: Recent advances in parallel virtual machine and message passing interface. Springer, New York, pp 11–18 Gropp W, Lusk E (1999) Reproducible measurements of mpi performance characteristics. In: Recent advances in parallel virtual machine and message passing interface. Springer, New York, pp 11–18
42.
Zurück zum Zitat Reussner R, Sanders P, Träff JL, Skampi (2002) A comprehensive benchmark for public benchmarking of MPI. Sci Program 10(1):55–65 Reussner R, Sanders P, Träff JL, Skampi (2002) A comprehensive benchmark for public benchmarking of MPI. Sci Program 10(1):55–65
43.
Zurück zum Zitat Grove DA, Coddington PD (2005) Communication benchmarking and performance modelling of mpi programs on cluster computers. J Supercomput 34(2):201–217CrossRef Grove DA, Coddington PD (2005) Communication benchmarking and performance modelling of mpi programs on cluster computers. J Supercomput 34(2):201–217CrossRef
44.
Zurück zum Zitat Träff JL (2012) Mpicroscope: towards an MPI benchmark tool for performance guideline verification. In: Recent advances in the message passing interface—proceedings of 19th European MPI users’ group meeting (EuroMPI’12), Austria, pp 100–109. doi:10.1007/978-3-642-33518-1_15 Träff JL (2012) Mpicroscope: towards an MPI benchmark tool for performance guideline verification. In: Recent advances in the message passing interface—proceedings of 19th European MPI users’ group meeting (EuroMPI’12), Austria, pp 100–109. doi:10.​1007/​978-3-642-33518-1_​15
45.
Zurück zum Zitat Hunold S, Carpen-Amarie A, Träff JL (2014) Reproducible MPI micro-benchmarking isn’t as easy as you think. In: Proceedings of the 21st European MPI users’ group meeting. ACM, New York, p 69 Hunold S, Carpen-Amarie A, Träff JL (2014) Reproducible MPI micro-benchmarking isn’t as easy as you think. In: Proceedings of the 21st European MPI users’ group meeting. ACM, New York, p 69
47.
Zurück zum Zitat Buranapanichkit D, Deligiannis N, Andreopoulos Y (2015) Convergence of desynchronization primitives in wireless sensor networks: stochastic modeling approach. CoRR. arXiv:1411.2862 Buranapanichkit D, Deligiannis N, Andreopoulos Y (2015) Convergence of desynchronization primitives in wireless sensor networks: stochastic modeling approach. CoRR. arXiv:​1411.​2862
48.
Zurück zum Zitat Deligiannis N, Mota JF, Smart G, Andreopoulos Y (2015) Fast desynchronization for decentralized multichannel medium access control. IEEE Trans Commun 63(9):3336–3349CrossRef Deligiannis N, Mota JF, Smart G, Andreopoulos Y (2015) Fast desynchronization for decentralized multichannel medium access control. IEEE Trans Commun 63(9):3336–3349CrossRef
Metadaten
Titel
A novel MPI reduction algorithm resilient to imbalances in process arrival times
verfasst von
P. Marendic
J. Lemeire
D. Vucinic
P. Schelkens
Publikationsdatum
01.05.2016
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 5/2016
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-016-1707-x

Weitere Artikel der Ausgabe 5/2016

The Journal of Supercomputing 5/2016 Zur Ausgabe