Published in: The Journal of Supercomputing 11/2015

01.11.2015

Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms

Authors: Khalid Hasanov, Jean-Noël Quintin, Alexey Lastovetsky



Abstract

Many state-of-the-art parallel algorithms, which are widely used in scientific applications executed on high-end computing systems, were designed in the twentieth century with relatively small-scale parallelism in mind. Indeed, while in the 1990s a system with a few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to optimization of message-passing parallel algorithms for execution on large-scale distributed-memory systems. The idea is to reduce the communication cost by introducing hierarchy, and hence more parallelism, into the communication scheme. We apply this approach to SUMMA, the state-of-the-art parallel algorithm for matrix–matrix multiplication, and demonstrate both theoretically and experimentally that the modified Hierarchical SUMMA significantly improves the communication cost and the overall performance on large-scale platforms.
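The communication pattern the abstract describes can be illustrated with a small sequential sketch. SUMMA accumulates C = AB as a sum of outer products of column panels of A and row panels of B (one broadcast pair per panel step); the hierarchical variant splits the panel loop into groups so each group's broadcasts would stay inside a smaller communicator. This is only a sketch of the arithmetic pattern under that assumption, not the authors' MPI implementation; the function names and the `panel`/`groups` parameters are illustrative.

```python
import numpy as np

def summa_like(A, B, panel=2):
    # Sequential sketch of SUMMA's structure: C is accumulated as a
    # sum of outer products of column panels of A and row panels of B.
    # In the parallel algorithm, each panel pair corresponds to one
    # broadcast along process rows and columns.
    n = A.shape[1]
    C = np.zeros((A.shape[0], B.shape[1]))
    for k in range(0, n, panel):
        j = min(k + panel, n)
        C += A[:, k:j] @ B[k:j, :]
    return C

def hierarchical_summa_like(A, B, groups=2, panel=2):
    # Two-level sketch: the panel loop is partitioned across 'groups',
    # modeling the hierarchical scheme in which broadcasts are confined
    # to smaller groups of processors instead of the full grid.
    n = A.shape[1]
    C = np.zeros((A.shape[0], B.shape[1]))
    chunk = n // groups
    for g in range(groups):
        lo = g * chunk
        hi = (g + 1) * chunk if g < groups - 1 else n
        for k in range(lo, hi, panel):
            j = min(k + panel, hi)
            C += A[:, k:j] @ B[k:j, :]
    return C
```

Both variants perform the same arithmetic and produce the same product; the hierarchical grouping changes only where (in a real run, among which processes) each partial accumulation happens, which is precisely the source of the communication savings claimed.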


Metadata
Title
Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms
Authors
Khalid Hasanov
Jean-Noël Quintin
Alexey Lastovetsky
Publication date
01.11.2015
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 11/2015
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1133-x
