Published in: The Journal of Supercomputing 11/2015

01.11.2015

Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms

Authors: Khalid Hasanov, Jean-Noël Quintin, Alexey Lastovetsky



Abstract

Many state-of-the-art parallel algorithms, which are widely used in scientific applications executed on high-end computing systems, were designed in the twentieth century with relatively small-scale parallelism in mind. Indeed, while in the 1990s a system with a few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to optimization of message-passing parallel algorithms for execution on large-scale distributed-memory systems. The idea is to reduce the communication cost by introducing hierarchy, and hence more parallelism, into the communication scheme. We apply this approach to SUMMA, the state-of-the-art parallel algorithm for matrix–matrix multiplication, and demonstrate both theoretically and experimentally that the modified Hierarchical SUMMA significantly improves the communication cost and the overall performance on large-scale platforms.
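The communication pattern the abstract describes can be illustrated with a small sequential sketch. SUMMA accumulates C = AB as a sum of outer products of column panels of A and row panels of B (one broadcast pair per panel step); the hierarchical variant splits the panel loop into groups so each group's broadcasts would stay inside a smaller communicator. This is only a sketch of the arithmetic pattern under that assumption, not the authors' MPI implementation; the function names and the `panel`/`groups` parameters are illustrative.

```python
import numpy as np

def summa_like(A, B, panel=2):
    # Sequential sketch of SUMMA's structure: C is accumulated as a
    # sum of outer products of column panels of A and row panels of B.
    # In the parallel algorithm, each panel pair corresponds to one
    # broadcast along process rows and columns.
    n = A.shape[1]
    C = np.zeros((A.shape[0], B.shape[1]))
    for k in range(0, n, panel):
        j = min(k + panel, n)
        C += A[:, k:j] @ B[k:j, :]
    return C

def hierarchical_summa_like(A, B, groups=2, panel=2):
    # Two-level sketch: the panel loop is partitioned across 'groups',
    # modeling the hierarchical scheme in which broadcasts are confined
    # to smaller groups of processors instead of the full grid.
    n = A.shape[1]
    C = np.zeros((A.shape[0], B.shape[1]))
    chunk = n // groups
    for g in range(groups):
        lo = g * chunk
        hi = (g + 1) * chunk if g < groups - 1 else n
        for k in range(lo, hi, panel):
            j = min(k + panel, hi)
            C += A[:, k:j] @ B[k:j, :]
    return C
```

Both variants perform the same arithmetic and produce the same product; the hierarchical grouping changes only where (in a real run, among which processes) each partial accumulation happens, which is precisely the source of the communication savings claimed.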


Metadata
Title
Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms
Authors
Khalid Hasanov
Jean-Noël Quintin
Alexey Lastovetsky
Publication date
01.11.2015
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 11/2015
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1133-x
