Published in: The Journal of Supercomputing 3/2014

01.06.2014

A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

Authors: Vasilios Kelefouras, Angeliki Kritikakou, Costas Goutis


Abstract

In this paper, a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the data cache accesses and misses in the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware parameters (e.g., data cache sizes and associativities) as one problem rather than separately, yielding high-quality solutions and a smaller search space.
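The core techniques the abstract alludes to (loop tiling matched to the data caches, and an inner-loop order that exposes data reuse and contiguous streams to the SIMD unit) can be sketched as follows. This is a minimal illustration, not the authors' tuned method: the matrix size `N`, tile size `TILE`, and the i-k-j loop order are assumptions chosen for clarity; in the paper the blocking parameters are derived from the actual cache sizes and associativities.

```c
#include <stdio.h>
#include <string.h>

#define N 64     /* matrix dimension (illustrative) */
#define TILE 16  /* tile size; in practice derived from cache parameters */

/* Blocked (tiled) matrix-matrix multiplication: C = A * B.
 * Tiling keeps a TILE x TILE working set of each array resident in
 * cache; the i-k-j inner order reuses A[i][k] across the whole j loop
 * and walks B and C rows contiguously, which lets the compiler
 * auto-vectorize the innermost loop with SIMD instructions. */
static void matmul_tiled(float A[N][N], float B[N][N], float C[N][N]) {
    memset(C, 0, sizeof(float) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        float a = A[i][k]; /* reused across the j loop */
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

Note that the tile size is the key hardware-dependent knob: auto-tuners such as ATLAS search for it empirically, whereas the methodology described here computes it analytically from the cache sizes and associativities.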


Metadata
Title
A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD
Authors
Vasilios Kelefouras
Angeliki Kritikakou
Costas Goutis
Publication date
01.06.2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 3/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1098-9
