Published in: The Journal of Supercomputing 3/2014

01-06-2014

A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

Authors: Vasilios Kelefouras, Angeliki Kritikakou, Costas Goutis

Abstract

In this paper, a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a data cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the number of data cache accesses and misses across the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware parameters (e.g., data cache sizes and associativities) as a single problem rather than separately, giving high-quality solutions and a smaller search space.
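The abstract combines loop tiling matched to the cache hierarchy with SIMD vectorization. The following is a minimal illustrative sketch of that combination (cache-blocked single-precision C += A*B with SSE intrinsics), not the authors' implementation: the function name, the fixed tile sizes BI/BJ/BK, and the assumption that the matrix order n is a multiple of the tile sizes are placeholders, whereas the paper derives its tile sizes analytically from the cache sizes and associativities rather than fixing them.

/* Illustrative sketch only: blocked C += A * B, single precision, SSE.
 * Assumes row-major n x n matrices and n divisible by the tile sizes. */
#include <immintrin.h>
#include <stddef.h>

#define BI 64  /* illustrative tile sizes; the paper picks them from   */
#define BJ 64  /* the data cache sizes and associativities             */
#define BK 64

void mmm_blocked_sse(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BI)
      for (size_t kk = 0; kk < n; kk += BK)
        for (size_t jj = 0; jj < n; jj += BJ)
          /* one tile of work: the reused sub-blocks of A, B and C stay cached */
          for (size_t i = ii; i < ii + BI; i++)
            for (size_t k = kk; k < kk + BK; k++) {
              __m128 a = _mm_set1_ps(A[i * n + k]);      /* broadcast A(i,k) */
              for (size_t j = jj; j < jj + BJ; j += 4) {
                __m128 b = _mm_loadu_ps(&B[k * n + j]);  /* B(k, j..j+3)     */
                __m128 c = _mm_loadu_ps(&C[i * n + j]);
                c = _mm_add_ps(c, _mm_mul_ps(a, b));     /* 4 MACs per step  */
                _mm_storeu_ps(&C[i * n + j], c);
              }
            }
}

Blocking keeps each tile of A, B and C resident in cache while it is reused, which is the mechanism behind the reduction in data cache accesses and misses that the abstract describes; the SIMD loads and multiply-adds reduce the instruction count per element.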


Metadata
Title
A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD
Authors
Vasilios Kelefouras
Angeliki Kritikakou
Costas Goutis
Publication date
01-06-2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 3/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1098-9
