Published in: The Journal of Supercomputing 3/2014

01-06-2014

A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

Authors: Vasilios Kelefouras, Angeliki Kritikakou, Costas Goutis

Abstract

In this paper, a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a data cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the number of data cache accesses and misses across the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware parameters (e.g., data cache sizes and associativities) as a single problem rather than separately, giving high-quality solutions and a smaller search space.
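The abstract combines loop tiling matched to the cache hierarchy with SIMD vectorization. The following is a minimal illustrative sketch of that combination (cache-blocked single-precision C += A*B with SSE intrinsics), not the authors' implementation: the function name, the fixed tile sizes BI/BJ/BK, and the assumption that the matrix order n is a multiple of the tile sizes are placeholders, whereas the paper derives its tile sizes analytically from the cache sizes and associativities rather than fixing them.

/* Illustrative sketch only: blocked C += A * B, single precision, SSE.
 * Assumes row-major n x n matrices and n divisible by the tile sizes. */
#include <immintrin.h>
#include <stddef.h>

#define BI 64  /* illustrative tile sizes; the paper picks them from   */
#define BJ 64  /* the data cache sizes and associativities             */
#define BK 64

void mmm_blocked_sse(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BI)
      for (size_t kk = 0; kk < n; kk += BK)
        for (size_t jj = 0; jj < n; jj += BJ)
          /* one tile of work: the reused sub-blocks of A, B and C stay cached */
          for (size_t i = ii; i < ii + BI; i++)
            for (size_t k = kk; k < kk + BK; k++) {
              __m128 a = _mm_set1_ps(A[i * n + k]);      /* broadcast A(i,k) */
              for (size_t j = jj; j < jj + BJ; j += 4) {
                __m128 b = _mm_loadu_ps(&B[k * n + j]);  /* B(k, j..j+3)     */
                __m128 c = _mm_loadu_ps(&C[i * n + j]);
                c = _mm_add_ps(c, _mm_mul_ps(a, b));     /* 4 MACs per step  */
                _mm_storeu_ps(&C[i * n + j], c);
              }
            }
}

Blocking keeps each tile of A, B and C resident in cache while it is reused, which is the mechanism behind the reduction in data cache accesses and misses that the abstract describes; the SIMD loads and multiply-adds reduce the instruction count per element.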


Metadata
Title
A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD
Authors
Vasilios Kelefouras
Angeliki Kritikakou
Costas Goutis
Publication date
01-06-2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 3/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1098-9
