Published in: The Journal of Supercomputing 3/2014

01.06.2014

A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

Authors: Vasilios Kelefouras, Angeliki Kritikakou, Costas Goutis


Abstract

In this paper, a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the data cache accesses and misses in the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware parameters (e.g., data cache sizes and associativities) as one problem rather than separately, yielding high-quality solutions and a smaller search space.
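The core techniques the abstract alludes to (loop tiling matched to the data caches, and an inner-loop order that exposes data reuse and contiguous streams to the SIMD unit) can be sketched as follows. This is a minimal illustration, not the authors' tuned method: the matrix size `N`, tile size `TILE`, and the i-k-j loop order are assumptions chosen for clarity; in the paper the blocking parameters are derived from the actual cache sizes and associativities.

```c
#include <stdio.h>
#include <string.h>

#define N 64     /* matrix dimension (illustrative) */
#define TILE 16  /* tile size; in practice derived from cache parameters */

/* Blocked (tiled) matrix-matrix multiplication: C = A * B.
 * Tiling keeps a TILE x TILE working set of each array resident in
 * cache; the i-k-j inner order reuses A[i][k] across the whole j loop
 * and walks B and C rows contiguously, which lets the compiler
 * auto-vectorize the innermost loop with SIMD instructions. */
static void matmul_tiled(float A[N][N], float B[N][N], float C[N][N]) {
    memset(C, 0, sizeof(float) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        float a = A[i][k]; /* reused across the j loop */
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

Note that the tile size is the key hardware-dependent knob: auto-tuners such as ATLAS search for it empirically, whereas the methodology described here computes it analytically from the cache sizes and associativities.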


Metadata
Title
A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD
Authors
Vasilios Kelefouras
Angeliki Kritikakou
Costas Goutis
Publication date
01.06.2014
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 3/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1098-9
