Skip to main content
Erschienen in: The Journal of Supercomputing 7/2015

01.07.2015

A methodology for speeding up matrix vector multiplication for single/multi-core architectures

verfasst von: Vasilios Kelefouras, Angeliki Kritikakou, Elissavet Papadima, Costas Goutis

Erschienen in: The Journal of Supercomputing | Ausgabe 7/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In this paper, a new methodology for computing the Dense Matrix Vector Multiplication, for both embedded (processors without SIMD unit) and general purpose processors (single and multi-core processors, with SIMD unit), is presented. This methodology achieves higher execution speed than ATLAS state-of-the-art library (speedup from 1.2 up to 1.45). This is achieved by fully exploiting the combination of the software (e.g., data reuse) and hardware parameters (e.g., data cache associativity) which are considered simultaneously as one problem and not separately, giving a smaller search space and high-quality solutions. The proposed methodology produces a different schedule for different values of the (i) number of the levels of data cache; (ii) data cache sizes; (iii) data cache associativities; (iv) data cache and main memory latencies; (v) data array layout of the matrix and (vi) number of cores.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Whaley RC, Petitet A (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw: Pract Exp 35(2):101–121 Whaley RC, Petitet A (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw: Pract Exp 35(2):101–121
3.
Zurück zum Zitat Krivutsenko A (2008) GotoBLAS—anatomy of a fast matrix multiplication. Technical report Krivutsenko A (2008) GotoBLAS—anatomy of a fast matrix multiplication. Technical report
6.
Zurück zum Zitat Bilmes J, Asanović K, Chin C, Demmel J (1997) Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In: Proceedings of the international conference on supercomputing. ACM SIGARC, Vienna, Austria Bilmes J, Asanović K, Chin C, Demmel J (1997) Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In: Proceedings of the international conference on supercomputing. ACM SIGARC, Vienna, Austria
7.
Zurück zum Zitat Frigo M, Johnson SG (1997) The fastest Fourier transform in the west. Technical report. Cambridge, MA, USA Frigo M, Johnson SG (1997) The fastest Fourier transform in the west. Technical report. Cambridge, MA, USA
8.
Zurück zum Zitat Milder P, Franchetti F, Hoe JC, Püschel M (2012) Computer generation of hardware for linear digital signal processing transforms. ACM Trans Des Autom Electron Syst 17(2) 15:1–15:33. doi:10.1145/2159542.2159547 Milder P, Franchetti F, Hoe JC, Püschel M (2012) Computer generation of hardware for linear digital signal processing transforms. ACM Trans Des Autom Electron Syst 17(2) 15:1–15:33. doi:10.​1145/​2159542.​2159547
9.
Zurück zum Zitat Pinter SS (1996) Register allocation with instruction scheduling: a new approach. J Prog Lang 4(1):21–38 Pinter SS (1996) Register allocation with instruction scheduling: a new approach. J Prog Lang 4(1):21–38
10.
Zurück zum Zitat Shobaki G, Shawabkeh M, Rmaileh NEA (2008) Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach. ACM Trans Archit Code Optim 10(3):14:1–14:31. doi:10.1145/2512432 Shobaki G, Shawabkeh M, Rmaileh NEA (2008) Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach. ACM Trans Archit Code Optim 10(3):14:1–14:31. doi:10.​1145/​2512432
11.
12.
Zurück zum Zitat Granston E, Holler A (2001) Automatic recommendation of compiler options. In: Proceedings of the workshop on feedback-directed and dynamic optimization (FDDO) Granston E, Holler A (2001) Automatic recommendation of compiler options. In: Proceedings of the workshop on feedback-directed and dynamic optimization (FDDO)
13.
Zurück zum Zitat Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization, CGO ’03, pp 204–215. IEEE Computer Society, Washington, DC, USA. http://dl.acm.org/citation.cfm?id=776261.776284 Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization, CGO ’03, pp 204–215. IEEE Computer Society, Washington, DC, USA. http://​dl.​acm.​org/​citation.​cfm?​id=​776261.​776284
16.
Zurück zum Zitat Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1:1–1:36. doi:10.1145/1509864.1509865 Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1:1–1:36. doi:10.​1145/​1509864.​1509865
17.
18.
Zurück zum Zitat Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on compilers, architectures and synthesis for embedded systems, CASES ’11, pp 65–74. ACM, New York, NY, USA. doi:10.1145/2038698.2038711 Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on compilers, architectures and synthesis for embedded systems, CASES ’11, pp 65–74. ACM, New York, NY, USA. doi:10.​1145/​2038698.​2038711
19.
Zurück zum Zitat Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th international conference on artificial intelligence: methodology, systems, and applications, AIMSA ’02, pp 41–50. Springer-Verlag, London, UK. http://dl.acm.org/citation.cfm?id=646053.677574 Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th international conference on artificial intelligence: methodology, systems, and applications, AIMSA ’02, pp 41–50. Springer-Verlag, London, UK. http://​dl.​acm.​org/​citation.​cfm?​id=​646053.​677574
20.
Zurück zum Zitat Stephenson M, Amarasinghe S, Martin M, O’Reilly UM (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90 (2003). doi:10.1145/780822.781141 Stephenson M, Amarasinghe S, Martin M, O’Reilly UM (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90 (2003). doi:10.​1145/​780822.​781141
22.
Zurück zum Zitat Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the international symposium on code generation and optimization, CGO ’06, pp 295–305. IEEE Computer Society, Washington, DC, USA. doi:10.1109/CGO.2006.37 Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the international symposium on code generation and optimization, CGO ’06, pp 295–305. IEEE Computer Society, Washington, DC, USA. doi:10.​1109/​CGO.​2006.​37
24.
Zurück zum Zitat Simplescalar CI, Burger D, Austin TM (1997) The SimpleScalar tool set, version 2.0. Technical report Simplescalar CI, Burger D, Austin TM (1997) The SimpleScalar tool set, version 2.0. Technical report
25.
Zurück zum Zitat Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput 27(1–2):3–35MATHCrossRef Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput 27(1–2):3–35MATHCrossRef
26.
Zurück zum Zitat Whaley RC, Dongarra J (1999) Automatically tuned linear algebra software. In: 9th SIAM conference on parallel processing for scientific computing. CD-ROM Proceedings Whaley RC, Dongarra J (1999) Automatically tuned linear algebra software. In: 9th SIAM conference on parallel processing for scientific computing. CD-ROM Proceedings
27.
Zurück zum Zitat Whaley RC, Dongarra J (1998) Automatically tuned linear algebra software. In: SuperComputing 1998: high performance networking and computing Whaley RC, Dongarra J (1998) Automatically tuned linear algebra software. In: SuperComputing 1998: high performance networking and computing
28.
Zurück zum Zitat Whaley RC, Dongarra J (1997) Automatically tuned linear algebra software. Technical report. UT-CS-97-366, University of Tennessee Whaley RC, Dongarra J (1997) Automatically tuned linear algebra software. Technical report. UT-CS-97-366, University of Tennessee
30.
Zurück zum Zitat Fujimoto N (2008) Dense matrix–vector multiplication on the CUDA architecture. Parallel Process Lett 18(4):511–530MathSciNetCrossRef Fujimoto N (2008) Dense matrix–vector multiplication on the CUDA architecture. Parallel Process Lett 18(4):511–530MathSciNetCrossRef
32.
Zurück zum Zitat Hendrickson B, Leland R, Plimpton S (1995) An efficient parallel algorithm for matrix–vector multiplication. Int J High Speed Comput 7:73–88CrossRef Hendrickson B, Leland R, Plimpton S (1995) An efficient parallel algorithm for matrix–vector multiplication. Int J High Speed Comput 7:73–88CrossRef
33.
Zurück zum Zitat Sørensen HHB (2012) High-performance matrix–vector multiplication on the GPU. In: Proceedings of the 2011 international conference on parallel processing, Euro-Par’11, pp 377–386. Springer-Verlag, Berlin, Heidelberg Sørensen HHB (2012) High-performance matrix–vector multiplication on the GPU. In: Proceedings of the 2011 international conference on parallel processing, Euro-Par’11, pp 377–386. Springer-Verlag, Berlin, Heidelberg
34.
Zurück zum Zitat Zhang N (2012) A novel parallel scan for multicore processors and its application in sparse matrix–vector multiplication. IEEE Trans Parallel Distrib Syst 23(3):397–404. doi:10.1109/TPDS.2011.174 CrossRef Zhang N (2012) A novel parallel scan for multicore processors and its application in sparse matrix–vector multiplication. IEEE Trans Parallel Distrib Syst 23(3):397–404. doi:10.​1109/​TPDS.​2011.​174 CrossRef
35.
Zurück zum Zitat Williams S, Oliker L, Vuduc R, Shalf J, Yelick K, Demmel J (2007) Optimization of sparse matrix–vector multiplication on emerging multicore platforms. In: Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC ’07, pp 38:1–38:12. ACM, New York, NY, USA. doi:10.1145/1362622.1362674 Williams S, Oliker L, Vuduc R, Shalf J, Yelick K, Demmel J (2007) Optimization of sparse matrix–vector multiplication on emerging multicore platforms. In: Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC ’07, pp 38:1–38:12. ACM, New York, NY, USA. doi:10.​1145/​1362622.​1362674
36.
Zurück zum Zitat Goumas G, Kourtis K, Anastopoulos N, Karakasis V, Koziris N (2009) Performance evaluation of the sparse matrix–vector multiplication on modern architectures. J Supercomput 50(1):36–77. doi:10.1007/s11227-008-0251-8 Goumas G, Kourtis K, Anastopoulos N, Karakasis V, Koziris N (2009) Performance evaluation of the sparse matrix–vector multiplication on modern architectures. J Supercomput 50(1):36–77. doi:10.​1007/​s11227-008-0251-8
37.
Zurück zum Zitat Michailidis PD, Margaritis KG (2010) Performance models for matrix computations on multicore processors using OpenMP. In: Proceedings of the 2010 international conference on parallel and distributed computing. Applications and Technologies, PDCAT ’10, pp 375–380. IEEE Computer Society, Washington, DC, USA. doi:10.1109/PDCAT.2010.52 Michailidis PD, Margaritis KG (2010) Performance models for matrix computations on multicore processors using OpenMP. In: Proceedings of the 2010 international conference on parallel and distributed computing. Applications and Technologies, PDCAT ’10, pp 375–380. IEEE Computer Society, Washington, DC, USA. doi:10.​1109/​PDCAT.​2010.​52
39.
Zurück zum Zitat Waghmare VN, Kendre SV, Chordiya SG (2011) Article: performance analysis of matrix–vector multiplication in hybrid (MPI + OpenMP). Int J Comput Appl 22(5):22–25. Published by Foundation of Computer Science Waghmare VN, Kendre SV, Chordiya SG (2011) Article: performance analysis of matrix–vector multiplication in hybrid (MPI + OpenMP). Int J Comput Appl 22(5):22–25. Published by Foundation of Computer Science
40.
Zurück zum Zitat Baker AH, Schulz M, Yang UM (2011) On the performance of an algebraic multigrid solver on multicore clusters. In: Proceedings of the 9th international conference on high performance computing for computational science, VECPAR’10, pp 102–115. Springer-Verlag, Berlin, Heidelberg. http://dl.acm.org/citation.cfm?id=1964238.1964252 Baker AH, Schulz M, Yang UM (2011) On the performance of an algebraic multigrid solver on multicore clusters. In: Proceedings of the 9th international conference on high performance computing for computational science, VECPAR’10, pp 102–115. Springer-Verlag, Berlin, Heidelberg. http://​dl.​acm.​org/​citation.​cfm?​id=​1964238.​1964252
42.
Zurück zum Zitat Bhandarkar SM, Arabnia HR (1995) The REFINE multiprocessor—theoretical properties and algorithms. Parallel Comput 21(11):1783–1805CrossRef Bhandarkar SM, Arabnia HR (1995) The REFINE multiprocessor—theoretical properties and algorithms. Parallel Comput 21(11):1783–1805CrossRef
43.
Zurück zum Zitat Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box. pp 349–357 Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box. pp 349–357
44.
Zurück zum Zitat Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62MATHCrossRef Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62MATHCrossRef
45.
Zurück zum Zitat Arabnia HR (1990) A parallel algorithm for the arbitrary rotation of digitized images using process-and-data-decomposition approach. J Parallel Distrib Comput 10(2):188–192CrossRef Arabnia HR (1990) A parallel algorithm for the arbitrary rotation of digitized images using process-and-data-decomposition approach. J Parallel Distrib Comput 10(2):188–192CrossRef
46.
Zurück zum Zitat Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Comput Graph Forum 8(1):3–11CrossRef Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Comput Graph Forum 8(1):3–11CrossRef
47.
Zurück zum Zitat Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114CrossRef Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114CrossRef
48.
Zurück zum Zitat Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitised images. Comput J 30(5):425–432CrossRef Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitised images. Comput J 30(5):425–432CrossRef
49.
Zurück zum Zitat Arabnia HR, Bhandarkar SM (1996) Parallel stereocorrelation on a reconfigurable multi-ring network. J Supercomput 10(3):243–269MATHCrossRef Arabnia HR, Bhandarkar SM (1996) Parallel stereocorrelation on a reconfigurable multi-ring network. J Supercomput 10(3):243–269MATHCrossRef
50.
Zurück zum Zitat Arabnia HR, Oliver MA (1987) Arbitrary rotation of raster images with SIMD machine architectures. Comput Graph Forum 6(1):3–11CrossRef Arabnia HR, Oliver MA (1987) Arbitrary rotation of raster images with SIMD machine architectures. Comput Graph Forum 6(1):3–11CrossRef
51.
Zurück zum Zitat Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image processing and computer vision. Int J Pattern Recognit Artif Intell 9(02):201–229CrossRef Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image processing and computer vision. Int J Pattern Recognit Artif Intell 9(02):201–229CrossRef
52.
Zurück zum Zitat Arabnia H (1995) A distributed stereocorrelation algorithm. In: Computer communications and networks, 1995. Proceedings, 4th international conference on, pp 479–482, IEEE Arabnia H (1995) A distributed stereocorrelation algorithm. In: Computer communications and networks, 1995. Proceedings, 4th international conference on, pp 479–482, IEEE
Metadaten
Titel
A methodology for speeding up matrix vector multiplication for single/multi-core architectures
verfasst von
Vasilios Kelefouras
Angeliki Kritikakou
Elissavet Papadima
Costas Goutis
Publikationsdatum
01.07.2015
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 7/2015
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-015-1409-9

Weitere Artikel der Ausgabe 7/2015

The Journal of Supercomputing 7/2015 Zur Ausgabe

Premium Partner