nach oben

The Journal of Supercomputing

Erschienen in:

01.07.2015

A methodology for speeding up matrix vector multiplication for single/multi-core architectures

verfasst von: Vasilios Kelefouras, Angeliki Kritikakou, Elissavet Papadima, Costas Goutis

Erschienen in: The Journal of Supercomputing | Ausgabe 7/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

In this paper, a new methodology for computing the Dense Matrix Vector Multiplication, for both embedded (processors without SIMD unit) and general purpose processors (single and multi-core processors, with SIMD unit), is presented. This methodology achieves higher execution speed than ATLAS state-of-the-art library (speedup from 1.2 up to 1.45). This is achieved by fully exploiting the combination of the software (e.g., data reuse) and hardware parameters (e.g., data cache associativity) which are considered simultaneously as one problem and not separately, giving a smaller search space and high-quality solutions. The proposed methodology produces a different schedule for different values of the (i) number of the levels of data cache; (ii) data cache sizes; (iii) data cache associativities; (iv) data cache and main memory latencies; (v) data array layout of the matrix and (vi) number of cores.

Vorheriger Artikel Optimizing the configuration of combined high-radix switches

Nächster Artikel Reversible logic based multiplication computing unit using binary tree data structure

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Whaley RC, Petitet A (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw: Pract Exp 35(2):101–121

OpenBlas (2012). http://xianyi.github.com/OpenBLAS

Krivutsenko A (2008) GotoBLAS—anatomy of a fast matrix multiplication. Technical report

Guennebaud G, Jacob B et al (2010) Eigen v3. http://eigen.tuxfamily.org

Intel: Intel MKL (2012). http://software.intel.com/en-us/intel-mkl

Bilmes J, Asanović K, Chin C, Demmel J (1997) Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In: Proceedings of the international conference on supercomputing. ACM SIGARC, Vienna, Austria

Frigo M, Johnson SG (1997) The fastest Fourier transform in the west. Technical report. Cambridge, MA, USA

Milder P, Franchetti F, Hoe JC, Püschel M (2012) Computer generation of hardware for linear digital signal processing transforms. ACM Trans Des Autom Electron Syst 17(2) 15:1–15:33. doi:10.1145/2159542.2159547

Pinter SS (1996) Register allocation with instruction scheduling: a new approach. J Prog Lang 4(1):21–38

10.

Shobaki G, Shawabkeh M, Rmaileh NEA (2008) Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach. ACM Trans Archit Code Optim 10(3):14:1–14:31. doi:10.1145/2512432

11.

Bacon DF, Graham SL, Sharp OJ (1994) Compiler transformations for high-performance computing. ACM Comput Surv 26(4):345–420. doi:10.1145/197405.197406

12.

Granston E, Holler A (2001) Automatic recommendation of compiler options. In: Proceedings of the workshop on feedback-directed and dynamic optimization (FDDO)

13.

Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization, CGO ’03, pp 204–215. IEEE Computer Society, Washington, DC, USA. http://dl.acm.org/citation.cfm?id=776261.776284

14.

Cooper KD, Subramanian D, Torczon L (2002) Adaptive optimizing compilers for the 21st century. J Supercomput 23(1):7–22. doi:10.1023/A:1015729001611

15.

Kisuki T, Knijnenburg PMW, O’Boyle MFP, Bodin F, Wijshoff HAG (1999) A feasibility study in iterative compilation. In: Proceedings of the 2nd international symposium on high performance computing, ISHPC ’99, pp 121–132. Springer-Verlag, London, UK. http://dl.acm.org/citation.cfm?id=646347.690219

16.

Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1:1–1:36. doi:10.1145/1509864.1509865

17.

Kulkarni P, Hines S, Hiser J, Whalley D, Davidson J, Jones D (2004) Fast searches for effective optimization phase sequences. SIGPLAN Not 39(6):171–182. doi:10.1145/996893.996863 CrossRef

18.

Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on compilers, architectures and synthesis for embedded systems, CASES ’11, pp 65–74. ACM, New York, NY, USA. doi:10.1145/2038698.2038711

19.

Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th international conference on artificial intelligence: methodology, systems, and applications, AIMSA ’02, pp 41–50. Springer-Verlag, London, UK. http://dl.acm.org/citation.cfm?id=646053.677574

20.

Stephenson M, Amarasinghe S, Martin M, O’Reilly UM (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90 (2003). doi:10.1145/780822.781141

21.

Tartara M, Crespi Reghizzi S (2013) Continuous learning of compiler heuristics. ACM Trans Archit Code Optim 9(4):46:1–46:25. doi:10.1145/2400682.2400705

22.

Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the international symposium on code generation and optimization, CGO ’06, pp 295–305. IEEE Computer Society, Washington, DC, USA. doi:10.1109/CGO.2006.37

23.

Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not 42(6):89–100. doi:10.1145/1273442.1250746

24.

Simplescalar CI, Burger D, Austin TM (1997) The SimpleScalar tool set, version 2.0. Technical report

25.

Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput 27(1–2):3–35MATHCrossRef

26.

Whaley RC, Dongarra J (1999) Automatically tuned linear algebra software. In: 9th SIAM conference on parallel processing for scientific computing. CD-ROM Proceedings

27.

Whaley RC, Dongarra J (1998) Automatically tuned linear algebra software. In: SuperComputing 1998: high performance networking and computing

28.

Whaley RC, Dongarra J (1997) Automatically tuned linear algebra software. Technical report. UT-CS-97-366, University of Tennessee

29.

See homepage for details: ATLAS homepage (2012). http://math-atlas.sourceforge.net/

30.

Fujimoto N (2008) Dense matrix–vector multiplication on the CUDA architecture. Parallel Process Lett 18(4):511–530MathSciNetCrossRef

31.

Fujimoto N (2008) Faster matrix–vector multiplication on GeForce 8800GTX. In: IPDPS, pp 1–8. IEEE. http://dblp.uni-trier.de/db/conf/ipps/ipdps2008.html

32.

Hendrickson B, Leland R, Plimpton S (1995) An efficient parallel algorithm for matrix–vector multiplication. Int J High Speed Comput 7:73–88CrossRef

33.

Sørensen HHB (2012) High-performance matrix–vector multiplication on the GPU. In: Proceedings of the 2011 international conference on parallel processing, Euro-Par’11, pp 377–386. Springer-Verlag, Berlin, Heidelberg

34.

Zhang N (2012) A novel parallel scan for multicore processors and its application in sparse matrix–vector multiplication. IEEE Trans Parallel Distrib Syst 23(3):397–404. doi:10.1109/TPDS.2011.174 CrossRef

35.

Williams S, Oliker L, Vuduc R, Shalf J, Yelick K, Demmel J (2007) Optimization of sparse matrix–vector multiplication on emerging multicore platforms. In: Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC ’07, pp 38:1–38:12. ACM, New York, NY, USA. doi:10.1145/1362622.1362674

36.

Goumas G, Kourtis K, Anastopoulos N, Karakasis V, Koziris N (2009) Performance evaluation of the sparse matrix–vector multiplication on modern architectures. J Supercomput 50(1):36–77. doi:10.1007/s11227-008-0251-8

37.

Michailidis PD, Margaritis KG (2010) Performance models for matrix computations on multicore processors using OpenMP. In: Proceedings of the 2010 international conference on parallel and distributed computing. Applications and Technologies, PDCAT ’10, pp 375–380. IEEE Computer Society, Washington, DC, USA. doi:10.1109/PDCAT.2010.52

38.

Schmollinger M, Kaufmann M (2002) Algorithms for SMP-clusters dense matrix–vector multiplication. In: Proceedings of the 16th international parallel and distributed processing Sysmposium, IPDPS ’02, pp 57–. IEEE Computer Society, Washington, DC, USA. http://dl.acm.org/citation.cfm?id=645610.661893

39.

Waghmare VN, Kendre SV, Chordiya SG (2011) Article: performance analysis of matrix–vector multiplication in hybrid (MPI + OpenMP). Int J Comput Appl 22(5):22–25. Published by Foundation of Computer Science

40.

Baker AH, Schulz M, Yang UM (2011) On the performance of an algebraic multigrid solver on multicore clusters. In: Proceedings of the 9th international conference on high performance computing for computational science, VECPAR’10, pp 102–115. Springer-Verlag, Berlin, Heidelberg. http://dl.acm.org/citation.cfm?id=1964238.1964252

41.

Parallel methods for matrix–vector multiplication. http://www.hpcc.unn.ru/mskurs/ENG/DOC/pp07.pdf

42.

Bhandarkar SM, Arabnia HR (1995) The REFINE multiprocessor—theoretical properties and algorithms. Parallel Comput 21(11):1783–1805CrossRef

43.

Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box. pp 349–357

44.

Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62MATHCrossRef

45.

Arabnia HR (1990) A parallel algorithm for the arbitrary rotation of digitized images using process-and-data-decomposition approach. J Parallel Distrib Comput 10(2):188–192CrossRef

46.

Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Comput Graph Forum 8(1):3–11CrossRef

47.

Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114CrossRef

48.

Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitised images. Comput J 30(5):425–432CrossRef

49.

Arabnia HR, Bhandarkar SM (1996) Parallel stereocorrelation on a reconfigurable multi-ring network. J Supercomput 10(3):243–269MATHCrossRef

50.

Arabnia HR, Oliver MA (1987) Arbitrary rotation of raster images with SIMD machine architectures. Comput Graph Forum 6(1):3–11CrossRef

51.

Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image processing and computer vision. Int J Pattern Recognit Artif Intell 9(02):201–229CrossRef

52.

Arabnia H (1995) A distributed stereocorrelation algorithm. In: Computer communications and networks, 1995. Proceedings, 4th international conference on, pp 479–482, IEEE

53.

Intel core 2 duo processor E6550. http://ark.intel.com/Product.aspx?id=30783

54.

Intel core 2 duo processor T6600. http://ark.intel.com/products/37255/Intel-Core2-Duo-Processor-T6600

55.

Intel i7-2600K Processor. http://ark.intel.com/products/52214

Titel: A methodology for speeding up matrix vector multiplication for single/multi-core architectures
verfasst von: Vasilios Kelefouras
Angeliki Kritikakou
Elissavet Papadima
Costas Goutis
Publikationsdatum: 01.07.2015
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 7/2015
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-015-1409-9

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 7/2015

Accelerating MRI reconstruction via three-dimensional dual-dictionary learning using CUDA

A fast and energy-efficient Hamming decoder for software-defined radio using graphics processing units

Energy calculation for periodic multi-core scheduling in system thermal steady state with consideration of leakage and temperature dependency

2-Disjoint-path-coverable panconnectedness of crossed cubes

High-performance parallel frequent subgraph discovery

Evaluation efficiency of large-scale data set with negative data: an artificial neural network approach

Premium Partner