Published in: The Journal of Supercomputing

Publication date: 07-03-2020

Integrating software and hardware hierarchies in an autotuning method for parallel routines in heterogeneous clusters

Authors: Jesús Cámara, Javier Cuenca, Domingo Giménez

Published in: The Journal of Supercomputing | Issue 12/2020

Abstract

A hierarchical approach for autotuning linear algebra routines on heterogeneous platforms is presented. Hierarchy helps to alleviate the difficulties of tuning parallel routines for high-performance computing systems. This paper analyzes the application of the hierarchical approach at both the hardware and software levels, using the basic matrix multiplication and the Strassen multiplication as proof of concept on multicore+coprocessor nodes. In this way, the hierarchical approach allows partial delegation of the efficient exploitation of the computing units in the node to the underlying direct autotuned matrix multiplication used in the base case.
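The software hierarchy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`direct_matmul`, `strassen`, `tune_base_size`) and the `base_size` parameter are hypothetical, the matrices are assumed to be square, and a plain triple loop stands in for the autotuned direct multiplication that would run on the multicore+coprocessor node. The point is only that the Strassen level chooses where to stop recursing, while the base case is delegated to a lower-level routine that is tuned separately.

```python
import time
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def direct_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop product; a stand-in for a tuned lower-level routine."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def _add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def _sub(a: Matrix, b: Matrix) -> Matrix:
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def _split(m: Matrix):
    """Return the four quadrants of a square matrix with even dimension."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a: Matrix, b: Matrix, base_size: int,
             base_mult: Callable[[Matrix, Matrix], Matrix] = direct_matmul) -> Matrix:
    """Strassen product of square matrices.

    Recursion stops when the dimension is at most base_size (or is odd);
    the multiplication is then delegated to base_mult.
    """
    n = len(a)
    if n <= base_size or n % 2:
        return base_mult(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)

    def rec(x: Matrix, y: Matrix) -> Matrix:
        return strassen(x, y, base_size, base_mult)

    # The seven Strassen products.
    m1 = rec(_add(a11, a22), _add(b11, b22))
    m2 = rec(_add(a21, a22), b11)
    m3 = rec(a11, _sub(b12, b22))
    m4 = rec(a22, _sub(b21, b11))
    m5 = rec(_add(a11, a12), b22)
    m6 = rec(_sub(a21, a11), _add(b11, b12))
    m7 = rec(_sub(a12, a22), _add(b21, b22))
    # Combine the products into the four result quadrants.
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


def tune_base_size(n: int, candidates: Sequence[int]) -> int:
    """Time each candidate base size on an n x n problem and return the fastest.

    A single timed run per candidate is a simplification chosen for brevity.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    best, best_time = candidates[0], float("inf")
    for size in candidates:
        start = time.perf_counter()
        strassen(a, a, size)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = size, elapsed
    return best


if __name__ == "__main__":
    print("selected base size:", tune_base_size(32, [4, 8, 16]))
```

In this sketch the upper level only searches over `base_size`; how the base-case product uses the computing units of the node is left entirely to whatever routine is passed as `base_mult`, which mirrors the partial delegation the abstract mentions.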

Title: Integrating software and hardware hierarchies in an autotuning method for parallel routines in heterogeneous clusters
Authors: Jesús Cámara, Javier Cuenca, Domingo Giménez
Publication date: 07-03-2020
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 12/2020
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-020-03235-9
