
2010 | Book

Software Automatic Tuning

From Concepts to State-of-the-Art Results

Edited by: Ken Naono, Keita Teranishi, John Cavazos, Reiji Suda

Publisher: Springer New York


About this book

Automatic performance tuning is a new software paradigm that enables software to achieve high performance in any computing environment. Its methodologies have been developed over the past decade, and it is now rapidly growing in scope and applicability, as well as in its scientific knowledge and technological methods. Software developers and researchers in scientific and technical computing, high-performance database systems, optimizing compilers, high-performance systems software, and low-power computing will find this book an invaluable reference to this powerful new paradigm.

Table of Contents

Frontmatter

Introduction

Frontmatter
Chapter 1. Software Automatic Tuning: Concepts and State-of-the-Art Results
Abstract
This book is dedicated to software automatic tuning. Software automatic tuning has several crucial advantages over previously known related technologies, as has been widely recognized since the 1990s. Since then, research and development on software automatic tuning have been conducted with increasing intensity and scope.
Reiji Suda, Ken Naono, Keita Teranishi, John Cavazos

Achievements in Scientific Computing

Frontmatter
Chapter 2. ATLAS Version 3.9: Overview and Status
Abstract
This paper describes the widely used ATLAS (Automatically Tuned Linear Algebra Software) project as it stands today. ATLAS is an instantiation of a paradigm in high performance library production and maintenance, which we term AEOS (Automated Empirical Optimization of Software); this style of library management has been created to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore’s Law. ATLAS is the application of this AEOS paradigm to dense linear algebra software. ATLAS produces a full BLAS (Basic Linear Algebra Subprograms) library and also provides some optimized routines for LAPACK (Linear Algebra PACKage). This paper gives an overview of what ATLAS is and how it works, highlights some of the recent improvements available as of version 3.9.23, and discusses some of the current challenges and future work.
R. Clint Whaley
Chapter 3. Autotuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS
Abstract
High-performance routines of Basic Linear Algebra Subprograms (BLAS) are frequently required in the field of numerical calculations. We have implemented Dynamically Load-balanced BLAS (DL-BLAS) to enhance the performance of BLAS when other tasks use the CPU resources of multi-core CPU architectures. DL-BLAS tiles matrices into submatrices to construct subtasks and dynamically assigns tasks to CPU cores. We found that the dimensions of the submatrices used in DL-BLAS affect the performance. To achieve high performance, we must solve an optimization problem in which the variables are the dimensions of the submatrices. The search space of the optimization problem is so vast that an exhaustive search is unrealistic. We propose an autotuning search algorithm that consists of Diagonal Search, Reductive Search, and Parameter Selection. The proposed autotuning algorithm provides semioptimal parameters in realistic computing time. Using the proposed algorithm, we obtain parameters that provide the best performance in most cases. As a result, in several performance evaluation tests, DL-BLAS achieved a higher performance than ATLAS or GotoBLAS.
Yuta Sawa, Reiji Suda
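To make the flavor of such a parameter search concrete, the sketch below is an editorial illustration (not the chapter's Diagonal Search/Reductive Search algorithm): it probes square submatrix dimensions first and then refines one dimension around the best square block. The `benchmark` helper, the `kernel` callback, and the candidate sizes are all assumed placeholders.

```python
# Minimal sketch of an empirical block-size search (illustration only; not the
# DL-BLAS algorithm). kernel(m, n) is a user-supplied routine that runs the
# tiled computation with an m-by-n submatrix size.
import time

def benchmark(kernel, m, n, trials=3):
    """Best wall-clock time of kernel(m, n) over a few trials."""
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        kernel(m, n)
        best = min(best, time.perf_counter() - t0)
    return best

def diagonal_then_refine(kernel, sizes):
    """Phase 1: probe square blocks (m == n). Phase 2: vary one dimension
    around the best square block and keep the fastest (m, n) pair."""
    best_sq = min(sizes, key=lambda s: benchmark(kernel, s, s))
    candidates = [(best_sq, n) for n in sizes] + [(m, best_sq) for m in sizes]
    return min(candidates, key=lambda mn: benchmark(kernel, *mn))
```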
Chapter 4. Automatic Tuning for Parallel FFTs
Abstract
In this paper, we propose an implementation of parallel fast Fourier transforms (FFTs) with automatic performance tuning on distributed-memory parallel computers. A blocking algorithm for parallel FFTs utilizes cache memory effectively. Since the optimal block size may depend on the problem size, we propose a method to determine the block size that minimizes the number of cache misses. In addition, parallel FFTs require intensive all-to-all communication, which affects the performance of FFTs. Automatic tuning of all-to-all communication is also implemented. The performance results demonstrate that the proposed implementation of parallel FFTs with automatic performance tuning effectively improves performance.
Daisuke Takahashi
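As a rough illustration of block-size selection (an editorial sketch under assumed candidate sizes, not the cache-miss-minimizing model used in the chapter), one can simply time a blocked pass for each candidate size and keep the fastest:

```python
# Illustrative sketch: empirically choose a block size for a blocked FFT pass.
# Timing NumPy FFTs over contiguous blocks is a crude stand-in for the
# chapter's cache-miss model; the candidate sizes are assumptions.
import time
import numpy as np

def timed_blocked_fft(x, block):
    n = len(x)
    t0 = time.perf_counter()
    for start in range(0, n, block):
        np.fft.fft(x[start:start + block])   # transform each block
    return time.perf_counter() - t0

def choose_block_size(n, candidates=(256, 512, 1024, 2048, 4096)):
    x = np.random.rand(n) + 1j * np.random.rand(n)
    return min((b for b in candidates if b <= n),
               key=lambda b: timed_blocked_fft(x, b))
```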
Chapter 5. Dynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions
Abstract
In this chapter, we survey several approaches to optimizing the blocking strategy for basic matrix decompositions, such as LU, Cholesky, and QR. Conventional blocking strategies such as fixed-size blocking and recursive blocking are widely used to optimize the performance of these decompositions. However, these strategies have only a small number of parameters such as the block size or the level of recursion and are not sufficiently flexible to exploit the performance of modern high-performance architectures. As such, several attempts have been made to define a much larger class of strategies and to choose the best strategy among them according to the target machine and the matrix size. The number of candidate strategies is usually exponential in the size of the matrix. However, with the use of dynamic programming, the cost of optimization can be reduced to a realistic level. As representatives of such approaches, we survey variable-size blocking, generalized recursive blocking, and the combination of variable-size blocking and the TSQR algorithm. Directions for future research are also discussed.
Yusaku Yamamoto, Takeshi Fukaya
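The dynamic programming idea can be sketched generically as follows. This is an editorial illustration: the cost function `cost(i, b)`, assumed to give the measured or modeled time of a width-b panel starting at column i, is a placeholder for the performance models developed in the chapter.

```python
# Illustrative sketch: choose a variable-size blocking by dynamic programming.
# best(i) returns the cheapest way to process columns i..n-1 as a sequence of
# panels whose widths are drawn from block_sizes.
from functools import lru_cache

def optimal_blocking(n, block_sizes, cost):
    @lru_cache(maxsize=None)
    def best(i):
        if i >= n:
            return 0.0, ()
        options = []
        for b in block_sizes:
            b = min(b, n - i)                      # last panel may be narrower
            tail_cost, tail = best(i + b)
            options.append((cost(i, b) + tail_cost, (b,) + tail))
        return min(options)

    total, blocks = best(0)
    return total, list(blocks)
```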
Chapter 6. Automatic Tuning of the Division Number in the Multiple Division Divide-and-Conquer for Real Symmetric Eigenproblem
Abstract
In this study, we propose a novel algorithm that estimates the optimal division number in the multiple division divide-and-conquer (DCk) for the real symmetric tridiagonal eigenproblem. Using the proposed algorithm, we establish an automatically tuned DCk algorithm (ATDCk), in which the quasi-optimal division number used in DCk is automatically determined by a simple but efficient pre-test on the input matrix, prior to the main computation of the eigenvalues and eigenvectors. The efficiency of ATDCk is confirmed by numerical experiments covering a wide range of test matrices, including physical and statistical models as well as matrices from the Matrix Market. A comparison with the LAPACK routine DSTEVD, which is based on the usual divide-and-conquer with division number 2 (DC2), is also made.
Yusuke Ishikawa, Junichi Tamura, Yutaka Kuwajima, Takaomi Shigehara
Chapter 7. Automatically Tuned Mixed-Precision Conjugate Gradient Solver
Abstract
Linear algebra solvers such as the conjugate gradient method are at the base of numerous scientific and industrial applications. Among these, Krylov-type iterative solvers are highly memory bound, meaning that their bottleneck lies in the transfer of data from memory. The tremendous computational power that current multicore processors and hardware accelerators, such as the graphics processing unit (GPU), are capable of puts a great strain on interconnects, both external (e.g., networks) and internal (e.g., memory and PCIe buses). By using lower precision for most of the computations, the technique of iterative refinement makes it possible to reduce the amount of data transferred at all levels, thus increasing computational performance. Moreover, it makes possible the use of more cost-effective accelerators, such as cheap GPUs that lack native double-precision support, for tasks requiring double-precision accuracy. In this chapter, we propose two heuristics for automatically setting the two parameters needed for iterative refinement: the inner residual reduction target and the stopping criterion. Although the heuristics prove effective for most matrices in our test collection, we find many cases in which the increase in iterations due to restarting the solver leads to an actual decrease in performance.
Serban Georgescu, Hiroshi Okuda
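For readers unfamiliar with the technique, here is a minimal mixed-precision iterative refinement loop, written as an editorial sketch for a symmetric positive definite system. The `inner_tol` and `outer_tol` values are stand-ins for the two parameters that the chapter's heuristics set automatically, and the inner solver is a plain single-precision CG rather than the chapter's tuned implementation.

```python
# Illustrative sketch of mixed-precision iterative refinement: the inner
# correction solve runs in single precision, the residual update in double
# precision. Assumes A is symmetric positive definite.
import numpy as np

def cg32(A32, r32, tol):
    """Small single-precision CG used as the inner correction solve."""
    x = np.zeros_like(r32)
    r = r32.copy()
    p = r.copy()
    rs = float(r @ r)
    target = tol * np.sqrt(rs)
    for _ in range(len(r32)):
        Ap = A32 @ p
        alpha = rs / float(p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = float(r @ r)
        if np.sqrt(rs_new) <= target:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def refined_cg(A, b, inner_tol=1e-4, outer_tol=1e-10, max_outer=50):
    x = np.zeros_like(b)
    A32 = A.astype(np.float32)
    for _ in range(max_outer):
        r = b - A @ x                                   # double-precision residual
        if np.linalg.norm(r) <= outer_tol * np.linalg.norm(b):
            break
        d = cg32(A32, r.astype(np.float32), inner_tol)  # single-precision inner solve
        x = x + d.astype(np.float64)
    return x
```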
Chapter 8. Automatically Tuned Sparse Eigensolvers
Abstract
Automatic tuning methods for matrix computations have been investigated since the 1990s. These investigations have revealed that sparse matrix solvers require tuning at a higher level of matrix software semantics as well as more adaptive run-time parameter tuning. The goal of this chapter is to propose a new method of run-time automatic tuning for sparse matrix eigensolvers. In the proposed method, a statistical criterion that captures fluctuations of convergence is introduced for run-time automatic tuning. The results of numerical experiments demonstrate that, in the most effective cases, the proposed method achieves approximately five times faster performance. These results imply that automatic tuning with higher semantics is critical for high performance.
Ken Naono, Takao Sakurai, Masashi Egi
Chapter 9. Systematic Performance Evaluation of Linear Solvers Using Quality Control Techniques
Abstract
A performance evaluation framework for the solution schemes for sparse linear systems is proposed. The framework systematically constructs a performance database that provides a visual diagram of solution algorithms’ performance and characteristics to represent the relationship between the solution algorithms and solution problems. In addition, the database model is best used with software engineering techniques to facilitate automatic tuning of sparse linear solvers. This approach resembles the techniques used in quality control. Two types of cases using this approach are presented. One involves knowledge discovery in a database and reveals that preconditioning is more effective than the choice of solver for obtaining rapid convergence of iterative solutions. The other case is an improvement in quality related to numerical solving processes.
Shoji Itoh, Masaaki Sugihara
Chapter 10. Application of Alternating Decision Trees in Selecting Sparse Linear Solvers
Abstract
The solution of sparse linear systems, a fundamental and resource-intensive task in scientific computing, can be approached through multiple algorithms. Using an algorithm well adapted to the characteristics of the task can significantly enhance performance, such as by reducing the time required for the operation, without compromising the quality of the result. However, the “best” solution method can vary even across linear systems generated in the course of the same PDE-based simulation, thereby making solver selection a very challenging problem. In this paper, we use a machine learning technique, Alternating Decision Trees (ADT), to select efficient solvers based on the properties of sparse linear systems and runtime-dependent features, such as the stage of the simulation. We demonstrate the effectiveness of this method through empirical results over linear systems drawn from computational fluid dynamics and magnetohydrodynamics applications. The results also demonstrate that using ADT can resolve the problem of “over-fitting”, which occurs when only a limited amount of data is available.
Sanjukta Bhowmick, Victor Eijkhout, Yoav Freund, Erika Fuentes, David Keyes
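The selection idea can be sketched with any off-the-shelf classifier. In the sketch below a plain decision tree stands in for the Alternating Decision Trees used in the chapter, and the feature set, labels, and training rows are invented purely for illustration.

```python
# Illustrative sketch: learn a solver-selection rule from simple features of
# sparse linear systems. A DecisionTreeClassifier is a stand-in for ADT.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# invented features: [dimension, nonzeros per row, symmetric?, diag. dominant?]
X = np.array([[1e4, 7, 1, 1], [5e5, 60, 0, 0], [2e4, 5, 1, 0], [8e5, 80, 0, 1]])
y = np.array(["cg+jacobi", "gmres+ilu", "cg+ic", "bicgstab+ilu"])  # fastest observed

selector = DecisionTreeClassifier(max_depth=3).fit(X, y)

def choose_solver(features):
    """Return the predicted best solver/preconditioner pair for one system."""
    return selector.predict(np.array([features]))[0]
```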
Chapter 11. Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework
Abstract
This chapter presents a performance modeling method for numerical simulations in the SILC matrix computation framework. An application program of SILC is a client of a SILC server that provides the client with access to matrix computation libraries in an environment- and language-independent manner. The scope of the present study is to model the performance of a SILC client conducting a numerical simulation by means of a parallel SILC server running on a shared-memory parallel machine. The proposed method employs a simple performance model that describes the execution time of a SILC client as a function of the number of threads on which a parallel SILC server runs. The obtained performance model is then used to determine the optimal number of threads for the particular combination of the SILC client and server. The proposed method was applied to three application programs in combination with an OpenMP-based parallel SILC server running on SGI Altix 3700. Experimental results showed that the proposed method yields accurate estimates of the execution time in most cases. Based on the proposed performance modeling method, an automatic performance tuning mechanism for numerical simulations in SILC is also presented.
Tamito Kajiyama, Akira Nukada, Reiji Suda, Hidehiko Hasegawa, Akira Nishida
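A minimal version of such thread-count modeling might look like the following editorial sketch. The model form T(p) = a + b/p + c·p is an assumption made for illustration, not necessarily the model used in the chapter.

```python
# Illustrative sketch: fit a simple execution-time model from timings at a few
# thread counts, then pick the thread count that minimizes the model.
import numpy as np

def fit_time_model(thread_counts, times):
    """Least-squares fit of T(p) = a + b/p + c*p."""
    p = np.asarray(thread_counts, dtype=float)
    A = np.column_stack([np.ones_like(p), 1.0 / p, p])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(times, dtype=float), rcond=None)
    return coeffs   # (a, b, c)

def best_thread_count(coeffs, max_threads):
    a, b, c = coeffs
    return min(range(1, max_threads + 1), key=lambda p: a + b / p + c * p)
```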
Chapter 12. Exploring Tuning Strategies for Quantum Chemistry Computations
Abstract
Effective execution of applications in a parallel computing environment that shares resources such as network bandwidth, I/O, and main memory requires some sort of adaptive mechanism that enables efficient usage of these resources. The adaptation adjusts the most computationally intensive part of the application, thus leading to efficient execution. The General Atomic and Molecular Electronic Structure System (GAMESS), used for ab initio molecular quantum chemistry calculations, uses NICAN to make adaptations dynamically so as to improve application performance under heavy load conditions. The adaptation mechanism uses the iteration time of each Self-Consistent Field (SCF) iteration to make a decision. The advantage of such a mechanism is the ability to modify the application's execution in a very simple yet effective manner. In this work, we have explored methods to expand this adaptation mechanism by analyzing GAMESS performance through the use of fine-grained data.
Lakshminarasimhan Seshagiri, Meng-Shiou Wu, Masha Sosonkina, Zhao Zhang
Chapter 13. Automatic Tuning of CUDA Execution Parameters for Stencil Processing
Abstract
Recently, the Compute Unified Device Architecture (CUDA) has enabled Graphics Processing Units (GPUs) to accelerate various applications. However, to fully exploit the GPU's computing power, a programmer has to carefully adjust some CUDA execution parameters even for simple stencil processing kernels. Hence, this paper develops an automatic parameter tuning mechanism based on profiling to predict the optimal execution parameters. This paper first discusses the scope of the parameter exploration space determined by the GPU's architectural restrictions. To find the optimal execution parameters, performance models are created by profiling the execution times of the kernel under each promising parameter configuration. The execution parameters are then determined using these performance models. This paper evaluates the performance improvement due to the proposed mechanism using two benchmark programs. The evaluation results clarify that the proposed mechanism can select a suboptimal Cooperative Thread Array (CTA) configuration whose performance is comparable to the optimal one.
Katsuto Sato, Hiroyuki Takizawa, Kazuhiko Komatsu, Hiroaki Kobayashi
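A toy version of the parameter exploration might look like this editorial sketch. The device limit, candidate shapes, and the `run_kernel` launcher are assumptions; a real CUDA tuner would also respect register and shared-memory limits as the chapter describes.

```python
# Illustrative sketch: enumerate thread-block (CTA) shapes within an assumed
# per-block thread limit, then pick the fastest by timing a user-supplied
# launcher that runs the stencil kernel with that configuration.
import time

MAX_THREADS_PER_BLOCK = 1024     # assumed device limit (illustration only)

def candidate_cta_shapes():
    for bx in (16, 32, 64, 128, 256):
        for by in (1, 2, 4, 8, 16):
            if bx * by <= MAX_THREADS_PER_BLOCK:
                yield bx, by

def pick_cta(run_kernel):
    def timed(shape):
        t0 = time.perf_counter()
        run_kernel(*shape)       # launch and wait for the kernel
        return time.perf_counter() - t0
    return min(candidate_cta_shapes(), key=timed)
```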
Chapter 14. Static Task Cluster Size Determination in Homogeneous Distributed Systems
Abstract
In a distributed system consisting of an unknown number of processors, it is important to derive an appropriate number of processors with which a good schedule length is obtained by task scheduling. Many task clustering heuristics have been proposed to determine the number of processors and to minimize the schedule length for scheduling a directed acyclic graph (DAG) application. However, those heuristics are not aware of the actual number of existing processors. As a result, the number of processors determined by an existing task clustering may exceed the number of actually existing processors. Therefore, conventional approaches merge clusters to reduce the number of clusters, at the expense of a lower degree of task parallelism. In this paper, we present a static cluster size determination method, which derives the lower bound of the cluster size by considering the DAG structure and the task-size-to-data-size ratio in order to keep the schedule length small with a small number of processors. Our experimental evaluations by simulation show that the lower bound of each cluster size determined by the proposed method has a good impact on both the schedule length and processor utilization.
Hidehiro Kanemitsu, Gilhyon Lee, Hidenori Nakazato, Takashige Hoshiai, Yoshiyori Urano

Evolution to a General Paradigm

Frontmatter
Chapter 15. Algorithmic Parameter Optimization of the DFO Method with the OPAL Framework
Abstract
We introduce the OPAL framework, in which the identification of good algorithmic parameters is interpreted as a black-box optimization problem whose variables are the algorithmic parameters. In addition to the target algorithm, the user of the framework must supply or select two components. The first is a set of metrics defining the notions of acceptable parameter values and of performance of the algorithm. The second is a collection of representative sets of valid input data for the target algorithm. OPAL may be applied to virtually any context in which parameter tuning leads to increased performance. The black-box optimization problem is solved by way of a direct-search method that provides local optimality guarantees and offers a certain flexibility. We illustrate its use by tuning the parameters of the DFO method from the field of derivative-free optimization.
Charles Audet, Cong-Kien Dang, Dominique Orban
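The black-box formulation can be sketched as follows. This is an editorial illustration: a simple coordinate search stands in for the direct-search (MADS-type) solver that OPAL actually uses, and `run_algorithm`, `metric`, and `test_set` are assumed placeholders.

```python
# Illustrative sketch: parameter tuning as black-box optimization. The
# objective averages a performance metric over representative inputs; a
# coordinate search with shrinking steps plays the role of the direct-search
# solver.
def objective(run_algorithm, metric, test_set, params):
    return sum(metric(run_algorithm(inp, params)) for inp in test_set) / len(test_set)

def coordinate_search(run_algorithm, metric, test_set, params,
                      step=1.0, shrink=0.5, iters=20):
    best = dict(params)
    best_val = objective(run_algorithm, metric, test_set, best)
    for _ in range(iters):
        improved = False
        for name in best:
            for delta in (+step, -step):
                trial = dict(best, **{name: best[name] + delta})
                val = objective(run_algorithm, metric, test_set, trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step *= shrink        # no improving move: refine the mesh
    return best, best_val
```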
Chapter 16. A Bayesian Method of Online Automatic Tuning
Abstract
This chapter discusses mathematical methods for online automatic tuning. After formulating the abstract model of automatic tuning, we review the proposed method, which comprises several novel concepts of automatic tuning, such as online automatic tuning, Bayesian data analysis for quantitative treatments of uncertainties, Bayesian suboptimal sequential experimental design, asymptotic optimality, finite startup, and infinite dilution. Experimental results reveal that the Bayesian sequential experimental design has advantages over random sampling, although random sampling combined with an accurate cost function model can be as good as the Bayesian sequential experimental design.
Reiji Suda
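To give a feel for online Bayesian selection, here is a small sketch that chooses among implementation variants using Gaussian cost models and Thompson sampling. It is an editorial stand-in, not the chapter's Bayesian sequential experimental design; the priors and noise variance are assumptions.

```python
# Illustrative sketch: online automatic tuning by Bayesian updates of each
# variant's mean execution cost, with Thompson sampling for exploration.
import random

class VariantModel:
    def __init__(self, prior_mean=1.0, prior_var=1.0, noise_var=0.25):
        self.mean, self.var, self.noise_var = prior_mean, prior_var, noise_var

    def sample(self):
        return random.gauss(self.mean, self.var ** 0.5)

    def update(self, observed_cost):
        # Gaussian conjugate update of the mean-cost posterior (known noise).
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mean = (self.mean / self.var + observed_cost / self.noise_var) / precision
        self.var = 1.0 / precision

def run_online(variants, run_and_time, rounds=100):
    models = {v: VariantModel() for v in variants}
    for _ in range(rounds):
        choice = min(variants, key=lambda v: models[v].sample())
        models[choice].update(run_and_time(choice))   # observed execution time
    return min(variants, key=lambda v: models[v].mean)
```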
Chapter 17. ABCLibScript: A Computer Language for Automatic Performance Tuning
Abstract
ABCLibScript was developed to enable the low-cost development of software with auto-tuning facility for numerical computational processing. In this chapter, we introduce the basic concept, software framework, programming examples, and future directions of ABCLibScript. Several potential applications of ABCLibScript in advanced computing environments are also presented.
Takahiro Katagiri
Chapter 18. Automatically Tuning Task-Based Programs for Multicore Processors
Abstract
We present a new technique to automatically optimize parallel software for multicore processors. We have implemented the technique for Bamboo, a task-based extension to Java. Optimizing applications for multicore processors requires balancing the competing concerns of parallelism and communication costs. Bamboo uses high-level simulation to explore how to best trade off these competing concerns for an application. The compiler begins by generating several initial candidate implementations. The compiler then uses high-level simulation with profile statistics to evaluate these candidate implementations. It uses an as-built critical path analysis to automatically identify opportunities to improve the candidate implementation and then uses directed simulated annealing to evaluate possible optimizations.
Jin Zhou, Brian Demsky
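The simulated-annealing step can be sketched generically, as in the following editorial illustration. Here `simulate` is assumed to return an estimated cost (e.g., a critical-path time) for a candidate task-to-core mapping, and `neighbors` to enumerate nearby mappings; neither corresponds directly to Bamboo's interfaces.

```python
# Illustrative sketch: directed search by simulated annealing over candidate
# task-to-core mappings scored by a high-level cost simulator.
import math
import random

def anneal(initial_mapping, neighbors, simulate, steps=500, t0=1.0, cooling=0.99):
    current, current_cost = initial_mapping, simulate(initial_mapping)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(steps):
        candidate = random.choice(neighbors(current))
        cost = simulate(candidate)
        # accept improvements always, worse moves with Boltzmann probability
        if cost < current_cost or random.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
        temp *= cooling
    return best, best_cost
```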
Chapter 19. Efficient Program Compilation Through Machine Learning Techniques
Abstract
The wealth of available compiler optimizations leads to the dual problems of finding the best set of optimizations and the best heuristic parameters to tune each optimization. We describe how machine learning techniques, such as logistic regression, can be used to address these problems. We focus on decreasing the compile time for a static commercial compiler, while preserving the execution time. We show that we can speed up the compile process by at least a factor of two with almost the same generated code quality on the SPEC2000 benchmark suite, and that our logistic classifier achieves the same prediction quality for non-SPEC benchmarks.
Gennady Pekhimenko, Angela Demke Brown
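A minimal version of such a classifier-gated optimization pass might look like this editorial sketch using scikit-learn. The features, training rows, and the `should_run_pass` helper are invented for illustration and do not reflect the chapter's actual feature set.

```python
# Illustrative sketch: a logistic classifier predicts, from simple per-method
# features, whether an expensive optimization pass is likely to help; passes
# predicted unhelpful are skipped to reduce compile time.
import numpy as np
from sklearn.linear_model import LogisticRegression

# invented features per method: [instruction count, loop count, call count]
X_train = np.array([[120, 0, 4], [5000, 12, 40], [80, 1, 2], [9000, 25, 10]])
y_train = np.array([0, 1, 0, 1])   # 1 = pass improved this method in profiling

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def should_run_pass(features):
    """Return True if the classifier predicts the pass is worth running."""
    return bool(clf.predict(np.array([features]))[0])
```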
Chapter 20. Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology
Abstract
Autotuning technology has emerged recently as a systematic process for evaluating alternative implementations of a computation to select the best-performing solution for a particular architecture. Specialization optimizes code by customizing it to a particular class of input data. This paper presents a compiler optimization approach that combines novel autotuning compiler technology with specialization for expected data set sizes of key computations, focused on matrix multiplication of small matrices. We describe compiler techniques developed for this approach, including the interface to a polyhedral transformation system for generating specialized code and the heuristics used to prune the enormous search space of alternative implementations. We demonstrate significantly better performance than directly using libraries such as GOTO, ATLAS, and ACML BLAS, which are not specifically optimized for the problem sizes at hand. In a case study of Nek5000, a spectral-element-based code that extensively uses the specialized matrix multiply, we demonstrate a performance improvement of 36% for the full application.
Jaewook Shin, Mary W. Hall, Jacqueline Chame, Chun Chen, Paul D. Hovland
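The specialization idea can be illustrated by generating a fully unrolled multiply for a known small, fixed size, as in the editorial sketch below. A real system would emit and compile C code through the polyhedral framework described in the chapter; generating Python here only demonstrates the principle.

```python
# Illustrative sketch of specialization: generate a fully unrolled multiply
# for a fixed m-by-k times k-by-n product, the kind of kernel a specializing
# autotuner would emit and then time against a general library call.
def make_specialized_matmul(m, k, n):
    lines = ["def matmul(A, B, C):"]
    for i in range(m):
        for j in range(n):
            terms = " + ".join(f"A[{i}][{p}]*B[{p}][{j}]" for p in range(k))
            lines.append(f"    C[{i}][{j}] = {terms}")
    namespace = {}
    exec("\n".join(lines), namespace)   # compile the generated kernel
    return namespace["matmul"]

matmul_4x4 = make_specialized_matmul(4, 4, 4)   # specialized for 4x4 inputs
```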
Backmatter
Metadata
Title
Software Automatic Tuning
Edited by
Ken Naono
Keita Teranishi
John Cavazos
Reiji Suda
Copyright Year
2010
Publisher
Springer New York
Electronic ISBN
978-1-4419-6935-4
Print ISBN
978-1-4419-6934-7
DOI
https://doi.org/10.1007/978-1-4419-6935-4
