
About this Book

Automatic Performance Prediction of Parallel Programs presents a unified approach to the problem of automatically estimating the performance of parallel computer programs. The author focuses primarily on distributed memory multiprocessor systems, although large portions of the analysis can be applied to shared memory architectures as well.
The author introduces a novel and very practical approach to predicting some of the most important performance parameters of parallel programs, including work distribution, number of transfers, amount of data transferred, network contention, transfer time, computation time, and number of cache misses. This approach is based on advanced compiler analysis that carefully examines, at the program level, loop iteration spaces, procedure calls, array subscript expressions, communication patterns, data distributions, and optimizing code transformations, and, at the machine level, the most important machine-specific parameters, including cache characteristics, communication network indices, and benchmark data for computational operations.
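Parameters such as transfer time are often approximated by simple analytical cost models driven by compiler-derived counts and measured machine indices. As an illustration only (not the book's actual formulas, and with made-up parameter names and numbers), a linear startup-plus-bandwidth model can be sketched as:

```python
def transfer_time(num_transfers, bytes_per_transfer, startup_s, bandwidth_bps):
    """Linear communication cost model: each transfer pays a fixed
    startup latency plus a per-byte cost determined by the bandwidth."""
    per_transfer_s = startup_s + bytes_per_transfer / bandwidth_bps
    return num_transfers * per_transfer_s

# Hypothetical machine indices: 60 us startup, 2.5 MB/s sustained bandwidth.
total_s = transfer_time(num_transfers=100, bytes_per_transfer=8192,
                        startup_s=60e-6, bandwidth_bps=2.5e6)
```

A performance estimator in this spirit would feed the number of transfers and the amount of data transferred (both derived by compiler analysis) together with benchmarked machine indices into such a model.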
The material has been fully implemented as part of P3T, which is an integrated automatic performance estimator of the Vienna Fortran Compilation System (VFCS), a state-of-the-art parallelizing compiler for Fortran77, Vienna Fortran and a subset of High Performance Fortran (HPF) programs.
A large number of experiments using realistic HPF and Vienna Fortran code examples demonstrate highly accurate performance estimates, and the ability of the described performance prediction approach to successfully guide both programmer and compiler in parallelizing and optimizing parallel programs.
A graphical user interface is described and illustrated that visualizes each program source line together with the corresponding parameter values. P3T uses color-coded performance visualization to identify hot spots in the parallel program at a glance, and performance data can be filtered and displayed at various levels of detail. In the printed book, the colors of the graphical user interface are reproduced in greyscale.
Automatic Performance Prediction of Parallel Programs also covers fundamental problems of automatic parallelization for distributed memory multicomputers, the basic parallelization strategy, and the large variety of optimizing code transformations implemented in VFCS.

Table of Contents

Frontmatter

1. Introduction

Abstract
Multiprocessor systems (MPSs) have become a fundamental tool of science and engineering. They can be used to simulate highly complex processes occurring in nature, in industry, or in scientific experiments. The formulation of such simulations is based upon mathematical models typically governed by partial differential equations, whose solution in a discretized version suitable for computation may require trillions of operations. In engineering, there are many examples of applications where the use of MPSs saves millions of dollars annually, as well as saving material resources. With their help, it is possible to model the behavior of automobiles in a crash, or of airplanes in critical flight situations in a manner which is so realistic that many situations can be simulated on a computer before an actual physical prototype is built.
Thomas Fahringer

2. Model

Abstract
In this chapter we present the sequential and parallel programming models underlying this work. The first section introduces the subset of the sequential programming language that is accepted for the parallelization process. Fundamental characteristics of the data space, program state, statement instances, and program statements are defined; these apply equally to sequential and parallel programs. The next section presents the important features of the parallel programming language: we describe a model for processors, data distributions, and overlap areas, and define the most essential explicit parallel program constructs. In Section 2.4 we outline the basic parallelization strategy by specifying a source-to-source translation from a sequential to an optimized parallel program using the JACOBI relaxation code. Section 2.5 describes some important transformations performed by advanced compilers, together with an explanation of how and why they improve performance. The next section describes how P3T and Weight Finder are used as integrated tools of VFCS. Finally, we conclude with a summary.
Thomas Fahringer
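The JACOBI relaxation code named in the abstract is the chapter's running example. As a reference point, a minimal sequential version of one relaxation sweep can be sketched in Python (the book itself works with Fortran sources, so this is an illustration, not the book's code):

```python
import numpy as np

def jacobi_step(u):
    """One Jacobi relaxation sweep: every interior grid point is replaced
    by the average of its four neighbours; boundary values stay fixed."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v
```

A distributed-memory parallelization partitions the grid across processors; the neighbour accesses at partition boundaries are exactly what the overlap areas described above capture.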

3. Sequential Program Parameters

Abstract
In order to achieve a substantial performance improvement for a given program, a large search tree of program transformation sequences has to be analyzed by a parallelizer. A common way to prune this search tree is to restrict transformations to the computation-intensive program parts: frequently, only a small part of an application program accounts for a large portion of its overall runtime. Finding these performance-critical program parts in the original sequential program is a crucial problem; once they are known, the parallelizer can concentrate the parallelization effort on these program sections. Our approach enables the user to find these program parts by runtime profiling.
Thomas Fahringer

4. Parallel Program Parameters

Abstract
The application of compiler transformations and data distributions to a parallel program may induce a variety of trade-offs. For instance, loop distribution might allow communication to be pulled out of a loop nest, but adds overhead for additional loop header statements. Scalar expansion may help to break dependences and thus permit parallelization, at the cost of additional memory allocation and of cache misses induced by accessing the expanded scalars as arrays. Loop interchange may permit communication to be pulled out to a higher loop nest level, which in turn might cause a loss in cache performance.
Thomas Fahringer
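The scalar-expansion trade-off mentioned above can be sketched as follows (an illustrative Python example, not taken from the book): the scalar temporary is reused across iterations, while the expanded version makes the iterations independent at the cost of n extra elements of memory:

```python
import numpy as np

def with_scalar(a, b):
    # The scalar t is rewritten every iteration, so a compiler must
    # privatize or expand it before the loop can run in parallel.
    n = len(a)
    out = np.empty(n)
    for i in range(n):
        t = a[i] + b[i]
        out[i] = t * t
    return out

def with_expansion(a, b):
    # Scalar expansion: t becomes an array of n elements, iterations are
    # independent and parallelizable, but extra memory is allocated and
    # the expanded array may incur additional cache misses.
    t = a + b            # the "expanded" scalar
    return t * t
```

Both versions compute the same result; the difference lies purely in the dependence structure and the memory footprint.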

5. Experiments

Abstract
This chapter describes experimental results of P3T as an integrated tool of VFCS. We present a variety of experiments that assess the estimation accuracy of the parallel program parameters. Several representative kernels, large subroutines, and reasonably sized programs are analyzed under P3T, and the parallel program parameters computed by P3T are compared against measurements taken on the iPSC/860 hypercube. Next, we demonstrate the effectiveness of P3T in supporting the performance tuning and parallelization effort under VFCS. Finally, we present the graphical user interface of P3T, which enables the user to view performance information at various levels of detail.
Thomas Fahringer

6. Related Work

Abstract
In this chapter we first evaluate the most common techniques applied to the problem of performance prediction of parallel programs; second, we classify existing work with respect to single and multiprocessor models.
Thomas Fahringer

7. Conclusions

Abstract
In this book, we have presented a novel parameter-based approach to the problem of automatic performance prediction for parallel programs. Although this book focuses primarily on distributed memory multiprocessor systems, significant portions of the analysis described can be applied to shared memory architectures as well. Even single processor performance modeling is included.
Thomas Fahringer

Backmatter
