
About this book

Past and current research in computer performance analysis has focused primarily on dedicated parallel machines. However, future applications in the area of high-performance computing will use not only individual parallel systems but also a large set of networked resources. This scenario of computational and data Grids is attracting a great deal of attention from both computer and computational scientists. In addition to the inherent complexity of parallel machines, the sharing and transparency of the available resources introduce new challenges for performance analysis, techniques, and systems. In order to meet those challenges, a multi-disciplinary approach to the multi-faceted problems of performance is required. New degrees of freedom will come into play with a direct impact on the performance of Grid computing, including wide-area network performance, quality-of-service (QoS), heterogeneity, and middleware systems, to mention only a few.



Performance Modeling and Analysis


Different Approaches to Automatic Performance Analysis of Distributed Applications

Parallel computing is a promising approach that provides more powerful computing capabilities for many scientific research fields to solve new problems. However, to take advantage of such capabilities it is necessary to ensure that the applications are successfully designed and that their performance is satisfactory. This implies that the task of the application designer does not finish when the application is free of functional bugs, and that it is necessary to carry out some performance analysis and application tuning to reach the expected performance. This application tuning requires a performance analysis, including the detection of performance bottlenecks, the identification of their causes and the modification of the application to improve behavior. These tasks require a high degree of expertise and are usually time consuming. Therefore, tools that automate some of these tasks are useful, especially for non-expert users. In this paper, we present three tools that cover different approaches to automatic performance analysis and tuning. In the first approach, we apply static automatic performance analysis. The second is based on run-time automatic analysis. The last approach sets out dynamic automatic performance tuning.
Tomàs Margalef, Josep Jorba, Oleg Morajko, Anna Morajko, Emilio Luque

Performance Modeling of Deterministic Transport Computations

In this work we present a performance model that encompasses the key characteristics of an Sn transport application using unstructured meshes. Sn transport is an important part of the ASCI workload. This work builds on previous analysis of the structured-mesh case. The performance modeling of an unstructured grid application presents a number of complexities and subtleties that do not arise for structured grids. The resulting analytical model is parametric in basic system performance characteristics (latency, bandwidth, MFLOPS rate, etc.) and application characteristics (mesh size, etc.). It is validated on a large HP AlphaServer system, showing high accuracy. The model compares favorably to a trace-based modeling approach, which is specific to a single mesh/processor mapping situation. The model is used to give insight into the achievable performance on possible future processing systems containing thousands of processors.
Darren J. Kerbyson, Adolfy Hoisie, Shawn D. Pautz

Performance Optimization of RK Methods Using Block-Based Pipelining

The efficiency of modern microprocessors is extremely sensitive towards the structure and memory access pattern of programs to be executed. This is caused by memory hierarchies which were introduced to reduce average memory access times. In this paper, we consider embedded Runge-Kutta (RK) methods for the solution of ordinary differential equations arising from space discretization problems for partial differential equations and study their efficient implementation on modern microprocessors. Different program variants with different execution orders and storage schemes are investigated. In particular, we explore how the potential parallelism in the stage vector computation can be exploited in a pipelining approach in order to improve the locality behavior of the RK implementations. Experiments show that this results in efficiency improvements on several recent processors.
Matthias Korch, Thomas Rauber, Gudula Rünger

Performance Evaluation of Hybrid Parallel Programming Paradigms

With the trend in the supercomputing world shifting from homogeneous machine architectures to hybrid clusters of SMP nodes, the interoperability of OpenMP and MPI has become a key issue in understanding and optimizing the overall system performance. While the low-level performance of MPI and OpenMP can be evaluated using existing benchmarks, the combination of the two poses new challenges. Therefore, a performance study of different hybrid programming paradigms is of high benefit for both the vendors and the user community. As part of our project, we have identified several possible combinations of the two models in order to provide qualitative and quantitative justification of situations in which any one of them is to be favoured. Collective operations are particularly important to analyze and evaluate on a hybrid platform and therefore we concentrate our study on three of them: barrier, all-to-all, and all-reduce. Issues like the optimal mix of OpenMP and MPI, the most efficient way of managing MPI communication from within OpenMP, the optimal unit of communication, and the degree of overlap between computation and communication need to be evaluated. The performance results supporting this investigation were taken on the IBM Power-3 machine at San Diego Supercomputer Center using our suite of hybrid microbenchmarks.
Achal Prabhakar, Vladimir Getov

Performance Modelling for Task-Parallel Programs

Many applications from scientific computing and physical simulations can benefit from a mixed task and data parallel implementation on parallel machines with a distributed memory organization, but it may also be the case that a pure data parallel implementation leads to faster execution times. Since the effort for writing a mixed task and data parallel implementation is large, it would be useful to have an a priori estimation of the possible benefits of such an implementation on a given parallel machine. In this article, we propose an estimation method for the execution time that is based on the modelling of computation and communication times by runtime formulas. The effect of concurrent message transmissions is captured by a contention factor for the specific target machine. To demonstrate the usefulness of the approach, we consider a complex method for the solution of ordinary differential equations with a potential for a mixed task and data parallel execution. As distributed memory machines, we consider the Cray T3E and a Linux cluster.
Matthias Kühnemann, Thomas Rauber, Gudula Rünger
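
The abstract above does not give the runtime formulas themselves. As a rough illustration of the general idea only, the following sketch assumes a generic latency/bandwidth communication model in which a contention factor (assumed to be at least 1) scales the transfer term. All function and parameter names are hypothetical and are not taken from the chapter.

```python
def message_time(size_bytes, latency_s, bandwidth_bytes_per_s, contention=1.0):
    """Time for one message: startup latency plus a contention-scaled transfer term."""
    if contention < 1.0:
        raise ValueError("contention factor is assumed to be >= 1")
    return latency_s + contention * size_bytes / bandwidth_bytes_per_s


def estimated_runtime(flops, flop_rate, n_messages, size_bytes,
                      latency_s, bandwidth_bytes_per_s, contention=1.0):
    """Sum of a computation term and a communication term, both in seconds."""
    compute = flops / flop_rate
    communicate = n_messages * message_time(
        size_bytes, latency_s, bandwidth_bytes_per_s, contention
    )
    return compute + communicate
```

Evaluating such a formula once for a pure data parallel variant and once for a mixed task and data parallel variant, each with its own message counts and sizes, would give the kind of a priori comparison the abstract describes.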

Collective Communication Patterns on the Quadrics Network

The efficient implementation of collective communication is a key factor in providing good performance and scalability for communication patterns that involve global data movement and global control. Moreover, it is essential for enhancing the fault tolerance of a parallel computer, for instance to check the status of the nodes, run a distributed load-balancing algorithm, synchronize the local clocks, or monitor performance. Therefore, support for multicast communication can improve the performance and resource utilization of a parallel computer. The Quadrics interconnect (QsNET), which is being used in some of the largest machines in the world, provides hardware support for multicast. The basic mechanism allows a message to be sent to any set of contiguous nodes in the same time it takes to send a unicast message. The two main collective communication primitives provided by the network software are barrier synchronization and broadcast. Both are implemented in two different ways: using the hardware support when the nodes are contiguous, or using a balanced tree and unicast messaging otherwise. In this paper, performance results are given for these collective communication services. They show, on the one hand, the outstanding performance of the hardware-based primitives even in the presence of high background network traffic and, on the other hand, the limited performance achieved with the software-based implementation.
Salvador Coll, José Duato, Francisco J. Mora, Fabrizio Petrini, Adolfy Hoisie

Performance Tools and Systems


The Design of a Performance Steering System for Component-Based Grid Applications

A major method of constructing applications to run on a computational Grid is to assemble them from components — separately deployable units of computation of well-defined functionality. Performance steering is an adaptive process involving run-time adjustment of factors affecting the performance of an application. This paper presents a design for a system capable of steering, towards a minimum run-time, the performance of a component-based application executing in a distributed fashion on a computational Grid. The proposed performance steering system controls the performance of single applications, and the basic design seeks to separate application-level and component-level concerns. The existence of a middleware resource scheduler external to the performance steering system is assumed, and potential problems are discussed. A possible model of operation is given in terms of application and component execution phases. The need for performance prediction capability, and for repositories of application-specific and component-specific performance information, is discussed. An initial implementation is briefly described.
Ken Mayes, Graham D. Riley, Rupert W. Ford, Mikel Luján, Len Freeman, Cliff Addison

Advances in the Tau Performance System

To address the increasing complexity in parallel and distributed systems and software, advances in performance technology towards more robust tools and broader, more portable implementations are needed. In doing so, new challenges for performance instrumentation, measurement, analysis, and visualization arise to address evolving requirements for how performance phenomena are observed and how performance data is used. This paper presents recent advances in the TAU performance system in four areas where improvements in performance technology are important: instrumentation control, performance mapping, performance interaction and steering, and performance databases. In the area of instrumentation control, we are concerned with the removal of instrumentation in cases of high measurement overhead. Our approach applies rule-based analysis of performance data in an iterative instrumentation process. Work on performance mapping focuses on measuring performance with respect to dynamic calling paths when the static callgraph cannot be determined prior to execution. We describe an online performance data access, analysis, and visualization system that will form the basis of a large-scale performance interaction and steering system. Finally, we describe our approach to the management of performance data in a database framework that supports multi-experiment analysis.
Allen D. Malony, Sameer Shende, Robert Bell, Kai Li, Li Li, Nick Trebon

Uniform Resource Visualization: Software and Services

Computing environments continue to increase in scale, heterogeneity, and hierarchy, with resource usage varying dynamically during program execution. Computational and data grids and distributed collaboration environments are examples. To understand performance and gain insights into developing applications that efficiently use the system resources, performance visualization has proven useful. However, visualization tools often are specific to a particular resource or level in the system, possibly with fixed views, and thus limit a user’s ability to observe and improve performance. Information integration is necessary for system-level performance monitoring. Uniform resource visualization (URV) is a component-based framework being developed to provide uniform interfaces between resource instrumentation, called resource monitoring components (RMC) and performance views, called visualization components (VC). URV supports services for connecting VCs to RMCs, and creating multi-level views, as well as visual schema definitions for sharing and reusing visualization design knowledge.
Kukjin Lee, Diane T. Rover

A Performance Analysis Tool for Interactive Grid Applications

The paper presents the main features of a performance analysis tool for applications running on the Grid, which is not limited to standard measurements, but also comprises application-specific metrics and other high-level measurements. These requirements are not well addressed by the existing tools in the area of parallel and distributed programming. The paper outlines the main ideas as well as the design details of the G-PM tool developed within the EU CrossGrid project whose aim is to widen the use of Grid technology for interactive applications. The focus is on the operation of G-PM’s components, its internal interfaces, as well as the graphical user interface.
Marian Bubak, Włodzimierz Funika, Roland Wismüller

Dynamic Instrumentation for Java Using a Virtual JVM

Dynamic instrumentation, meaning modification of an application’s instructions at run-time in order to monitor its behaviour, is a very powerful foundation for a wide range of program manipulation tools. This paper concerns the problem of implementing dynamic instrumentation for a managed run-time environment such as a Java Virtual Machine (JVM). We present a flexible new approach based on a “virtual” JVM, which runs above a standard JVM but intercepts application control flow in order to allow it to be modified at run-time. Our Veneer Virtual JVM works by fragmenting each method’s bytecode at specified points (such as basic blocks). The fragmentation process can include static analysis passes which associate dependence and liveness metadata with each block in order to facilitate run-time optimisation. We conclude with some preliminary performance results, and discuss further applications of the tool.
Kwok Yeung, Paul H. J. Kelly, Sarah Bennett

Aksum: A Performance Analysis Tool for Parallel and Distributed Applications

Aksum is a multi-experiment performance analysis tool for message passing, shared memory and mixed parallelism programs; it automatically instruments the user’s application, generates versions of this application using a set of user-supplied input parameters, collects the data generated by the instrumentation and analyzes it, relates the performance problems back to the source code, and compares the performance behavior across multiple experiments.
Aksum automatically searches for performance bottlenecks based on the concept of performance properties. In contrast to much existing work, performance properties are normalized (values between 0 for the best case and 1 for the worst case), enabling the user to interpret the resulting performance behavior. Aksum is highly customizable, which allows users to build or define their own performance tool. Performance properties are defined in JavaPSL, and may be freely edited, removed from or added to Aksum in order to customize and speed up the search process. The performance properties found can be grouped, filtered, and displayed in several dimensions. Experiments with a material science code are shown in order to demonstrate the usefulness of our approach.
Thomas Fahringer, Clovis Seragiotto
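
To make the idea of a normalized performance property concrete, here is a minimal sketch in Python. It is not JavaPSL and does not reproduce Aksum's actual property definitions. It assumes a hypothetical property whose severity is the fraction of a reference time lost to some overhead, clamped to the interval from 0 (best case) to 1 (worst case), and a hypothetical filter threshold.

```python
def severity(lost_time, reference_time):
    """Normalized severity in [0, 1]: share of the reference time that was lost."""
    if reference_time <= 0:
        raise ValueError("reference_time must be positive")
    return min(1.0, max(0.0, lost_time / reference_time))


def rank_properties(measurements, threshold=0.0):
    """Return (name, severity) pairs above the threshold, most severe first.

    measurements maps a property name to a (lost_time, reference_time) pair.
    """
    scored = [(name, severity(lost, ref)) for name, (lost, ref) in measurements.items()]
    kept = [(name, s) for name, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Because every property is on the same 0 to 1 scale, values for different properties, or for the same property in different experiments, can be compared and filtered directly.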

Grid Performance and Applications


Commercial Applications of Grid Computing

This paper provides an overview of commercial applications of Grid computing. We discuss Web performance and present a Grid caching architecture. Our Grid caching architecture offloads requests to Grid caches when Web servers become overloaded. We describe performance and traffic modeling techniques which can enhance Grid applications such as caching. We also discuss how Grid computing can be applied to financial applications. A key requirement here is that fast response times are needed. We present a Grid services scheduler that is well suited to commercial applications requiring fast response times.
Catherine Crawford, Daniel Dias, Arun Iyengar, Marcos Novaes, Li Zhang

Mesh Generation and Optimistic Computation on the Grid

This paper describes the concept of optimistic grid computing. This allows applications to synchronize more loosely and better tolerate the dynamic and heterogeneous bandwidths and latencies that are seen in grid environments. Based on the observed performance of a world-wide grid testbed, we estimate target operating regions for grid applications. Mesh generation is the primary test application where boundary mesh cavities can be optimistically expanded in parallel. To manage the level of optimistic execution and stay within the application’s operating region, we are integrating grid performance monitoring and prediction into the supporting runtime system. The ultimate goal of this project is to generalize the experience and knowledge of optimistic grid computing gained through mesh generation into a tool that can be applied to other tightly coupled computations in other application domains.
Nikos Chrisochoides, Craig Lee, Bruce Lowekamp

Grid Performance and Resource Management Using Mobile Agents

Mobile agents provide an important paradigm for supporting dynamic services in Computational Grids. We outline reasons why mobile agents are useful and how they can provide support for resource discovery and performance management in the context of service-oriented Grids. We also discuss factors which are likely to limit the uptake of the mobile agent approach, and how some of these restrictions can be overcome. This approach is subsequently exemplified by means of a mobile agent based programming and execution framework, the MAGDA system.
Beniamino Di Martino, Omer F. Rana

Monitoring of Interactive Grid Applications

This paper presents the OCM-G, a Grid application monitoring system. The OCM-G aims to provide services through which tools supporting application development can gather information about running applications, manipulate them, and detect events that occur during their execution. The functionality of the OCM-G is available via a standardized interface, the On-line Monitoring Interface Specification (OMIS). The OCM-G is designed to work in a Grid environment. This implies a distributed and decentralized design that allows for large-scale scalability and the capability to handle multiple applications, users and tools at the same time, while ensuring security. The design of the OCM-G assumes that one part of it is permanent, which allows it to work as a Grid service and additionally enables communication through firewalls, whereas another part is transient and private to each Grid user, which solves the major security problems. In the paper, we provide a short overview of OMIS, describe the design of the OCM-G and discuss Grid-specific requirements including necessary OMIS extensions as well as security issues.
Bartosz Baliś, Marian Bubak, Włodzimierz Funika, Tomasz Szepieniec, Roland Wismüller

The Unicore Grid and Its Options for Performance Analysis

UNICORE (Uniform Interface to Computer Resources) is a software infrastructure to support seamless and secure access to distributed resources. It was developed by the projects UNICORE and UNICORE Plus from 1997 to 2002 (funded by the German Ministry of Education and Research) and is going to be enhanced in the EU-funded projects EUROGRID and GRIP. The UNICORE system allows uniform access to different hardware and software platforms as well as different organizational environments. The core part is the abstract job model. The abstract job specification is translated into a concrete batch job for the target system. Among other things, application-specific support is a major feature of the system. By exploiting the plugin mechanism, support for performance analysis of Grid applications can be added. As an example, support for Vampirtrace has been integrated. The UNICORE user interface then gives the option to add a task using Vampirtrace with runtime configuration support to a UNICORE job and to retrieve the generated trace files for local visualization. Together with the support for compilation and linkage and for metacomputing, the plugin mechanism may be used to integrate other performance analysis tools in the future.
Sven Haubold, Hartmut Mix, Wolfgang E. Nagel, Mathilde Romberg

