
2012 | Book

Tools for High Performance Computing 2011

Proceedings of the 5th International Workshop on Parallel Tools for High Performance Computing, September 2011, ZIH, Dresden

Edited by: Holger Brunst, Matthias S. Müller, Wolfgang E. Nagel, Michael M. Resch

Publisher: Springer Berlin Heidelberg

About this Book

The proceedings of the 5th International Workshop on Parallel Tools for High Performance Computing provide an overview of supportive software tools and environments in the fields of system management, parallel debugging and performance analysis. In its pursuit of maintaining the exponential performance growth of high performance computers, the HPC community is currently targeting exascale systems. Initial planning for exascale already started when the first petaflop system was delivered. Many challenges need to be addressed to reach the necessary performance: scalability, energy efficiency and fault tolerance must be increased by orders of magnitude. This goal can only be achieved when advanced hardware is combined with a suitable software stack. In fact, the importance of software is rapidly growing, and as a result many international projects focus on the necessary software.

Table of Contents

Frontmatter
Chapter 1. Creating a Tool Set for Optimizing Topology-Aware Node Mappings
Abstract
Modern HPC systems, such as Cray’s XE and IBM’s Blue Gene line, feature sophisticated network architectures, often in the form of high dimensional tori. In order to fully exploit the performance of these systems, it is necessary to carefully map an application’s communication structure to the underlying network topology. In this step, both latency (i.e., physical distance between nodes) and bandwidth (i.e., number of concurrently used links) have to be taken into account, leading to mappings that are often non-intuitive. To help developers with this complex problem, we are developing a set of tools that aim at helping users understand the communication behavior of their codes, map them onto the network architecture, and create better-performing topology-aware node mappings. In this paper, we present initial steps towards this goal, including a measurement environment capturing both communication patterns and network metrics within the same run, a methodology to compare these measurements, and a visualization tool that helps users understand the impact of their application’s characteristics on the network behavior.
Martin Schulz, Abhinav Bhatele, Peer-Timo Bremer, Todd Gamblin, Katherine Isaacs, Joshua A. Levine, Valerio Pascucci
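As a rough illustration of the cost metric such a mapping tool has to evaluate, the following C sketch sums hop-weighted message volume for one candidate mapping on a 3D torus. The torus dimensions, task coordinates and communication volumes are made-up inputs, and this is not the authors' tool:

    #include <stdio.h>
    #include <stdlib.h>

    #define DX 4
    #define DY 4
    #define DZ 4
    #define NTASKS 8

    /* shortest distance between two coordinates on a ring of length dim */
    static int ring_dist(int a, int b, int dim)
    {
        int d = abs(a - b);
        return d < dim - d ? d : dim - d;
    }

    static int hops(const int a[3], const int b[3])
    {
        return ring_dist(a[0], b[0], DX) + ring_dist(a[1], b[1], DY)
             + ring_dist(a[2], b[2], DZ);
    }

    int main(void)
    {
        /* candidate mapping: task -> torus coordinate (illustrative values) */
        int coord[NTASKS][3] = {
            {0,0,0}, {1,0,0}, {2,0,0}, {3,0,0},
            {0,1,0}, {1,1,0}, {2,1,0}, {3,1,0}
        };
        /* bytes exchanged per task pair: a made-up nearest-neighbour pattern */
        double vol[NTASKS][NTASKS] = { { 0 } };
        for (int t = 0; t < NTASKS; t++)
            vol[t][(t + 1) % NTASKS] = 1e6;

        double cost = 0.0;
        for (int i = 0; i < NTASKS; i++)
            for (int j = 0; j < NTASKS; j++)
                cost += vol[i][j] * hops(coord[i], coord[j]);

        printf("hop-weighted volume of this mapping: %.0f byte-hops\n", cost);
        return 0;
    }
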
Chapter 2. Using Sampling to Understand Parallel Program Performance
Abstract
Developing scalable parallel applications for extreme-scale systems is challenging, and existing languages, compilers, and autotuners address this challenge only partially. As a result, manual performance tuning is often necessary to obtain high application performance. Rice University’s HPCToolkit is a suite of performance tools that supports innovative techniques for pinpointing and quantifying performance bottlenecks in fully optimized parallel programs with a measurement overhead of only a few percent. Many of these techniques were designed to leverage sampling for performance measurement, attribution, analysis, and presentation. This paper surveys some of HPCToolkit’s most interesting techniques and argues that sampling-based performance analysis is surprisingly versatile and effective.
Nathan R. Tallent, John Mellor-Crummey
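The general idea of sampling-based measurement, independently of HPCToolkit's actual implementation, can be sketched with a POSIX profiling timer: a SIGPROF handler attributes each sample to whatever phase the program is currently in. The phase names and loop workloads below are invented for illustration:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    enum { PHASE_COMPUTE, PHASE_IO, NUM_PHASES };
    static volatile sig_atomic_t current_phase = PHASE_COMPUTE;
    static volatile long samples[NUM_PHASES];

    static void on_sample(int sig)
    {
        (void)sig;
        samples[current_phase]++;              /* attribute this sample to the current phase */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sample;
        sigaction(SIGPROF, &sa, NULL);

        struct itimerval it = { { 0, 1000 }, { 0, 1000 } };   /* ~1000 samples/s of CPU time */
        setitimer(ITIMER_PROF, &it, NULL);

        volatile double x = 0.0;
        current_phase = PHASE_COMPUTE;
        for (long i = 0; i < 200000000L; i++) x += 1e-9 * i;  /* stands in for computation */

        current_phase = PHASE_IO;
        for (long i = 0; i < 50000000L; i++) x -= 1e-9 * i;   /* stands in for I/O */

        printf("compute: %ld samples, io: %ld samples\n",
               (long)samples[PHASE_COMPUTE], (long)samples[PHASE_IO]);
        return 0;
    }
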
Chapter 3. likwid-bench: An Extensible Microbenchmarking Platform for x86 Multicore Compute Nodes
Abstract
Microbenchmarking is an essential tool for characterizing modern compute nodes. Apart from determining raw performance capabilities, microbenchmarking can be used to acquire input parameters for performance models or to mimic the behavior of more complex applications. Many existing microbenchmarks are not extensible and are implemented in C or Fortran. One problem with microbenchmarks written in a high-level language is that many performance issues are only apparent at the instruction level; the quality of the compiler-generated code is an additional source of variation. likwid-bench is a framework enabling rapid prototyping of loop-based, threaded assembly kernels. It eases the process of implementing assembly kernels by providing a portable assembly language independent of any concrete assembler program. likwid-bench already includes many standard microbenchmarking test cases and can be used out of the box as a microbenchmarking tool.
Jan Treibig, Georg Hager, Gerhard Wellein
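For readers unfamiliar with the kind of kernels such a tool exercises, the plain-C vector triad below, timed with clock_gettime, is a minimal stand-in; it is not likwid-bench syntax, and the array size and repetition count are arbitrary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (8 * 1024 * 1024)
    #define REPS 20

    int main(void)
    {
        double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c), *d = malloc(N * sizeof *d);
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; d[i] = 0.5; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++)
            for (long i = 0; i < N; i++)
                a[i] = b[i] + c[i] * d[i];                     /* vector triad */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* four arrays of 8 bytes are touched per loop iteration */
        printf("check %.1f, bandwidth %.2f GB/s\n",
               a[N / 2], (double)REPS * N * 4 * 8 / sec / 1e9);
        free(a); free(b); free(c); free(d);
        return 0;
    }
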
Chapter 4. An Open-Source Tool-Chain for Performance Analysis
Abstract
Modern supercomputers with multi-core nodes enhanced by accelerators, as well as hybrid programming models, introduce more complexity into modern applications. Efficiently exploiting all of the available resources requires a complex performance analysis of applications in order to detect time-consuming or idle sections. This paper presents an open-source tool-chain for analyzing the performance of parallel applications. It is composed of a trace generation framework called EZTrace, a generic interface for writing traces in multiple formats called GTG, and a trace visualizer called ViTE. These tools cover the main steps of performance analysis – from the instrumentation of applications to trace analysis – and are designed to maximize compatibility with other performance analysis tools. Thus, these tools support multiple file formats and are not bound to a particular programming model. The evaluation of these tools shows that they provide performance similar to that of other analysis tools.
Kevin Coulomb, Augustin Degomme, Mathieu Faverge, François Trahay
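As a hedged illustration of what the trace-generation layer of such a tool chain records, the toy writer below emits timestamped enter/leave events to a text file; the event format, function names and file name are invented and do not correspond to the GTG API or any of its supported trace formats:

    #include <stdio.h>
    #include <time.h>

    static FILE *trace;

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    static void trace_enter(const char *name) { fprintf(trace, "%.9f ENTER %s\n", now(), name); }
    static void trace_leave(const char *name) { fprintf(trace, "%.9f LEAVE %s\n", now(), name); }

    static void solve(void)
    {
        trace_enter("solve");
        /* ... the work whose performance is being analysed ... */
        trace_leave("solve");
    }

    int main(void)
    {
        trace = fopen("toy.trace", "w");   /* invented file name and format */
        trace_enter("main");
        solve();
        trace_leave("main");
        fclose(trace);
        return 0;
    }
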
Chapter 5. Debugging CUDA Accelerated Parallel Applications with TotalView
Abstract
CUDA introduces developers to a number of concepts (such as kernels, streams, warps and explicitly multi-level memory) beyond what they are used to in serial, parallel and multi-threaded applications. Visibility into these elements is critical for troubleshooting and tuning applications that make use of CUDA. This paper will highlight CUDA concepts implemented in CUDA 3.0–4.0, the complications they introduce for troubleshooting, and how TotalView helps the user deal with these new CUDA specific constructs.
Chris Gottbrath, Royd Lüdtke
Chapter 6. Advanced Memory Checking Frameworks for MPI Parallel Applications in Open MPI
Abstract
In this paper, we describe the implementation of memory checking functionality based on instrumentation tools. The combination of instrumentation-based checking functions and the MPI implementation offers superior debugging functionality for errors that otherwise cannot be detected with comparable MPI debugging tools. Our implementation contains three parts: first, a memory callback extension implemented on top of the Valgrind Memcheck tool for advanced memory checking in parallel applications; second, a new instrumentation tool developed on the basis of the Intel Pin framework, which provides functionality similar to Memcheck and can be used in Windows environments that have no access to the Valgrind suite; third, the integration of all checking functionality as the so-called memchecker framework within Open MPI, which will also allow other memory debuggers offering a similar API to be integrated. The tight control of the user’s memory passed to Open MPI allows us to detect application errors and to track bugs within Open MPI itself. The extension of the callback mechanism targets communication buffer checks in both pre- and post-communication phases, in order to analyze the usage of the received data, e.g. whether the received data has been overwritten before it is used in a computation or whether the data is never used. We describe the actual checks and the classes of errors found, explain how memory buffers are handled internally, show errors actually found in users’ code, and discuss the performance implications of our instrumentation.
Shiqing Fan, Rainer Keller, Michael Resch
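Valgrind's client-request macros from valgrind/memcheck.h give a feel for how a message-passing layer can guard communication buffers; the buffer-handling functions below are an illustrative sketch, not Open MPI's memchecker code:

    #include <stdlib.h>
    #include <string.h>
    #include <valgrind/memcheck.h>

    void post_receive(void *buf, size_t len)
    {
        /* until the message arrives, touching the buffer is an error */
        VALGRIND_MAKE_MEM_NOACCESS(buf, len);
    }

    void message_arrived(void *buf, const void *data, size_t len)
    {
        VALGRIND_MAKE_MEM_UNDEFINED(buf, len);    /* writable again            */
        memcpy(buf, data, len);                   /* library fills the buffer  */
        VALGRIND_MAKE_MEM_DEFINED(buf, len);      /* now safe for the user     */
    }

    void post_send(const void *buf, size_t len)
    {
        /* sending uninitialised data is almost always a bug: report it */
        VALGRIND_CHECK_MEM_IS_DEFINED(buf, len);
    }

    int main(void)
    {
        char payload[64] = "hello";
        char *recv_buf = malloc(64);

        post_send(payload, sizeof payload);
        post_receive(recv_buf, 64);
        message_arrived(recv_buf, payload, 64);

        free(recv_buf);
        return 0;
    }

Run under valgrind --tool=memcheck, a user read from the receive buffer between post_receive and message_arrived would be flagged as an invalid access.
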
Chapter 7. Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir
Abstract
This paper gives an overview of the Score-P performance measurement infrastructure, which is being jointly developed by leading HPC performance tools groups. It motivates the advantages of the joint undertaking from both the developer and the user perspective, and presents the design and components of the newly developed Score-P performance measurement infrastructure. Furthermore, it contains first evaluation results in comparison with existing performance tools and presents an outlook on the long-term cooperative development of the new system.
Andreas Knüpfer, Christian Rössel, Dieter an Mey, Scott Biersdorff, Kai Diethelm, Dominic Eschweiler, Markus Geimer, Michael Gerndt, Daniel Lorenz, Allen Malony, Wolfgang E. Nagel, Yury Oleynik, Peter Philippen, Pavel Saviankou, Dirk Schmidl, Sameer Shende, Ronny Tschüter, Michael Wagner, Bert Wesarg, Felix Wolf
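A minimal sketch of manual region instrumentation on top of such an infrastructure is shown below, assuming the SCOREP_USER_* macros and the scorep compiler wrapper as documented by the Score-P project; the program and region name are invented:

    /* Build sketch (assuming the documented wrapper and user API):
     *     scorep --user gcc -o solver solver.c
     *     SCOREP_ENABLE_TRACING=true ./solver
     */
    #include <scorep/SCOREP_User.h>

    void solve_step(void)
    {
        SCOREP_USER_REGION_DEFINE(step_region)
        SCOREP_USER_REGION_BEGIN(step_region, "solve_step",
                                 SCOREP_USER_REGION_TYPE_COMMON)
        /* ... numerical work to be measured ... */
        SCOREP_USER_REGION_END(step_region)
    }

    int main(void)
    {
        for (int i = 0; i < 10; i++)
            solve_step();
        return 0;
    }
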
Chapter 8. Trace-Based Performance Analysis for Hardware Accelerators
Abstract
Hardware accelerators have changed the HPC landscape as they open a potential route to exascale computing. At the same time they also complicate the task of application development, since they introduce another level of parallelism and, thus, complexity. Performance tool support to aid the developer will be a necessity. While profiling is offered by the accelerator vendors, tracing tools can also be adapted to hardware accelerators. A number of challenges for data acquisition and visualization, together with their solutions, are presented in this paper.
Guido Juckeland
Chapter 9. Folding: Detailed Analysis with Coarse Sampling
Abstract
Performance analysis tools help application users find bottlenecks that prevent an application from running at full speed on current supercomputers. The level of detail and the accuracy of the performance tools are crucial for completely depicting the nature of the bottlenecks. The details exposed depend not only on the nature of the tools (profile-based or trace-based) but also on the mechanism on which they rely (instrumentation or sampling) to gather information. In this paper we present a mechanism called folding that combines both instrumentation and sampling for trace-based performance analysis tools. The folding mechanism takes advantage of long execution runs and low-frequency sampling to finely detail the evolution of the user code with minimal overhead on the application. The reports provided by the folding mechanism are extremely useful for understanding the behavior of a region of code at a very low level. We also present a practical study carried out in an in-production scenario and show that the results of folding resemble those of high-frequency sampling.
Harald Servat, Germán Llort, Judit Giménez, Kevin Huck, Jesús Labarta
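The folding idea itself can be illustrated in a few lines of C: sparse samples scattered over many iterations are mapped to their offset within one iteration and accumulated into a histogram, yielding a detailed picture of a "synthetic" iteration. All timestamps below are made up, and this is not the authors' implementation:

    #include <stdio.h>

    #define BINS 10

    int main(void)
    {
        /* iteration start times (e.g. from a coarse instrumentation event), in seconds */
        double iter_start[] = { 0.0, 1.0, 2.0, 3.0 };
        double iter_len = 1.0;
        int    n_iter = 4;

        /* low-frequency samples collected across the whole run */
        double samples[] = { 0.12, 0.74, 1.31, 1.95, 2.48, 2.81, 3.07, 3.66 };
        int    n_samples = 8;

        int hist[BINS] = { 0 };
        for (int s = 0; s < n_samples; s++)
            for (int i = 0; i < n_iter; i++) {
                double off = samples[s] - iter_start[i];
                if (off >= 0.0 && off < iter_len) {
                    hist[(int)(off / iter_len * BINS)]++;  /* fold into one synthetic iteration */
                    break;
                }
            }

        for (int b = 0; b < BINS; b++)
            printf("%2d%%-%3d%% of the iteration: %d folded samples\n",
                   b * 10, (b + 1) * 10, hist[b]);
        return 0;
    }
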
Chapter 10. Advances in the TAU Performance System
Abstract
The evolution and growth of parallel systems require continued advances in the tools to measure, characterize, and understand parallel performance. Five recent developments in the TAU Performance System are reported. First, an update is given on support for heterogeneous systems with GPUs. Second, event-based sampling is being integrated into TAU to add new capabilities for performance observation. Third, new wrapping technology has been incorporated into TAU’s instrumentation harness, increasing observation scope. The fourth advance is in the area of performance visualization. Lastly, we discuss our work on the Eclipse Parallel Tools Platform.
Allen Malony, Sameer Shende, Wyatt Spear, Chee Wai Lee, Scott Biersdorff
Chapter 11. Temanejo: Debugging of Thread-Based Task-Parallel Programs in StarSs
Abstract
To make use of manycore processors and even accelerators, several parallel programming paradigms exist, such as OpenMP, CAPS HMPP and the StarSs programming model. All of these programming models provide the means for programmers to express parallelism in the source code by identifying tasks and, for all but OpenMP, the dependencies between them, allowing the compiler and the runtime to schedule tasks onto multiple concurrently executing entities, such as threads in a many-core system. While the programmer may have a good overview of which parts of the code may be run independently as separate tasks at a fine-granular level, the overall execution behavior may not be obvious at first. This paper describes the usability features of the newly developed Temanejo debugger.
Rainer Keller, Steffen Brinkmann, José Gracia, Christoph Niethammer
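StarSs pragma syntax is not reproduced here; as an analogue, the OpenMP 4.0 task example below (depend clauses, plain C, compiled with e.g. gcc -fopenmp) expresses the kind of task graph with input/output dependencies that a tool like Temanejo visualises:

    #include <stdio.h>

    int main(void)
    {
        int a = 0, b = 0, c = 0;

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)
            a = 1;                              /* producer task             */

            #pragma omp task depend(out: b)
            b = 2;                              /* independent producer      */

            #pragma omp task depend(in: a, b) depend(out: c)
            c = a + b;                          /* runs after both producers */

            #pragma omp taskwait
            printf("c = %d\n", c);
        }
        return 0;
    }
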
Chapter 12. HiFlow3: A Hardware-Aware Parallel Finite Element Package
Abstract
The goal of this paper is to describe the hardware-aware parallel C++ finite element package HiFlow3. HiFlow3 aims at providing a powerful platform for simulating processes modelled by partial differential equations. Our vision is to solve boundary value problems in an appropriate way by coupling numerical simulations with modern software design and state-of-the-art hardware technologies. The main functionality for mapping the mathematical model into parallel software is implemented in the three core modules Mesh, DoF/FEM and Linear Algebra (LA). Parallelism is realized on two levels: the modules provide efficient MPI-based distributed data structures to achieve performance on large HPC systems as well as on stand-alone workstations, and the hardware-aware cross-platform approach in the LA module additionally accelerates the solution process by exploiting the computing power of emerging technologies such as multi-core CPUs and GPUs. In this context, a performance evaluation on different hardware architectures is demonstrated.
H. Anzt, W. Augustin, M. Baumann, T. Gengenbach, T. Hahn, A. Helfrich-Schkarbanenko, V. Heuveline, E. Ketelaer, D. Lukarski, A. Nestler, S. Ritterbusch, S. Ronnas, M. Schick, M. Schmidtobreick, C. Subramanian, J.-P. Weiss, F. Wilhelm, M. Wlotzka
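As an illustration of the MPI level of such a distributed linear-algebra module (not HiFlow3 code), the sketch below computes a dot product over a vector partitioned across ranks and combines the partial results with MPI_Allreduce:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each rank owns a local slice of the distributed vectors */
        enum { NLOCAL = 1000 };
        double x[NLOCAL], y[NLOCAL];
        for (int i = 0; i < NLOCAL; i++) { x[i] = 1.0; y[i] = 2.0; }

        double local = 0.0, global = 0.0;
        for (int i = 0; i < NLOCAL; i++)
            local += x[i] * y[i];                      /* node-local part  */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);        /* global reduction */

        if (rank == 0)
            printf("dot(x,y) = %.1f over %d ranks\n", global, size);
        MPI_Finalize();
        return 0;
    }
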
Backmatter
Metadata
Title
Tools for High Performance Computing 2011
Edited by
Holger Brunst
Matthias S. Müller
Wolfgang E. Nagel
Michael M. Resch
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-31476-6
Print ISBN
978-3-642-31475-9
DOI
https://doi.org/10.1007/978-3-642-31476-6
