nach oben

2017 | Buch

Tools for High Performance Computing 2016

Proceedings of the 10th International Workshop on Parallel Tools for High Performance Computing, October 2016, Stuttgart, Germany

herausgegeben von: Christoph Niethammer, José Gracia, Tobias Hilbrich, Andreas Knüpfer, Michael M. Resch, Wolfgang E. Nagel

Verlag: Springer International Publishing

Enthalten in: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

Einloggen, um Zugang zu erhalten

Über dieses Buch

This book presents the proceedings of the 10th International Parallel Tools Workshop, held October 4-5, 2016 in Stuttgart, Germany – a forum to discuss the latest advances in parallel tools.

High-performance computing plays an increasingly important role for numerical simulation and modelling in academic and industrial research. At the same time, using large-scale parallel systems efficiently is becoming more difficult. A number of tools addressing parallel program development and analysis have emerged from the high-performance computing community over the last decade, and what may have started as collection of small helper script has now matured to production-grade frameworks. Powerful user interfaces and

Inhaltsverzeichnis

Frontmatter

Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

Abstract

Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. The Roofline Model and the Execution-Cache-Memory (ECM) model are proven approaches to performance modeling of loop nests. In comparison to the Roofline model, the ECM model can also describes the single-core performance and saturation behavior on a multicore chip.We give an introduction to the Roofline and ECM models, and to stencil performance modeling using layer conditions (LC). We then present Kerncraft, a tool that can automatically construct Roofline and ECM models for loop nests by performing the required code, data transfer, and LC analysis. The layer condition analysis allows to predict optimal spatial blocking factors for loop nests. Together with the models it enables an ab-initio estimate of the potential benefits of loop blocking optimizations and of useful block sizes. In cases where LC analysis is not easily possible, Kerncraft supports a cache simulator as a fallback option. Using a 25-point long-range stencil we demonstrate the usefulness and predictive power of the Kerncraft tool.

Julian Hammer, Jan Eitzinger, Georg Hager, Gerhard Wellein

Defining and Searching Communication Patterns in Event Graphs Using the g-Eclipse Trace Viewer Plugin

Abstract

The use of event graphs is a common approach to debug and analyze message passing parallel programs. Although event graphs are very useful for program understanding and debugging, they get confusing and hard to read for programs with complex communication behavior, long runtimes and a large numbers of processes. An approach to ease this problem is to simplify the event graph by marking occurrences of predefined well known communication structures. This allows to quickly identify different regions of activity in the event graph without further inspection. It also helps to identify parts, where certain communication patterns are expected but do not occur due to a bug in the parallel application, in this case the pattern might only match to a certain degree. In this paper we present a language for the description of such communication patterns, which allows to describe the patterns in a way that also covers variations in process numbers and process mappings. Furthermore it demonstrates a pattern matching plugin for the Trace Viewer of g-Eclipse which uses an specialized algorithm for detecting patterns in prerecorded event traces of parallel programs. Based on the presented approach a variety of improvements for the processing and presentation of event graphs are imaginable. The extracted pattern information could be used to optimize the analyzed program or to reduce the contents of the graph to areas of interest, by substituting non interesting parts by placeholders.

Thomas Köckerbauer, Dieter Kranzlmüller

Monitoring Heterogeneous Applications with the OpenMP Tools Interface

Abstract

Heterogeneous systems are gaining more importance in supercomputing, yet they are challenging to program and developers require support tools to understand how well their accelerated codes perform and how they can be improved. The OpenMP Tools Interface (OMPT) is a new performance monitoring interface that is being considered for integration into the OpenMP standard. OMPT allows monitoring the execution of heterogeneous OpenMP applications by revealing the activity of the runtime through a standardized API as well as facilitating the exchange of performance information between devices with accelerated codes, and the analysis tool. In this paper we describe our efforts implementing parts of the OMPT specification necessary to monitor accelerators. In particular, the integration of the OMPT features to our parallel runtime system and instrumentation framework helps to obtain detailed performance information about the execution of the accelerated tasks issued to the devices to allow an insightful analysis. As a result of this analysis, the parallel runtime of the programming model has been improved. We focus on the evaluation of monitoring FPGA devices studying the performance of a common kernel in scientific algorithms: matrix multiplication. Nonetheless, this development is as well applicable to monitor GPU accelerators and Intel®; Xeon Phi^TM co-processors operating under the OmpSs programming model.

Michael Wagner, Germán Llort, Antonio Filgueras, Daniel Jiménez-González, Harald Servat, Xavier Teruel, Estanislao Mercadal, Carlos Álvarez, Judit Giménez, Xavier Martorell, Eduard Ayguadé, Jesús Labarta

Extending the Functionality of Score-P Through Plugins: Interfaces and Use Cases

Abstract

Performance measurement and runtime tuning tools are both vital in the HPC software ecosystem and use similar techniques: the analyzed application is interrupted at specific events and information on the current system state is gathered to be either recorded or used for tuning. One of the established performance measurement tools is Score-P. It supports numerous HPC platforms and parallel programming paradigms. To extend Score-P with support for different back-ends, create a common framework for measurement and tuning of HPC applications, and to enable the re-use of common software components such as implemented instrumentation techniques, this paper makes the following contributions: (1) We describe the Score-P metric plugin interface, which enables programmers to augment the event stream with metric data from supplementary data sources that are otherwise not accessible for Score-P. (2) We introduce the flexible Score-P substrate plugin interface that can be used for custom processing of the event stream according to the specific requirements of either measurement, analysis, or runtime tuning tasks. (3) We provide examples for both interfaces that extend Score-P’s functionality for monitoring and tuning purposes.

Robert Schöne, Ronny Tschüter, Thomas Ilsche, Joseph Schuchart, Daniel Hackenberg, Wolfgang E. Nagel

Debugging Latent Synchronization Errors in MPI-3 One-Sided Communication

Abstract

The Message Passing Interface (MPI-3) provides a one-sided communication interface, also known as MPI Remote Memory Access (RMA), which enables one process to specify all required communication parameters for both the sending and receiving side. While this communication interface enables superior performance potential developers have to deal with a complex memory consistency model. Proper synchronization of asynchronous remote memory accesses to shared data structures is a challenging task. More importantly, it is difficult to pinpoint such synchronization bugs as they do not necessarily manifest in an error or occur for example only after porting the application to a different HPC environment.

We introduce a debugging tool to support the detection of latent synchronization bugs. Based on the semantic flexibility of the MPI-3 specification we dynamically modify executions of improperly synchronized MPI remote memory accesses to force a manifestation of an error. An experimental evaluation with small applications and the usage in a library which heavily relies on MPI RMA reveal that this approach can uncover synchronization bugs which would otherwise likely go unnoticed.

Roger Kowalewski, Karl Fürlinger

Trace-Based Detection of Lock Contention in MPI One-Sided Communication

Abstract

Performance analysis is an essential part of the development process of HPC applications. Thus, developers need adequate tools to evaluate design and implementation decisions to effectively develop efficient parallel applications. Therefore, it is crucial that tools provide an as complete support as possible for the available language and library features to ensure that design decisions are not negatively influenced by the level of available tool support. The message passing interface (MPI) supports three basic communication paradigms: point-to-point, collective, and one-sided. Each of these targets and excels at a specific application scenario. While current performance tools support the first two quite well, one-sided communication is often neglected. In our earlier work, we were able to reduce this gap by showing how wait states in MPI one-sided communication using active-target synchronization can be detected at large scale using our trace-based message replay technique. Further extending our work on the detection of progress-related wait states in ARMCI, this paper presents an improved infrastructure that is capable of not only detecting progress-related wait states, but also wait states due to lock contention in MPI passive-target synchronization. We present an event-based definition of lock contention, the trace-based algorithm to detect it, as well as initial results with a micro-benchmark and an application kernel scaling up to 65,536 processes.

Marc-André Hermanns, Markus Geimer, Bernd Mohr, Felix Wolf

Machine Learning-Driven Automatic Program Transformation to Increase Performance in Heterogeneous Architectures

Abstract

We present a program transformation approach to convert procedural code into functionally equivalent code adapted to a given platform. Our framework is based on the application of guarded transformation rules that capture semantic conditions to ensure the soundness of their application. Our goal is to determine a sequence of rule applications which transform some initial code into final code which optimizes some non-functional properties. The code to be transformed is adorned with semantic annotations, either provided by the user or by external analysis tools. These annotations give information to decide whether applying a transformation rule is or is not sound. In general, there are several rules applicable at several program points and, besides, transformation sequences do not monotonically change the optimization function. Therefore, we face a search problem that grows exponentially with the length of the transformation sequence. In our experience with even small examples, that becomes impractical very quickly. In order to effectively deal with this issue, we have adopted a machine-learning approach using classification trees and reinforcement learning. It learns from successful transformation sequences and produces encodings of strategies which can provide long-term rewards for a given characteristic, avoiding local minima. We have evaluated the proposed technique in a series of benchmarks, adapting standard C code to GPU execution via OpenCL. We have found the automatically produced code to be as efficient as hand-written code generated by an expert human programmer.

Salvador Tamarit, Guillermo Vigueras, Manuel Carro, Julio Mariño

Titel: Tools for High Performance Computing 2016
herausgegeben von: Christoph Niethammer
José Gracia
Tobias Hilbrich
Andreas Knüpfer
Michael M. Resch
Wolfgang E. Nagel
Verlag: Springer International Publishing
Electronic ISBN: 978-3-319-56702-0
Print ISBN: 978-3-319-56701-3
DOI: https://doi.org/10.1007/978-3-319-56702-0

Springer Professional

Über dieses Buch

Inhaltsverzeichnis

Frontmatter

Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

Defining and Searching Communication Patterns in Event Graphs Using the g-Eclipse Trace Viewer Plugin

Monitoring Heterogeneous Applications with the OpenMP Tools Interface

Extending the Functionality of Score-P Through Plugins: Interfaces and Use Cases

Debugging Latent Synchronization Errors in MPI-3 One-Sided Communication

Trace-Based Detection of Lock Contention in MPI One-Sided Communication

Machine Learning-Driven Automatic Program Transformation to Increase Performance in Heterogeneous Architectures

Premium Partner