
2015 | Book

Tools for High Performance Computing 2014

Proceedings of the 8th International Workshop on Parallel Tools for High Performance Computing, October 2014, HLRS, Stuttgart, Germany

Editors: Christoph Niethammer, José Gracia, Andreas Knüpfer, Michael M. Resch, Wolfgang E. Nagel

Publisher: Springer International Publishing


About this book

Numerical simulation and modelling using High Performance Computing has evolved into an established technique in academic and industrial research. At the same time, the High Performance Computing infrastructure is becoming ever more complex. For instance, most of the current top systems around the world use thousands of nodes in which classical CPUs are combined with accelerator cards in order to enhance their compute power and energy efficiency. This complexity can only be mastered with adequate development and optimization tools. Key topics addressed by these tools include parallelization on heterogeneous systems, performance optimization for CPUs and accelerators, debugging of increasingly complex scientific applications, and optimization of energy usage in the spirit of green IT. This book represents the proceedings of the 8th International Parallel Tools Workshop, held October 1-2, 2014 in Stuttgart, Germany, a forum for discussing the latest advances in parallel tools.

Table of Contents

Frontmatter
Scalasca v2: Back to the Future
Abstract
Scalasca is a well-established open-source toolset that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes. The latest Scalasca v2 release series is based on the community instrumentation and measurement infrastructure Score-P, which is jointly developed by a consortium of partners from Germany and the US. This significantly improves interoperability with other performance analysis tool suites such as Vampir and TAU due to the usage of the two common data formats CUBE4 for profiles and the Open Trace Format 2 (OTF2) for event trace data. This paper will showcase recent as well as ongoing enhancements, such as support for additional platforms (K computer, Intel Xeon Phi) and programming models (POSIX threads, MPI-3, OpenMP4), and features like the critical-path analysis. It also summarizes the steps necessary for users to migrate from Scalasca v1 to Scalasca v2.
Ilya Zhukov, Christian Feld, Markus Geimer, Michael Knobloch, Bernd Mohr, Pavel Saviankou
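As a rough illustration of the communication and synchronization wait states such a trace analysis targets, the following minimal MPI program (a sketch written for this overview, not code from the paper) produces a classic late-sender pattern: rank 1 blocks in MPI_Recv while rank 0 is still computing, so a critical-path analysis would attribute rank 1's waiting time to the imbalance on rank 0.

```c
/* Minimal late-sender example; run with at least two MPI ranks.
 * Illustrative only, not taken from the Scalasca paper. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(2);   /* imbalanced computation keeps rank 0 busy */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 arrives early and waits inside MPI_Recv: the time spent
         * here is a wait state caused by the late sender on rank 0 */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```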
Allinea MAP: Adding Energy and OpenMP Profiling Without Increasing Overhead
Abstract
Allinea MAP was introduced in 2013 as a highly scalable, commercially supported, sampling-based MPI profiler that tracks performance data over time and relates it directly to the program source code. We have since extended its capabilities to support profiling of OpenMP regions and POSIX threads (pthreads) in general. We will show the principles we used to highlight the balance between multi-core (OpenMP) computation, MPI communication and serial code in Allinea MAP’s updated GUI. Graphs detailing performance metrics (memory, IO, vectorised operations etc.) complete the performance profile. We have also added power-usage metrics to Allinea MAP and are actively seeking collaboration with vendors, application users and other tool writers to define how HPC can best meet power requirements on the way towards exascale. MAP’s data is provided for export to other tools and analysis in an open XML-based format.
Christopher January, Jonathan Byrd, Xavier Oró, Mark O’Connor
DiscoPoP: A Profiling Tool to Identify Parallelization Opportunities
Abstract
The stagnation of single-core performance leaves application developers with software parallelism as the only option to further benefit from Moore’s Law. However, in view of the complexity of writing parallel programs, the parallelization of myriads of sequential legacy programs presents a serious economic challenge. A key task in this process is the identification of suitable parallelization targets in the source code. We have developed a tool called DiscoPoP showing how dependency profiling can be used to automatically identify potential parallelism in sequential programs. Our method is based on the notion of computational units, which are small sections of code following a read-compute-write pattern that can form the atoms of concurrent scheduling. DiscoPoP covers both loop and task parallelism. Experimental results show that reasonable speedups can be achieved by parallelizing sequential programs manually according to our findings. By comparing our findings to known parallel implementations of sequential programs, we demonstrate that we are able to detect the most important code locations to be parallelized.
Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, Felix Wolf
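To make the notions of computational units and dependence-based parallelism detection more concrete, the C fragment below (a hand-written illustration, not DiscoPoP code or output) contrasts a loop whose iterations follow an independent read-compute-write pattern with a loop that carries a dependence across iterations.

```c
#include <stdio.h>
#include <stddef.h>

/* Independent read-compute-write iterations: each iteration reads b[i] and s,
 * computes, and writes only a[i]. A dependence profile finds no loop-carried
 * dependence, so the loop is a candidate for parallelization. */
static void scale(double *a, const double *b, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = s * b[i];
}

/* Loop-carried dependence: iteration i reads the value written in iteration
 * i-1 (through acc), which rules out straightforward loop parallelism. */
static double prefix_sum(double *a, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += a[i];
        a[i] = acc;
    }
    return acc;
}

int main(void)
{
    double b[4] = {1.0, 2.0, 3.0, 4.0}, a[4];
    scale(a, b, 2.0, 4);
    printf("last prefix sum: %f\n", prefix_sum(a, 4));
    return 0;
}
```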
Tareador: The Unbearable Lightness of Exploring Parallelism
Abstract
The appearance of multi/many-core processors created a gap between the parallel hardware and sequential software. Furthermore, this gap keeps increasing, since the community has not yet found an appealing solution for parallelizing applications. We propose Tareador as a means of fighting this problem. Tareador is a tool that helps a programmer explore various parallelization strategies and find the one that exposes the highest potential parallelism. Tareador dynamically instruments a sequential application, automatically detects data dependencies between sections of execution, and evaluates the potential parallelism of different parallelization strategies. Furthermore, Tareador includes an automatic search mechanism that explores parallelization strategies and leads to the optimal one. Finally, we blueprint how Tareador could be used together with a parallel programming model and a parallelization workflow in order to facilitate the parallelization of applications.
Vladimir Subotic, Arturo Campos, Alejandro Velasco, Eduard Ayguade, Jesus Labarta, Mateo Valero
Tuning Plugin Development for the Periscope Tuning Framework
Abstract
Periscope, the automatic performance analysis tool, was extended in the European AutoTune project to support automatic tuning. As part of the extension, the tool provides a framework for the development of automatic tuners. The Periscope Tuning Framework (PTF) facilitates the development of advanced tuning plugins by providing the Tuning Plugin Interface (TPI). The tuners are implemented as plugins that are loaded at runtime. These plugins can access the performance analysis features of Periscope as well as its automatic experiment execution support. The partners in AutoTune developed tuning plugins for compiler flag selection, MPI library parameters, MPI IO, master/worker applications, parallel pattern applications, and energy efficiency. This presentation outlines the development of tuning plugins and gives examples from the plugins developed in AutoTune.
Isaías A. Comprés Ureña, Michael Gerndt
Combining Instrumentation and Sampling for Trace-Based Application Performance Analysis
Abstract
Performance analysis is vital for optimizing the execution of high performance computing applications. Today different techniques for gathering, processing, and analyzing application performance data exist. Application level instrumentation for example is a powerful method that provides detailed insight into an application’s behavior. However, it is difficult to predict the instrumentation-induced perturbation as it largely depends on the application and its input data. Thus, sampling is a viable alternative to instrumentation for gathering information about the execution of an application by recording its state at regular intervals. This method provides a statistical overview of the application execution and its overhead is more predictable than with instrumentation. Taking into account the specifics of these techniques, this paper makes the following contributions: (I) A comprehensive overview of existing techniques for application performance analysis. (II) A novel tracing approach that combines instrumentation and sampling to offer the benefits of complete information where needed with reduced perturbation. We provide examples using selected instrumentation and sampling methods to detail the advantage of such mixed information and discuss arising challenges and prospects of this approach.
Thomas Ilsche, Joseph Schuchart, Robert Schöne, Daniel Hackenberg
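The sketch below shows, in plain C and using only standard mechanisms, how exact instrumentation events and periodic samples can be gathered in the same run: GCC's -finstrument-functions hooks count every function entry, while a SIGPROF timer takes samples at a fixed interval. This is an illustration of the general principle and intentionally much simpler than the Score-P-based approach described in the paper.

```c
/* Combining instrumentation and sampling; compile with:
 *   gcc -finstrument-functions combined.c -o combined */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t samples;   /* sampling: periodic snapshots */
static volatile long enters;            /* instrumentation: exact event count */

__attribute__((no_instrument_function))
static void on_sample(int sig) { (void)sig; samples++; }

/* Hooks called by the compiler around every instrumented function. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *site) { (void)fn; (void)site; enters++; }
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *site) { (void)fn; (void)site; }

static void work(void) { for (volatile long i = 0; i < 10000000L; i++) ; }

int main(void)
{
    struct itimerval it = { {0, 1000}, {0, 1000} };  /* sample every 1 ms */
    signal(SIGPROF, on_sample);
    setitimer(ITIMER_PROF, &it, NULL);

    for (int i = 0; i < 100; i++)
        work();                                      /* each call is instrumented */

    printf("instrumented enters: %ld, samples taken: %ld\n",
           enters, (long)samples);
    return 0;
}
```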
Ocelotl: Large Trace Overviews Based on Multidimensional Data Aggregation
Abstract
Performance analysis of parallel applications is commonly based on execution traces that might be investigated through visualization techniques. The weak scalability of such techniques becomes apparent when traces get larger both in time (many events registered) and space (many processing elements), a very common situation for current large-scale HPC applications. In this paper we present an approach to tackle such scenarios in order to give a correct overview of the behavior registered in very large traces. Two configurable and controlled aggregation-based techniques are presented: one based exclusively on temporal aggregation, and another that consists of a spatiotemporal aggregation algorithm. The paper also details the implementation and evaluation of these techniques in Ocelotl, a performance analysis and visualization tool that overcomes current graphical and interpretation limitations by providing a concise overview of the behavior registered in traces. The experimental results show that Ocelotl helps to detect anomalies quickly and accurately in 8 GB traces containing up to 200 million events.
Damien Dosimont, Youenn Corre, Lucas Mello Schnorr, Guillaume Huard, Jean-Marc Vincent
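As a toy illustration of temporal aggregation (the slice count and event values below are invented, and Ocelotl's actual algorithms are considerably more elaborate and also aggregate in space), the following C program reduces a list of timestamped events to a fixed number of time slices, so the size of the overview no longer depends on the size of the trace.

```c
#include <stdio.h>

#define SLICES 8   /* granularity of the overview, chosen for illustration */

struct event { double time; double value; };

int main(void)
{
    struct event trace[] = { {0.1, 3}, {0.4, 5}, {1.2, 2}, {2.7, 7},
                             {3.3, 1}, {5.9, 4}, {6.5, 6}, {7.8, 8} };
    int n = (int)(sizeof trace / sizeof trace[0]);
    double t_end = 8.0;                    /* assumed total trace duration */
    double sum[SLICES] = {0};
    int    cnt[SLICES] = {0};

    /* Map each event to a time slice and accumulate per-slice statistics. */
    for (int i = 0; i < n; i++) {
        int s = (int)(trace[i].time / t_end * SLICES);
        if (s >= SLICES) s = SLICES - 1;
        sum[s] += trace[i].value;
        cnt[s]++;
    }
    for (int s = 0; s < SLICES; s++)
        printf("slice %d: %d events, mean value %.2f\n",
               s, cnt[s], cnt[s] ? sum[s] / cnt[s] : 0.0);
    return 0;
}
```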
Integrating Critical-Blame Analysis for Heterogeneous Applications into the Score-P Workflow
Abstract
High performance computing (HPC) systems increasingly deploy accelerators and coprocessors to achieve maximum performance combined with high energy efficiency. Thus, application design for such large-scale heterogeneous clusters often requires the use of multiple programming models that scale both within and across nodes and accelerators. To assist programmers in the complex task of application development and optimization, sophisticated performance analysis tools are necessary. It has been shown that CASITA, an analysis tool for complex MPI, OpenMP and CUDA applications, is able to effectively identify valuable optimization targets by means of critical-blame analysis for applications utilizing multiple programming models. This paper presents the integration of CASITA into the Score-P tool infrastructure. We depict the complete Score-P measurement and analysis workflow, including the collection of performance data for the CUDA, OpenMP and MPI programming models, the tracking of dependencies between work performed on the host and on the accelerator, as well as the waiting-time and critical-blame analysis with CASITA and the visualization of analysis results in Vampir.
Felix Schmitt, Robert Dietrich, Jonas Stolle
Studying Performance Changes with Tracking Analysis
Abstract
Scientific applications can have so many parameters, possible usage scenarios and target architectures that a single experiment is often not enough for an effective analysis that yields a sound understanding of their performance behavior. Different software and hardware settings may have a strong impact on the results, but trying and measuring in detail even just a few possible combinations to decide which configuration is better rapidly floods the user with excessive amounts of information to compare.
In this chapter we introduce a novel methodology for performance analysis based on object tracking techniques. The most compute-intensive parts of the program are automatically identified via cluster analysis, and we then track the evolution of these regions across different experiments to see how the behavior of the program changes with the varying settings and over time. This methodology addresses an important problem in HPC performance analysis, where the volume of data that can be collected expands rapidly in a potentially high-dimensional space of performance metrics; our approach manages this complexity and identifies coarse properties that change when parameters are varied, which can then be targeted by tuning and more detailed performance studies.
Germán Llort, Harald Servat, Juan Gonzalez, Judit Gimenez, Jesús Labarta
A Flexible Data Model to Support Multi-domain Performance Analysis
Abstract
Performance data can be complex and potentially high dimensional. Further, it can be collected in multiple, independent domains. For example, one can measure code segments, hardware components, data structures, or an application’s communication structure. Performance analysis and visualization tools require access to this data in an easy way and must be able to specify relationships and mappings between these domains in order to provide users with intuitive, actionable performance analysis results.
In this paper, we describe a data model that can represent such complex performance data, and we discuss how this model helps us to specify mappings between domains. We then apply this model to several use cases both for data acquisition and how it can be mapped into the model, and for performance analysis and how it can be used to gain insight into an application’s performance.
Martin Schulz, Abhinav Bhatele, David Böhme, Peer-Timo Bremer, Todd Gamblin, Alfredo Gimenez, Kate Isaacs
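As a purely hypothetical sketch of what measurements in multiple, independent domains can look like in code (the type and field names below are invented for illustration and do not reflect the data model defined in the paper), the structures tie one measurement to entities in a code domain and a hardware domain, which is the kind of mapping a tool needs in order to relate, for example, cache misses observed on a socket back to source lines.

```c
#include <stdio.h>

struct code_region  { const char *file; int line; };    /* code domain */
struct hw_component { int socket; int core; };           /* hardware domain */

/* One measurement referencing entities in both domains. */
struct sample {
    struct code_region  where;
    struct hw_component on;
    long                cache_misses;
};

int main(void)
{
    struct sample s = { {"solver.c", 128}, {0, 3}, 40215 };
    /* With such a mapping, a tool can answer questions such as:
     * which code regions cause the misses observed on socket 0? */
    printf("%s:%d on socket %d, core %d -> %ld cache misses\n",
           s.where.file, s.where.line, s.on.socket, s.on.core, s.cache_misses);
    return 0;
}
```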
Metadata
Title
Tools for High Performance Computing 2014
Editors
Christoph Niethammer
José Gracia
Andreas Knüpfer
Michael M. Resch
Wolfgang E. Nagel
Copyright Year
2015
Electronic ISBN
978-3-319-16012-2
Print ISBN
978-3-319-16011-5
DOI
https://doi.org/10.1007/978-3-319-16012-2
