Skip to main content

Über dieses Buch

This book presents the proceedings of the 11th International Parallel Tools Workshop, a forum to discuss the latest advances in parallel tools, held September 11-12, 2017 in Dresden, Germany.

High-performance computing plays an increasingly important role for numerical simulation and modeling in academic and industrial research. At the same time, using large-scale parallel systems efficiently is becoming more difficult. A number of tools addressing parallel program development and analysis has emerged from the high-performance computing community over the last decade, and what may have started as a collection of a small helper scripts has now matured into production-grade frameworks. Powerful user interfaces and an extensive body of documentation together create a user-friendly environment for parallel tools.



A Structured Approach to Performance Analysis

Performance analysis tools are essential in the process of understanding application behavior, identifying critical performance issues and adapting applications to new architectures and increasingly scaling HPC systems. State-of-the-art tools provide extensive functionality and a plenitude of specialized analysis capabilities. At the same time, the complexity of the potential performance issues and sometimes the tools themselves remains a challenging task, especially for non-experts. In particular, identifying the main issues in the overwhelming amount of data and tool opportunities as well as quantifying their impact and potential for improvement can be tedious and time consuming. In this paper we present a structured approach to performance analysis used within the EU Centre of Excellence for Performance Optimization and Productivity (POP). The structured approach features a method to get a general overview, determine the focus of the analysis, and identify the main issues and areas for potential improvement with a statistical performance model that leads to starting points for a subsequent in-depth analysis. All steps of the structured approach are accompanied with according tools from the BSC tool suite and underlined with an exemplary performance analysis.
Michael Wagner, Stephan Mohr, Judit Giménez, Jesús Labarta

Counter Inspection Toolkit: Making Sense Out of Hardware Performance Events

Hardware counters play an essential role in understanding the behavior of performance-critical applications, and inform any effort to identify opportunities for performance optimization. However, because modern hardware is becoming increasingly complex, the number of counters that are offered by the vendors increases and, in some cases, so does their complexity. In this paper we present a toolkit that aims to assist application developers invested in performance analysis by automatically categorizing and disambiguating performance counters. We present and discuss the set of microbenchmarks and analyses that we developed as part of our toolkit. We explain why they work and discuss the non-obvious reasons why some of our early benchmarks and analyses did not work in an effort to share with the rest of the community the wisdom we acquired from negative results.
Anthony Danalis, Heike Jagode, Hanumantharayappa, Sangamesh Ragate, Jack Dongarra

ASSIST: An FDO Source-to-Source Transformation Tool for HPC Applications

The complexity and the diversity of computer architectures have dramaticaly evolved over the last decade, which makes it impossible to manually optimize codes for all these architectures. In addition, compilers must remain conservative with respect to their optimization choices because of their static cost model. One way to guide them is to use feedback data from data profiling of a representative training dataset (FDO/PGO) for a given application. It then becomes possible, based on that knowledge, to add specific compiler directives and/or flags to enhance performance. Moreover, automatic transformations simplifying portions of the application (e.g. specialization) can be applied. In this paper we present ASSIST, a directive-oriented source-to-source manipulation tool that aims at providing such assistance. The tool is integrated into the MAQAO toolset and takes advantage of all the available static and dynamic profiling data produced by the other tools. It also features a set of code transformations triggered by directives. The combination of both leads to an autotuning process that helps users to keep their code as generic as possible whilst also benefiting from a performance gain related to feedback or user knowledge. We demonstrate how we can build a compiler’s PGO-like tool and compare our first results to the Intel compiler PGO mode.
Youenn Lebras, Andres S. Charif Rubial, Romain Dolbeau, William Jalby

Unifying the Analysis of Performance Event Streams at the Consumer Interface Level

Several instrumentation interfaces have been developed for parallel programs to make observable actions that take place during execution and to make accessible information about the program’s behavior and performance. Following in the footsteps of the successful profiling interface for MPI (PMPI), new rich interfaces to expose internal operation of MPI (MPI-T) and OpenMP (OMPT) runtimes are now in the standards. Taking advantage of these interfaces requires tools to selectively collect events from multiples interfaces by various techniques: function interposition (PMPI), value read (MPI-T), and callbacks (OMPT). In this paper, we present the unified instrumentation pipeline proposed by the MALP infrastructure that can be used to forward a variety of fine-grained events from multiple interfaces online to multi-threaded analysis processes implemented orthogonally with plugins. In essence, our contribution complements “front-end” instrumentation mechanisms by a generic “back-end” event consumption interface that allows “consumer” callbacks to generate performance measurements in various formats for analysis and transport. With such support, online and post-mortem cases become similar from an analysis point of view, making it possible to build more unified and consistent analysis frameworks. The paper describes the approach and demonstrates its benefits with several use cases.
Jean-Baptiste Besnard, Allen D. Malony, Sameer Shende, Marc Pérache, Patrick Carribault, Julien Jaeger

OMPT-Multiplex: Nesting of OMPT Tools

In version 5.0 the OpenMP specification [4] will define a tool interface (OMPT) that allows monitoring tools to gain insights into implementation specific information about the execution behavior of an OpenMP application. The OMPT interface provides information about certain events during the execution, but also allows to query the OpenMP runtime about state and stack frame information.
Joachim Protze, Tim Cramer, Simon Convent, Matthias S. Müller

SCIPHI Score-P and Cube Extensions for Intel Phi

The Knights Landing processors offers unique features with regards to memory hierarchy and vectorization capabilities. To improve tool support within these two areas, we present extensions to the Score-P measurement infrastructure and the Cube report explorer. With the Knights Landing edition, Intel introduced a new memory architecture, utilizing two types of memory, MCDRAM and DDR4 SDRAM. To assist the user in the decision where to place data structures, we introduce a MCDRAM candidate metric to the Cube report explorer. In addition we track all MCDRAM allocations through the hbwmalloc interface, providing memory metrics like leaked memory or the high-water mark on a per-region basis, as already known for the ubiquitous malloc/free. A Score-P metric plugin that records memory statistics via numastat on a per process level enables a timeline analysis using the Vampir toolset. To get the best performance out of , the large vector processing units need to be utilized effectively. The ratio between computation and data access and the vector processing unit (VPU) intensity are introduced as metrics to identify vectorization candidates on a per-region basis. The Portable Hardware Locality (hwloc) Broquedis et al. (hwloc: a generic framework for managing hardware affinities in hpc applications, 2010 [2]) library allows us to visualize the distribution of the KNL-specific performance metrics within the Cube report explorer, taking the hardware topology consisting of processor tiles and cores into account.
Marc Schlütter, Christian Feld, Pavel Saviankou, Michael Knobloch, Marc-André Hermanns, Bernd Mohr

Towards Elastic Resource Management

A new paradigm for HPC Resource Management, called Elastic Computing, is under development at the Invasive Computing Transregional Collaborative Research Center. An extension to MPI for programming elastic applications and a resource manager were implemented. The resource manager is an extension of the SLURM batch scheduler. Resource elasticity allows the resource manager to dictate changes in the resource allocations of running applications based on scheduler decisions. These resource allocation changes are decided by the scheduler based on performance feedback from the applications. The collection of performance feedback from running applications poses unique challenges for the runtime system. In this document, our current performance feedback system is presented.
Isaías A. Comprés Ureña, Michael Gerndt

Online Performance Analysis with the Vampir Tool Set

Today, performance analysis of parallel applications is mandatory to fully exploit the capabilities of modern HPC systems. Many performance analysis tools are available to support users in this challenging task. All tools usually employ one of two analysis methodologies. The majority of analysis tools, such as HPCToolkit or Vampir, follow a post-mortem analysis approach. In this approach, a measurement infrastructure records performance data during the application execution and flushes its data to the file system. The tools perform subsequent analysis steps after the application execution by using the stored performance data. Post-mortem analysis comes with the disadvantage that possibly large data volumes need to be handled by the I/O subsystem of the machine. Tools following an online analysis approach mitigate this disadvantage by avoiding the I/O subsystem. The measurement infrastructure of these tools uses the network to directly transfer the recorded performance data to the analysis components of the tool. This approach, however, comes with the limitation that the complete analysis occurs at application runtime. In this work we present a prototype implementation of Vampir capable of performing online analysis. We discuss advantages and disadvantages of both approaches and draw conclusions for designing an online performance analysis tool.
Matthias Weber, Johannes Ziegenbalg, Bert Wesarg
Weitere Informationen

Premium Partner