
This book contains the revised selected papers of 4 workshops held in conjunction with the International Conference on High Performance Computing, Networking, Storage and Analysis (SC) in November 2017 in Denver, CO, USA, and in November 2018 in Dallas, TX, USA: the 6th and 7th International Workshop on Extreme-Scale Programming Tools, ESPT 2017 and ESPT 2018, and the 4th and 5th International Workshop on Visual Performance Analysis, VPA 2017 and VPA 2018.

The 11 full papers of ESPT 2017 and ESPT 2018 and the 6 full papers of VPA 2017 and VPA 2018 were carefully reviewed and selected for inclusion in this book. The papers discuss the requirements for exascale-enabled tools as well as new approaches to applying visualization and visual analytic techniques to large-scale applications. Topics of interest include: programming tools; methodologies for performance engineering; tool technologies for extreme-scale challenges (e.g., scalability, resilience, power); tool support for accelerated architectures and large-scale multi-cores; tool infrastructures and environments; evolving/future application requirements for programming tools and technologies; application developer experiences with programming and performance tools; scalable displays of performance data; case studies demonstrating the use of performance visualization in practice; data models to enable scalable visualization; graph representation of unstructured performance data; presentation of high-dimensional data; visual correlations between multiple data sources; human-computer interfaces for exploring performance data; and multi-scale representations of performance data for visual exploration.

### ESPT 2017

#### Frontmatter

Abstract
The PAPI performance library is a widely used tool for gathering self-monitored performance data from running applications. A key aspect of self-monitoring is the ability to read hardware performance counters with minimum possible overhead. If read overhead becomes too large then the act of measurement will start to interfere with the gathered results, adversely affecting the performance analysis.
On Linux systems PAPI uses the perf_event subsystem to access the counter values via the read() system call. On x86 systems the special rdpmc instruction allows userspace measurement of counters without the overhead of entering the operating system kernel. We modify PAPI to use rdpmc rather than read() and find it typically improves the latency by at least a factor of three (and often a factor of six or more) on most modern systems. The improvement is even better on machines using a KPTI-enabled kernel to avoid the Meltdown vulnerability. We analyze the effectiveness and limitations of the rdpmc interface and have gotten the rdpmc interface enabled by default in PAPI.
Yan Liu, Vincent M. Weaver

### Generic Library Interception for Improved Performance Measurement and Insight

Abstract
As applications grow in capability, they also grow in complexity. This complexity in turn gets pushed into modules and libraries. In addition, hardware configurations are becoming increasingly elaborate. These two trends make understanding, debugging and analyzing the performance of applications more and more difficult.
To enable detailed insight into library usage of applications, we present an approach and implementation in Score-P that supports intuitive and robust creation of wrappers for arbitrary C/C++ libraries. Runtime analysis then uses these wrappers to keep track of how applications interact with libraries and how they interact with each other, and to record the exact timing of their functions.
Ronny Brendel, Bert Wesarg, Ronny Tschüter, Matthias Weber, Thomas Ilsche, Sebastian Oeste
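
The wrapper idea in the abstract above can be illustrated with a minimal sketch. It is a Python analogy, not the Score-P implementation (which targets C/C++ libraries): every public function of a module is replaced by a wrapper that counts calls and accumulates wall-clock time. The names `wrap_library`, `call_stats`, and the stand-in module `demo_lib` are hypothetical and exist only for this illustration.

```python
import functools
import time
import types
from collections import defaultdict

# Per-function statistics: label -> [call count, total seconds].
call_stats = defaultdict(lambda: [0, 0.0])


def _make_wrapper(label, func):
    """Return a function that forwards to `func` and records its timing."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            stats = call_stats[label]
            stats[0] += 1
            stats[1] += time.perf_counter() - start
    return wrapper


def wrap_library(module):
    """Replace each public plain function of `module` with a timing wrapper."""
    wrapped = []
    for name, func in list(vars(module).items()):
        if name.startswith("_") or not isinstance(func, types.FunctionType):
            continue
        setattr(module, name, _make_wrapper(f"{module.__name__}.{name}", func))
        wrapped.append(name)
    return wrapped


# Hypothetical stand-in for a third-party library (assumption of this sketch).
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale(a, factor):
    return [x * factor for x in a]


demo_lib = types.ModuleType("demo_lib")
demo_lib.dot = dot
demo_lib.scale = scale

wrapped_names = wrap_library(demo_lib)

# The application keeps calling the library through its usual names.
result = demo_lib.dot(demo_lib.scale([1, 2], 2), [3, 4])
demo_lib.scale([5], 3)

for label, (count, seconds) in sorted(call_stats.items()):
    print(f"{label}: {count} call(s), {seconds:.6f} s")
```

Because the wrappers are installed on the module object, callers need no changes; this mirrors the goal of intercepting library calls without touching application code, although the mechanism for compiled libraries is necessarily different.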

### Improved Accuracy for Automated Communication Pattern Characterization Using Communication Graphs and Aggressive Search Space Pruning

Abstract
An understanding of a parallel application’s communication behavior is useful for a range of activities including debugging and optimization, job scheduling, target system selection, and system design. Because it can be challenging to understand communication behavior, especially for those who lack expertise or who are not familiar with the application, I and two colleagues recently developed an automated, search-based approach for recognizing and parameterizing application communication behavior using a library of common communication patterns. This initial approach was effective for characterizing the behavior of many workloads, but I identified some combinations of communication patterns for which the method was inefficient or would fail. In this paper, I discuss one such troublesome pattern combination and propose modifications to the recognition method to handle it. Specifically, I propose an alternative approach that uses communication graphs instead of traditional communication matrices to improve recognition accuracy for collective communication operations, and that uses a non-greedy recognition technique to avoid search space dead-ends that trap the original greedy recognition approach. My modified approach uses aggressive search space pruning and heuristics to control the potential for state explosion caused by its non-greedy pattern recognition method. I demonstrate the improved recognition accuracy and pruning efficacy of the modified approach using several synthetic and real-world communication pattern combinations.
Philip C. Roth

### Moya—A JIT Compiler for HPC

Abstract
We describe Moya, an annotation-driven JIT compiler for compiled languages such as Fortran, C and C++. We show that a combination of a small number of easy-to-use annotations coupled with aggressive static analysis that enables dynamic optimization can be used to improve the performance of computationally intensive, long-running numerical applications. We obtain speedups of up to 1.5 on JIT’ed functions and overcome the overheads of the JIT compilation within 25 timesteps in a combustion-simulation application.
Tarun Prabhu, William Gropp

### Polyhedral Optimization of TensorFlow Computation Graphs

Abstract
We present R-Stream·TF, a polyhedral optimization tool for neural network computations. R-Stream·TF transforms computations performed in a neural network graph into C programs suited to the polyhedral representation and uses R-Stream, a polyhedral compiler, to parallelize and optimize the computations performed in the graph. R-Stream·TF can exploit the optimizations available with R-Stream to generate a highly optimized version of the computation graph, specifically mapped to the targeted architecture. During our experiments, R-Stream·TF was able to automatically reach performance levels close to the hand-optimized implementations, demonstrating its utility in porting neural network computations to parallel architectures.

### CAASCADE: A System for Static Analysis of HPC Software Application Portfolios

Abstract
With the increasing complexity of upcoming HPC systems, so-called “co-design” efforts to develop the hardware and applications in concert for these systems also become more challenging. It is currently difficult to gather information about the usage of programming model features, libraries, and data structure considerations in a quantitative way across a variety of applications, and this information is needed to prioritize development efforts in systems software and hardware optimizations. In this paper we propose CAASCADE, a system that can harvest this information in an automatic way in production HPC environments, and we show some early results from a prototype of the system based on GNU compilers and a MySQL database.
M. Graham Lopez, Oscar Hernandez, Reuben D. Budiardja, Jack C. Wells

### Visual Comparison of Trace Files in Vampir

Abstract
Comparing data is a key activity of performance analysis. It is required to relate performance results before and after optimizations, while porting to new hardware, and when using new programming models and libraries. While comparing profiles is straightforward, relating detailed trace data remains challenging.
This work introduces the Comparison View. This new view extends the trace visualizer Vampir to enable comparative visual performance analysis. It displays multiple traces in one synchronized view and adds a range of alignment techniques to aid visual inspection. We demonstrate the Comparison View’s value in three real-world performance analysis scenarios.
Matthias Weber, Ronny Brendel, Michael Wagner, Robert Dietrich, Ronny Tschüter, Holger Brunst

### Understanding the Scalability of Molecular Simulation Using Empirical Performance Modeling

Abstract
Molecular dynamics (MD) simulation allows for the study of static and dynamic properties of molecular ensembles at various molecular scales, from monatomics to macromolecules such as proteins and nucleic acids. It has applications in biology, materials science, biochemistry, and biophysics. Recent developments in simulation techniques spurred the emergence of the computational molecular engineering (CME) field, which focuses specifically on the needs of industrial users in engineering. Within CME, the simulation code ms2 allows users to calculate thermodynamic properties of bulk fluids. It is a parallel code that aims to scale the temporal range of the simulation while keeping the execution time minimal. In this paper, we use empirical performance modeling to study the impact of simulation parameters on the execution time. Our approach is a systematic workflow that can be used as a blueprint in other fields that aim to scale their simulation codes. We show that the generated models can help users better understand how to scale the simulation with minimal increase in execution time.
Sergei Shudler, Jadran Vrabec, Felix Wolf
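
As a rough illustration of what empirical performance modeling involves, the sketch below fits a single-term power law, t = c · n^a, to measured run times by least squares in log-log space. This is a deliberately simplified stand-in, not the modeling workflow used in the paper; the function name `fit_power_law` and all measurements are made up for the example.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**a in log-log space; returns (c, a)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    a = sxy / sxx                      # slope = exponent of the power law
    c = math.exp(mean_y - a * mean_x)  # intercept = log of the coefficient
    return c, a


# Synthetic "measurements" (made up): generated from t = 0.5 * n**1.5.
sizes = [100, 200, 400, 800, 1600]
times = [0.5 * n ** 1.5 for n in sizes]

c, a = fit_power_law(sizes, times)


def predict(n):
    """Extrapolate the fitted model to an unmeasured parameter value."""
    return c * n ** a


print(f"model: t(n) = {c:.3f} * n^{a:.3f}")
print(f"predicted t(3200) = {predict(3200):.1f}")
```

A model of this kind is only as good as its assumed functional form; real measurements are noisy and often need several candidate terms and more than one parameter.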

### Advanced Event-Sampling Support for PAPI

Abstract
The PAPI performance library is a widely used tool for gathering performance data from running applications. Modern processors support advanced sampling interfaces, such as Intel’s Precise Event Based Sampling (PEBS) and AMD’s Instruction Based Sampling (IBS). The current PAPI sampling interface predates the existence of these interfaces and only provides simple instruction-pointer based samples.
We propose a new, improved, sampling interface that provides support for the extended sampling information available on modern hardware. We extend the PAPI interface to add a new PAPI_sample_init call that uses the Linux perf_event interface to access the extra sample information. A pointer to these samples is returned to the user, who can either decode them on the fly, or write them to disk for later analysis.
By providing extended sampling information, this new PAPI interface allows advanced performance analysis and optimization that was previously not possible. This will greatly enhance the ability to optimize software in modern extreme-scale programming environments.
Forrest Smith, Vincent M. Weaver

### ParLoT: Efficient Whole-Program Call Tracing for HPC Applications

Abstract
The complexity of HPC software and hardware is quickly increasing. As a consequence, the need for efficient execution tracing to gain insight into HPC application behavior is steadily growing. Unfortunately, available tools either do not produce traces with enough detail or incur large overheads. An efficient tracing method that overcomes the tradeoff between maximum information and minimum overhead is therefore urgently needed. This paper presents such a method and tool, called ParLoT, with the following key features. (1) It describes a technique that makes low-overhead on-the-fly compression of whole-program call traces feasible. (2) It presents a new, efficient, incremental trace-compression approach that reduces the trace volume dynamically, which lowers not only the needed bandwidth but also the tracing overhead. (3) It collects all caller/callee relations, call frequencies, call stacks, as well as the full trace of all calls and returns executed by each thread, including in library code. (4) It works on top of existing dynamic binary instrumentation tools, thus requiring neither source-code modifications nor recompilation. (5) It supports program analysis and debugging at the thread, thread-group, and program level. This paper establishes that comparable capabilities are currently unavailable. Our experiments with the NAS parallel benchmarks running on the Comet supercomputer with up to 1,024 cores show that ParLoT can collect whole-program function-call traces at an average tracing bandwidth of just 56 kB/s per core.
Saeed Taheri, Sindhu Devale, Ganesh Gopalakrishnan, Martin Burtscher
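
The idea of compressing a call trace incrementally, while it is being produced, can be sketched as follows. The abstract does not describe ParLoT's compression algorithm, so this example substitutes the general-purpose `zlib` streaming compressor from the Python standard library; the 16-bit function IDs, the class name `CompressedCallTrace`, and the repeated call pattern are assumptions made for illustration.

```python
import struct
import zlib


class CompressedCallTrace:
    """Compress a stream of function IDs on the fly, one event at a time."""

    def __init__(self):
        self._compressor = zlib.compressobj()
        self._chunks = []
        self.events = 0

    def record(self, function_id):
        # Assumption of this sketch: each call event is a 16-bit function ID.
        raw = struct.pack("<H", function_id)
        out = self._compressor.compress(raw)
        if out:  # the compressor emits output only when its buffer fills
            self._chunks.append(out)
        self.events += 1

    def finish(self):
        """Flush the compressor and return the complete compressed trace."""
        self._chunks.append(self._compressor.flush())
        return b"".join(self._chunks)


def decode(blob):
    """Recover the list of function IDs from a compressed trace."""
    raw = zlib.decompress(blob)
    return [fid for (fid,) in struct.iter_unpack("<H", raw)]


# Made-up call pattern, repeated as a loop-heavy program would produce it.
pattern = [1, 2, 3, 2, 3, 4]
trace = CompressedCallTrace()
for _ in range(10_000):
    for fid in pattern:
        trace.record(fid)

blob = trace.finish()
raw_bytes = 2 * trace.events
print(f"{trace.events} events: {raw_bytes} raw bytes -> {len(blob)} compressed bytes")
```

Call traces of loop-dominated codes are highly repetitive, which is why streaming compression can reduce both the trace volume and the bandwidth needed to write it out.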

### Gotcha: A Function-Wrapping Interface for HPC Tools

Abstract
This paper introduces Gotcha, a function wrapping interface and library for HPC tools. Many HPC tools, and performance analysis tools in particular, rely on function wrapping to integrate with applications. But existing mechanisms, such as LD_PRELOAD on Linux, have limitations that lead to tool instability and complexity. Gotcha addresses the limitations in existing mechanisms, provides a programmable interface for HPC tools to manage function wrapping, and supports function wrapping across multiple tools. In addition, this paper introduces the idea of interface-independent function wrapping, which makes it possible for tools to wrap arbitrary application functions.
David Poliakoff, Matt LeGendre

### Projecting Performance Data over Simulation Geometry Using SOSflow and ALPINE

Abstract
The performance of HPC simulation codes is often tied to their simulated domains; e.g., properties of the input decks, boundaries of the underlying meshes, and parallel decomposition of the simulation space. A variety of research efforts have demonstrated the utility of projecting performance data onto the simulation geometry to enable analysis of these kinds of performance problems. However, current methods to do so are largely ad-hoc and limited in terms of extensibility and scalability. Furthermore, few methods enable this projection online, resulting in large storage and processing requirements for offline analysis. We present a general, extensible, and scalable solution for in-situ (online) visualization of performance data projected onto the underlying geometry of simulation codes. Our solution employs the scalable observation system SOSflow with the in-situ visualization framework ALPINE to automatically extract simulation geometry and stream aggregated performance metrics to respective locations within the geometry at runtime. Our system decouples the resources and mechanisms to collect, aggregate, project, and visualize the resulting data, thus mitigating overhead and enabling online analysis at large scales. Furthermore, our method requires minimal user input and modification of existing code, enabling general and widespread adoption.
Chad Wood, Matthew Larsen, Alfredo Gimenez, Kevin Huck, Cyrus Harrison, Todd Gamblin, Allen Malony

### Visualizing, Measuring, and Tuning Adaptive MPI Parameters

Abstract
Adaptive MPI (AMPI) is an advanced MPI runtime environment that offers several features over traditional MPI runtimes, which can lead to a better utilization of the underlying hardware platform and therefore higher performance. These features are overdecomposition through virtualization, and load balancing via rank migration. Choosing which of these features to use, and finding the optimal parameters for them, is a challenging task, however, since different applications and systems may require different options. Furthermore, there is a lack of information about the impact of each option. In this paper, we present a new visualization of AMPI in its companion Projections tool, which depicts the operation of an MPI application and details the impact of the different AMPI features on its resource usage. We show how these visualizations can help to improve the efficiency and execution time of an MPI application. Applying optimizations indicated by the performance analysis to two MPI-based applications results in performance improvements of up to 18% from overdecomposition and load balancing.
Matthias Diener, Sam White, Laxmikant V. Kale
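
To see why overdecomposition combined with rank migration can help, consider the toy model below. It assumes more virtual ranks than cores and compares a static block mapping with a greedy "heaviest rank to least-loaded core" placement. This is a textbook heuristic used for illustration only, not one of AMPI's actual load balancers, and the per-rank loads are made up.

```python
import heapq


def block_mapping(loads, cores):
    """Static mapping: consecutive virtual ranks share a core."""
    per_core = max(1, len(loads) // cores)
    totals = [0.0] * cores
    for rank, load in enumerate(loads):
        totals[min(rank // per_core, cores - 1)] += load
    return totals


def greedy_rebalance(loads, cores):
    """Place the heaviest ranks first, each onto the least-loaded core."""
    heap = [(0.0, core) for core in range(cores)]
    heapq.heapify(heap)
    totals = [0.0] * cores
    for load in sorted(loads, reverse=True):
        total, core = heapq.heappop(heap)
        totals[core] = total + load
        heapq.heappush(heap, (totals[core], core))
    return totals


# Made-up per-rank work: 8 virtual ranks on 4 cores (2x overdecomposition).
loads = [9.0, 8.0, 7.0, 6.0, 2.0, 2.0, 1.0, 1.0]
cores = 4

static = block_mapping(loads, cores)
balanced = greedy_rebalance(loads, cores)

# The slowest core determines the time of each step.
print("static  :", static, "-> step time", max(static))
print("balanced:", balanced, "-> step time", max(balanced))
```

With only one rank per core there would be nothing to migrate; having several virtual ranks per core is what gives a balancer the freedom to even out the load.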

### Visual Analytics Challenges in Analyzing Calling Context Trees

Abstract
Performance analysis is an integral part of developing and optimizing parallel applications for high performance computing (HPC) platforms. Hierarchical data from different sources is typically available to identify performance issues or anomalies. Some hierarchical data such as the calling context can be very large in terms of breadth and depth of the hierarchy. Classic tree visualizations quickly reach their limits in analyzing such hierarchies with the abundance of information to display. In this position paper, we identify the challenges commonly faced by the HPC community in visualizing hierarchical performance data, with a focus on calling context trees. Furthermore, we motivate and lay out the bases of a visualization that addresses some of these challenges.
Alexandre Bergel, Abhinav Bhatele, David Boehme, Patrick Gralka, Kevin Griffin, Marc-André Hermanns, Dušan Okanović, Olga Pearce, Tom Vierjahn
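
For readers unfamiliar with the data structure, the following sketch builds a small calling context tree (CCT) from call-stack samples. Each node represents a function in one specific calling context, so the same function reached through different callers appears as separate nodes. The class `CCTNode`, the helper names, and the sample stacks are hypothetical and serve only to illustrate the structure discussed in the abstract.

```python
class CCTNode:
    """One function in one specific calling context."""

    def __init__(self, name):
        self.name = name
        self.children = {}
        self.inclusive = 0  # samples whose stack passes through this node
        self.exclusive = 0  # samples whose stack ends at this node


def build_cct(samples):
    """Merge call stacks (outermost frame first) into a single tree."""
    root = CCTNode("<root>")
    for stack in samples:
        node = root
        node.inclusive += 1
        for frame in stack:
            if frame not in node.children:
                node.children[frame] = CCTNode(frame)
            node = node.children[frame]
            node.inclusive += 1
        node.exclusive += 1
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


def print_tree(node, indent=0):
    print(f"{' ' * indent}{node.name} (incl={node.inclusive}, excl={node.exclusive})")
    for child in node.children.values():
        print_tree(child, indent + 2)


# Made-up samples; note that mpi_wait occurs in two different contexts.
samples = [
    ("main", "solve", "mpi_wait"),
    ("main", "solve", "compute"),
    ("main", "solve", "compute"),
    ("main", "io"),
    ("main", "io", "mpi_wait"),
]

cct = build_cct(samples)
print_tree(cct)
```

Even this toy tree shows the scaling problem: the node count grows with the number of distinct call paths rather than the number of functions, which is what makes large CCTs hard to display.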

### PaScal Viewer: A Tool for the Visualization of Parallel Scalability Trends

Abstract
Taking advantage of the growing number of cores in supercomputers to increase the scalability of parallel programs is an increasing challenge. Many advanced profiling tools have been developed to assist programmers in the process of analyzing data related to the execution of their program. Programmers can act upon the information generated by these data and make their programs reach higher performance levels. However, the information provided by profiling tools is generally designed to optimize the program for a specific execution environment, with a target number of cores and a target problem size. A code optimization driven towards scalability rather than specific performance requires the analysis of many distinct execution environments instead of details about a single environment. With the goal of providing more useful information for the analysis and optimization of code for parallel scalability, this work introduces the PaScal Viewer tool. It presents a novel and productive way to visualize scalability trends of parallel programs. It consists of four diagrams that offer visual support to identify parallel efficiency trends of the whole program, or parts of it, when running on scaling parallel environments with scaling problem sizes.
Anderson B. N. da Silva, Daniel A. M. Cunha, Vitor R. G. Silva, Alex F. de A. Furtunato, Samuel Xavier-de-Souza
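
The quantity behind such scalability trends can be computed with a few lines of code. The sketch below uses the textbook definition of parallel efficiency, E(p, n) = T(1, n) / (p · T(p, n)), evaluated over a grid of core counts and problem sizes. How PaScal Viewer derives its four diagrams is not specified in the abstract, so this is only an assumed starting point, and all timings are made up.

```python
def efficiency_table(timings):
    """Map {(cores, size): seconds} to {(cores, size): parallel efficiency}."""
    table = {}
    for (cores, size), seconds in timings.items():
        serial = timings[(1, size)]  # single-core time for the same size
        table[(cores, size)] = serial / (cores * seconds)
    return table


# Made-up run times in seconds for two problem sizes on 1, 2 and 4 cores.
timings = {
    (1, 1000): 8.0, (2, 1000): 5.0, (4, 1000): 4.0,
    (1, 2000): 32.0, (2, 2000): 20.0, (4, 2000): 10.0,
}

efficiency = efficiency_table(timings)

core_counts = sorted({cores for cores, _ in timings})
problem_sizes = sorted({size for _, size in timings})

# One row per problem size, one column per core count.
print("size  " + "  ".join(f"p={p:<4}" for p in core_counts))
for size in problem_sizes:
    row = "  ".join(f"{efficiency[(p, size)]:<6.2f}" for p in core_counts)
    print(f"{size:<6}{row}")
```

In this made-up data, efficiency at 4 cores drops to 0.5 for the small problem but stays at 0.8 for the large one; trends of this kind across many configurations are what a scalability-oriented view is meant to expose.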

### Using Deep Learning for Automated Communication Pattern Characterization: Little Steps and Big Challenges

Abstract
Characterization of a parallel application’s communication patterns can be useful for performance analysis, debugging, and system design. However, obtaining and interpreting a characterization can be difficult. AChax implements an approach that uses search and a library of known communication patterns to automatically characterize communication patterns. Our approach has some limitations that reduce its effectiveness for the patterns and pattern combinations used by some real-world applications. By viewing AChax’s pattern recognition problem as an image recognition problem, it may be possible to use deep learning to address these limitations. In this position paper, we present our current ideas regarding the benefits and challenges of integrating deep learning into AChax and our conclusion that a hybrid approach combining deep learning classification, regression, and the existing AChax approach may be the best long-term solution to the problem of parameterizing recognized communication patterns.
Philip C. Roth, Kevin Huck, Ganesh Gopalakrishnan, Felix Wolf

### Visualizing Multidimensional Health Status of Data Centers

Abstract
Monitoring data centers is challenging due to their size, complexity, and dynamic nature. This project proposes a visual approach for situational awareness and health monitoring of high-performance computing systems. The visualization requirements are expanded on the following dimensions: (1) High performance computing spatial layout, (2) Temporal domain (historical vs. real-time tracking), and (3) System health services such as temperature, CPU load, memory usage, fan speed, and power consumption. To show the effectiveness of our design, we demonstrate the developed prototype on a medium-scale data center of 10 racks and 467 hosts. The work was developed using feedback from both industrial and academic domain experts.
Tommy Dang