
## About this book

This book constitutes the refereed post-conference proceedings of 13 workshops held at the 33rd International ISC High Performance 2018 Conference, in Frankfurt, Germany, in June 2018: HPC I/O in the Data Center, HPC-IODC 2018; Workshop on Performance and Scalability of Storage Systems, WOPSSS 2018; 13th Workshop on Virtualization in High-Performance Cloud Computing, VHPC 2018; Third International Workshop on In Situ Visualization, WOIV 2018; 4th International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale, ExaComm 2018; International Workshop on OpenPOWER for HPC, IWOPH 2018; IXPUG Workshop: Many-Core Computing on Intel Processors; Workshop on Sustainable Ultrascale Computing Systems; Approximate and Transprecision Computing on Emerging Technologies, ATCET 2018; First Workshop on the Convergence of Large-Scale Simulation and Artificial Intelligence; Third Workshop for Open Source Supercomputing, OpenSuCo 2018; First Workshop on Interactive High-Performance Computing; Workshop on Performance Portable Programming Models for Accelerators, P^3MA 2018.
The 53 full papers included in this volume were carefully reviewed and selected from 80 submissions. They cover all aspects of research, development, and application of large-scale, high performance experimental and commercial systems. Topics include HPC computer architecture and hardware; programming models, system software, and applications; solutions for heterogeneity, reliability, power efficiency of systems; virtualization and containerized environments; big data and cloud computing; and artificial intelligence.

## Table of contents

### Analyzing the I/O Scalability of a Parallel Particle-in-Cell Code

Understanding the I/O behavior of parallel applications is fundamental both to optimizing I/O performance and to proposing tuning strategies that improve it. In this paper we present the outcome of an I/O optimization project carried out for the parallel astrophysical plasma physics application Acronym, a well-tested particle-in-cell code for astrophysical simulations. Acronym is used on several different supercomputers in combination with the HDF5 library, providing the output in the form of self-describing files. To address the project, we characterized the main parallel I/O sub-system operated at LRZ. Afterwards, we applied two different strategies that improve the initial performance, providing a solution with scalable I/O. The results show that, in the best case, the optimized application runs 4.5x faster than the original version.

Sandra Mendez, Nicolay J. Hammer, Anupam Karmakar

### Cost and Performance Modeling for Earth System Data Management and Beyond

Current and anticipated storage environments confront domain scientists and data center operators with usability, performance and cost challenges. The amount of data that upcoming systems will be required to handle is expected to grow exponentially, mainly due to increasing resolution and affordable compute power. Unfortunately, the relationship between cost and performance is not always well understood, so educated procurement requires considerable effort. Within the Centre of Excellence in Simulation of Weather and Climate in Europe (ESiWACE), models to better understand the cost and performance of current and future systems are being explored. This paper presents models and methodology focusing on, but not limited to, data centers used in the context of climate and numerical weather prediction. The paper concludes with a case study of alternative deployment strategies and outlines the challenges of anticipating their impact on cost and performance. By publishing these early results, we would like to make the case for working collaboratively as a community towards standard models and methodologies, creating sufficient incentives for vendors to provide specifications in formats that are compatible with these modeling tools. In addition, we see applications for such formalized models and information in I/O-related middleware, which is expected to make automated but reasonable decisions in increasingly heterogeneous data centers.

Jakob Lüttgau, Julian Kunkel

### I/O Interference Alleviation on Parallel File Systems Using Server-Side QoS-Based Load-Balancing

Storage performance in supercomputers is variable, depending not only on an application’s workload but also on the types of other concurrent I/O activities. In particular, performance degradation in meta-data accesses leads to poor storage performance across applications running at the same time. We herein focus on two representative performance problems, high load and slow response of a meta-data server, through analysis of meta-data server activities using file system performance metrics on the K computer. We investigate the root causes of such performance problems through MDTEST benchmark runs and confirm the performance improvement by server-side quality-of-service management in service thread assignment for incoming client requests on a meta-data server.

Yuichi Tsujita, Yoshitaka Furutani, Hajime Hida, Keiji Yamamoto, Atsuya Uno, Fumichika Sueyasu

### Tools for Analyzing Parallel I/O

Parallel application I/O performance often does not meet user expectations. Additionally, slight access pattern modifications may lead to significant changes in performance due to complex interactions between hardware and software. These issues call for sophisticated tools to capture, analyze, understand, and tune application I/O. In this paper, we highlight advances in monitoring tools to help address these issues. We also describe best practices, identify issues in measurement and analysis, and provide practical approaches to translate parallel I/O analysis into actionable outcomes for users, facility operators, and researchers.

Julian Martin Kunkel, Eugen Betke, Matt Bryson, Philip Carns, Rosemary Francis, Wolfgang Frings, Roland Laifer, Sandra Mendez

### Understanding Metadata Latency with MDWorkbench

While parallel file systems often satisfy the needs of applications with bulk synchronous I/O, they lack capabilities for dealing with metadata-intense workloads. Typically, in procurements, the focus lies on the aggregated metadata throughput using the MDTest benchmark ( https://www.vi4io.org/tools/benchmarks/mdtest ). However, metadata performance is crucial for interactive use. Metadata benchmarks involve even more parameters than I/O benchmarks. Several aspects are currently not covered and are therefore not a focus of vendors' investigations, in particular response latency and interactive workloads operating on a working set of data. The lack of capabilities in file systems can be observed in the IO-500 list, where metadata performance does not differ significantly between the best and the worst system. In this paper, we introduce a new benchmark called MDWorkbench, which generates a reproducible workload emulating many concurrent users or, in an alternative view, queuing systems. This benchmark provides a detailed latency profile, overcomes caching issues, and provides a method to assess the quality of the observed throughput. We evaluate the benchmark on state-of-the-art parallel file systems with GPFS (IBM Spectrum Scale), Lustre, Cray’s Datawarp, and DDN IME, and conclude that we can reveal characteristics that could not be identified before.

Julian Martin Kunkel, George S. Markomanolis

### From Application to Disk: Tracing I/O Through the Big Data Stack

Typical applications in data science consume, process and produce large amounts of data, making disk I/O one of the dominating factors of their overall performance, and thus one worth optimizing. Distributed processing frameworks, such as Hadoop, Flink and Spark, hide a lot of complexity from the programmer when they parallelize these applications across a compute cluster. This complicates reasoning about the I/O of both the application and the framework, through the distributed file system, such as HDFS, down to the local file systems. We present SFS (Statistics File System), a modular framework to trace each I/O request issued by the application and any JVM-based big data framework involved, mapping these requests to actual disk I/O. This allows detection of inefficient I/O patterns, both by the applications and the underlying frameworks, and builds the basis for improving I/O scheduling in the big data software stack.

Robert Schmidtke, Florian Schintke, Thorsten Schütt

### IOscope: A Flexible I/O Tracer for Workloads’ I/O Pattern Characterization

Storage systems are becoming increasingly complex in order to handle HPC and Big Data requirements. This complexity calls for in-depth evaluations to ensure the absence of issues in all layers of a system. However, current performance evaluation is built around high-level metrics for the sake of simplicity, which makes it impossible to catch potential I/O issues in the lower layers of the Linux I/O stack. In this paper, we introduce the IOscope tracer for uncovering the I/O patterns of storage systems’ workloads. It performs filtering-based profiling over fine-grained criteria inside the Linux kernel. IOscope has near-zero overhead and verified behaviour inside the kernel because it relies on extended Berkeley Packet Filter (eBPF) technology. We demonstrate the capabilities of IOscope to discover pattern-related issues through a performance study on MongoDB and Cassandra. Results show that clustered MongoDB suffers from a noisy I/O pattern regardless of the storage medium used (HDDs or SSDs). Hence, IOscope supports a better troubleshooting process and contributes to an in-depth understanding of I/O performance.

Abdulqawi Saif, Lucas Nussbaum, Ye-Qiong Song

### Exploring Scientific Application Performance Using Large Scale Object Storage

One of the major performance and scalability bottlenecks in large scientific applications is parallel reading from and writing to supercomputer I/O systems. The use of parallel file systems and the consistency requirements of POSIX, to which all the traditional HPC parallel I/O interfaces adhere, limit the scalability of scientific applications. Object storage is a widely used storage technology in cloud computing and is increasingly proposed for HPC workloads to improve the scalability and performance of I/O in scientific applications. While object storage is a promising technology, it is still unclear how scientific applications will use object storage and what the main performance benefits will be. This work addresses these questions by emulating an object store used by a traditional scientific application and evaluating potential performance benefits. We show that scientific applications can benefit from the use of object storage at large scale.

Steven Wei-der Chien, Stefano Markidis, Rami Karim, Erwin Laure, Sai Narasimhamurthy

### Benefit of DDN’s IME-FUSE for I/O Intensive HPC Applications

Many scientific applications are limited by the I/O performance offered by parallel file systems on conventional storage systems. Flash-based burst buffers provide significantly better performance than HDD-backed storage, but at the expense of capacity. Burst buffers are considered the next step towards achieving the wire speed of the interconnect and providing more predictable low-latency I/O, which are the holy grail of storage. A critical evaluation of storage technology is mandatory, as there is no long-term experience with performance behavior for particular application scenarios. The evaluation enables data centers to choose the right products and system architects to integrate them into HPC architectures. This paper investigates the native performance of DDN-IME, a flash-based burst buffer solution. Then, it takes a closer look at the IME-FUSE file system, which uses IMEs as a burst buffer and a Lustre file system as back-end. Finally, by utilizing a NetCDF benchmark, it estimates the performance benefit for climate applications.

Eugen Betke, Julian Kunkel

### Performance Study of Non-volatile Memories on a High-End Supercomputer

The first exa-scale supercomputers are expected to be operational in China, USA, Japan and Europe in the early 2020s. This allows scientists to execute applications at extreme scale with more than $$10^{18}$$ floating point operations per second (exa-FLOPS). However, the number of FLOPS is not the only parameter that determines the final performance. In order to store intermediate results or to provide fault tolerance, most applications need to perform a considerable amount of I/O operations during runtime. The performance of those operations is determined by the throughput from volatile (e.g. DRAM) to non-volatile stable storage. Given the slow growth in network bandwidth compared to the computing capacity on the nodes, it is highly beneficial to deploy local stable storage such as the new non-volatile memories (NVMe), in order to avoid the transfer through the network to the parallel file system. In this work, we analyse the performance of three different storage levels of the CTE-POWER9 cluster, located at the Barcelona Supercomputing Center (BSC). We compare the throughput of node-local SSD and NVMe devices with that of GPFS under various scenarios and settings. We measured a maximum performance on 16 nodes of 83 GB/s using NVMe devices, 5.6 GB/s for SSD devices and 4.4 GB/s for writes to the GPFS.

Leonardo Bautista Gomez, Kai Keller, Osman Unsal

### Self-optimization Strategy for IO Accelerator Parameterization

Reaching exascale demands a high level of automation on HPC supercomputers. In this paper, a self-optimization strategy is proposed to improve application IO performance using statistical and machine learning based methods. The proposed method takes advantage of collected IO data through an off-line analysis to infer the most relevant parameterization of an IO accelerator that should be used for the next launch of a similar job. This is thus a continuous improvement process that converges toward an optimal parameterization over successive iterations. The inference process uses a numerical optimization method to propose the parameterization that minimizes the execution time of the considered application. A regression method is used to model the objective function to be optimized from a sparse set of data collected from past runs. Experiments on different artificial parametric spaces show that the proposed method requires fewer than 20 runs to converge toward a parameterization of the IO accelerator.

Lionel Vincent, Mamady Nabe, Gaël Goret
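
The loop described in this abstract (fit a regression model to sparse runtime observations, then pick the parameterization that minimizes the model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the regression or optimization method, so the sketch uses a one-dimensional quadratic least-squares fit and a search over a candidate list, and the names `fit_quadratic`, `propose_next_parameter`, and the sample data are hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c (assumed model, not the paper's)."""
    # Build the 3x3 normal equations for the features (x^2, x, 1).
    rows = [(x * x, x, 1.0) for x in xs]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Solve with Gaussian elimination and partial pivoting.
    m = [ata[i] + [aty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    w = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, 3))
        w[i] = (m[i][3] - tail) / m[i][i]
    return tuple(w)


def propose_next_parameter(history, candidates):
    """Return the candidate with the lowest predicted runtime.

    history: list of (parameter_value, measured_runtime) from past runs.
    candidates: parameter values allowed for the next run.
    """
    xs = [p for p, _ in history]
    ys = [t for _, t in history]
    a, b, c = fit_quadratic(xs, ys)
    return min(candidates, key=lambda x: a * x * x + b * x + c)


# Hypothetical past runs: (accelerator parameter, execution time in seconds).
history = [(1, 35.0), (4, 14.0), (8, 14.0), (12, 46.0)]
best = propose_next_parameter(history, range(1, 17))
```

In an iterative setting, the runtime measured with the proposed parameter would be appended to `history` before the next proposal, which is what makes the process a continuous improvement loop.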

### utmem: Towards Memory Elasticity in Cloud Workloads

In environments where multiple virtual machines are colocated on the same physical host, the semantic gap between the host and the guests leads to suboptimal memory management. Solutions such as ballooning are unable to modify the amount of memory available to the guest fast enough to avoid performance degradation. Alternatives such as Transcendent Memory allow the guest to use host memory instead of swapping to disk. All these techniques are applied at the memory management subsystem level, resulting in cases where abrupt changes in memory utilization cause unnecessary guest-side swapping. We propose Userspace Transcendent Memory (utmem), a version of Transcendent Memory that can be directly utilized by applications without interference from the guest OS. Our results demonstrate that our approach succeeds in allowing the guests to rapidly adjust the amount of memory they use more efficiently than both ballooning and Transcendent Memory.

Aimilios Tsalapatis, Stefanos Gerangelos, Stratos Psomadakis, Konstantinos Papazafeiropoulos, Nectarios Koziris

### Efficient Live Migration of Linux Containers

In recent years, operating system level virtualization has grown in popularity due to its capability to isolate multiple userspace environments and to allow for their co-existence within a single OS kernel instance. Checkpoint-restore in Userspace (CRIU) is a tool that allows live migration of a hierarchy of processes (a container) between two physical computers. However, live migration may cause significant delays when the applications running inside a container modify large amounts of memory faster than the container can be transferred over the network to a remote host. In this paper, we propose a novel approach for live migration of containers that addresses this issue by utilizing a recently published CRIU feature, the so-called “image cache/proxy”. By avoiding the use of secondary storage, this feature improves both the total migration time and the down time of the migrated container applications.

### Coupling the Uintah Framework and the VisIt Toolkit for Parallel In Situ Data Analysis and Visualization and Computational Steering

Data analysis and visualization are an essential part of the scientific discovery process. As HPC simulations have grown, I/O has become a bottleneck, which has required scientists to turn to in situ tools for simulation data exploration. Incorporating additional data, such as runtime performance data, into the analysis or I/O phases of a workflow is routinely avoided for fear of exacerbating performance issues. The paper presents how the Uintah Framework, a suite of HPC libraries and applications for simulating complex chemical and physical reactions, was coupled with VisIt, an interactive analysis and visualization toolkit, to allow scientists to perform parallel in situ visualization of simulation and runtime performance data. As an additional benefit, the coupling made it possible to create a “simulation dashboard” that allowed for in situ computational steering and visual debugging.

Allen Sanderson, Alan Humphrey, John Schmidt, Robert Sisneros

### Binning Based Data Reduction for Vector Field Data of a Particle-In-Cell Fusion Simulation

With this work, we explore the feasibility of using in situ data binning techniques to achieve significant data reductions for particle data, and study the associated errors for several post-hoc analysis techniques. We perform an application study in collaboration with fusion simulation scientists on data sets up to 489 GB per time step. We consider multiple ways to carry out the binning, and determine which techniques work the best for this simulation. With the best techniques we demonstrate reduction factors as large as 109x with low error percentage.

James Kress, Jong Choi, Scott Klasky, Michael Churchill, Hank Childs, David Pugmire
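
To make the idea of binning-based reduction concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the binning schemes that were evaluated, so this example assumes a uniform 2D grid and keeps only a particle count and a mean velocity vector per bin; the function name `bin_particles` and the sample particles are hypothetical.

```python
from collections import defaultdict


def bin_particles(particles, bin_size):
    """Reduce particles to per-bin summaries on a uniform 2D grid.

    particles: iterable of (x, y, vx, vy) tuples.
    Returns {(ix, iy): (count, mean_vx, mean_vy)}.
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y, vx, vy in particles:
        # Map the particle position to an integer bin index.
        key = (int(x // bin_size), int(y // bin_size))
        acc = sums[key]
        acc[0] += 1
        acc[1] += vx
        acc[2] += vy
    return {k: (n, sx / n, sy / n) for k, (n, sx, sy) in sums.items()}


# Hypothetical particles: two fall into bin (0, 0), one into bin (1, 0).
particles = [
    (0.1, 0.2, 1.0, 0.0),
    (0.4, 0.3, 3.0, 2.0),
    (1.5, 0.1, -1.0, 1.0),
]
bins = bin_particles(particles, bin_size=1.0)
reduction_factor = len(particles) / len(bins)
```

The reduction factor grows with the number of particles per bin, while the error of any post-hoc analysis depends on how much the per-particle values vary inside each bin, which is the trade-off the abstract refers to.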

### In Situ Analysis and Visualization of Fusion Simulations: Lessons Learned

The trends in high performance computing, where far more data can be computed than can ever be stored, have made in situ techniques an important area of research and development. Simulation campaigns, where domain scientists work with computer scientists to run a simulation and perform in situ analysis and visualization, are important and complex undertakings. In this paper we report our experiences performing in situ analysis and visualization on two campaigns. The two campaigns were related, but had important differences in terms of the codes that were used, the types of analysis and visualization required, and the visualization tools used. Further, we report the lessons learned from each campaign.

Mark Kim, James Kress, Jong Choi, Norbert Podhorszki, Scott Klasky, Matthew Wolf, Kshitij Mehta, Kevin Huck, Berk Geveci, Sujin Phillip, Robert Maynard, Hanqi Guo, Tom Peterka, Kenneth Moreland, Choong-Seock Chang, Julien Dominski, Michael Churchill, David Pugmire

### Design of a Flexible In Situ Framework with a Temporal Buffer for Data Processing and Visualization of Time-Varying Datasets

This paper presents an in situ framework focused on time-varying simulations, which uses a novel temporal buffer for storing simulation results sampled at user-defined intervals. This framework has been designed to provide flexible data processing and visualization capabilities in modern HPC operational environments composed of powerful front-end systems, for pre- and post-processing purposes, along with traditional back-end HPC systems. The temporal buffer is implemented using the functionalities provided by the Open Address Space (OpAS) library, which enables asynchronous one-sided communication from outside processes to any exposed memory region on the simulator side. This buffer can store time-varying simulation results, which can be processed via in situ approaches with different proximities. We present a prototype of our framework and the code integration process with a target simulation code. The proposed in situ framework utilizes separate files to describe the initialization and execution codes, which are in the form of Python scripts. This framework also enables the runtime modification of these Python-based files, thus providing greater flexibility to the users, not only for data processing, such as visualization and analysis, but also for simulation steering.

Kenji Ono, Jorji Nonaka, Hiroyuki Yoshikawa, Takeshi Nanri, Yoshiyuki Morie, Tomohiro Kawanabe, Fumiyoshi Shoji

### Streaming Live Neuronal Simulation Data into Visualization and Analysis

Neuroscientists want to inspect the data their simulations are producing while these are still running. This will on the one hand save them time waiting for results and therefore insight. On the other, it will allow for more efficient use of CPU time if the simulations are being run on supercomputers. If they had access to the data being generated, neuroscientists could monitor it and take counter-actions, e.g., parameter adjustments, should the simulation deviate too much from in-vivo observations or get stuck. As a first step toward this goal, we devise an in situ pipeline tailored to the neuroscientific use case. It is capable of recording and transferring simulation data to an analysis/visualization process, while the simulation is still running. The developed libraries are made publicly available as open source projects. We provide a proof-of-concept integration, coupling the neuronal simulator NEST to basic 2D and 3D visualization.

Simon Oehrl, Jan Müller, Jan Schnathmeier, Jochen Martin Eppler, Alexander Peyser, Hans Ekkehard Plesser, Benjamin Weyers, Bernd Hentschel, Torsten W. Kuhlen, Tom Vierjahn

### Enabling Explorative Visualization with Full Temporal Resolution via In Situ Calculation of Temporal Intervals

We explore a technique for saving full spatiotemporal simulation data for visualization and analysis. While such data is typically prohibitively large to store, we consider an in situ reduction approach that takes advantage of temporal coherence to make storage sizes tractable in some cases. Rather than limiting our data reduction to individual time slices or time windows, our algorithms act on individual locations and save data to disk as temporal intervals. Our results show that the efficacy of piecewise approximations varies based on the desired error bound guarantee and tumultuousness of the time-varying data. We ran our in situ algorithms for one simulation and experienced promising results compared to the traditional paradigm. We also compared the results to two data reduction operators: wavelets and SZ.

Nicole Marsaglia, Shaomeng Li, Hank Childs
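
The idea of storing each location's time series as temporal intervals under an error bound can be sketched as follows. This is an illustrative assumption, not the authors' algorithm: the sketch uses a greedy piecewise-constant approximation in which an interval is extended as long as all its samples fit within twice the error bound, so that the interval midpoint is within the bound of every sample. The names `compress_to_intervals` and `reconstruct` and the sample series are hypothetical.

```python
def compress_to_intervals(values, error_bound):
    """Greedy piecewise-constant approximation of one location's time series.

    Returns a list of (start_step, end_step_inclusive, representative_value).
    Each representative differs from every sample in its interval by at most
    error_bound (up to floating-point rounding).
    """
    intervals = []
    if not values:
        return intervals
    start = 0
    lo = hi = values[0]
    for i in range(1, len(values)):
        new_lo = min(lo, values[i])
        new_hi = max(hi, values[i])
        if new_hi - new_lo > 2 * error_bound:
            # The new sample does not fit: close the current interval.
            intervals.append((start, i - 1, (lo + hi) / 2))
            start = i
            lo = hi = values[i]
        else:
            lo, hi = new_lo, new_hi
    intervals.append((start, len(values) - 1, (lo + hi) / 2))
    return intervals


def reconstruct(intervals):
    """Expand intervals back into one value per time step."""
    out = []
    for start, end, rep in intervals:
        out.extend([rep] * (end - start + 1))
    return out


# Hypothetical time series at a single grid location.
series = [1.0, 1.1, 0.9, 5.0, 5.2, 5.1, 1.0]
intervals = compress_to_intervals(series, error_bound=0.2)
approx = reconstruct(intervals)
```

A smooth series collapses into a few long intervals, while a rapidly fluctuating one produces many short intervals, which matches the abstract's observation that efficacy depends on the error bound and on how turbulent the data is.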

### In-Situ Visualization of Solver Residual Fields

Whereas the design and development of numerical solvers for field-based simulations is a highly evolved discipline, and whereas there exists a wide range of visualization techniques for the (in-situ) analysis of their numerical results, the techniques for analyzing the operation of such solvers are rather elementary. In this paper, we present a visualization approach for in-situ analysis of the processes within numerical solvers. That is, instead of visualizing the data that result from such solvers, we address the visualization of the processes that generate the data. We exemplify our approach using different simulation runs, and discuss its in-situ application in high-performance computing environments.

Kai Sdeo, Boyan Zheng, Marian Piatkowski, Filip Sadlo

### An In-Situ Visualization Approach for the K Computer Using Mesa 3D and KVS

Although the K computer has been operational for more than five years, it is still ranked in the top 10 of the Top500 list and is in active use, especially in Japan. One peculiarity of this system is its SPARC64fx CPU, which has no instruction set compatibility with other traditional CPU architectures. Another is its two-staged parallel file system, where the necessary data is moved from the user-accessible GFS (Global File System) to a faster LFS (Local File System) to enable high-performance I/O during the simulation run. Since users have no access to the data during the simulation run, the tightly coupled (co-processing) in-situ visualization approach seems to be the most suitable for this HPC system. For visualization purposes, the hardware developer (Fujitsu) did not provide or support the traditional Mesa 3D graphics library on their SPARC64fx CPU and instead provided a non-OSS (Open Source Software) and non-OpenGL visualization library with a Particle-Based Volume Rendering (PBVR) implementation, including an API for in-situ visualization. In order to provide a more traditional in-situ visualization alternative for K computer users, we focused on the Mesa 3D graphics library and on an OpenGL-based KVS (Kyoto Visualization System) library. We expect that this approach can also be useful in other SPARC64fx HPC environments because of the binary compatibility.

Kengo Hayashi, Naohisa Sakamoto, Jorji Nonaka, Motohiko Matsuda, Fumiyoshi Shoji

### Comparing Controlflow and Dataflow for Tensor Calculus: Speed, Power, Complexity, and MTBF

Milos Kotlar, Veljko Milutinovic

### Supercomputer in a Laptop: Distributed Application and Runtime Development via Architecture Simulation

Architecture simulation can aid in predicting and understanding application performance, particularly for proposed hardware or large system designs that do not exist. In network design studies for high-performance computing, most simulators focus on the dominant message passing (MPI) model. Currently, many simulators build and maintain their own simulator-specific implementations of MPI. This approach has several drawbacks. Rather than reusing an existing MPI library, simulator developers must implement all semantics, collectives, and protocols. Additionally, alternative runtimes like GASNet cannot be simulated without again building a simulator-specific version. It would be far more sustainable and flexible to maintain lower-level layers like uGNI or IB-verbs and reuse the production runtime code. Directly building and running production communication runtimes inside a simulator poses technical challenges, however. We discuss these challenges and show how they are overcome via the macroscale components for the Structural Simulation Toolkit (SST), leveraging a basic source-to-source tool to automatically adapt production code for simulation. SST is able to encapsulate and virtualize thousands of MPI ranks in a single simulator process, providing a “supercomputer in a laptop” environment. We demonstrate the approach for the production GASNet runtime over uGNI running inside SST. We then discuss the capabilities enabled, including investigating performance with tunable delays, deterministic debugging of race conditions, and distributed debugging with serial debuggers.

Samuel Knight, Joseph P. Kenny, Jeremiah J. Wilke

### CGYRO Performance on Power9 CPUs and Volta GPUs

CGYRO, an Eulerian gyrokinetic solver designed and optimized for collisional, electromagnetic, multiscale fusion plasma simulation, has been ported and benchmarked on a Summit-like Power9-based system equipped with Volta GPUs. We present our experience porting the application and provide benchmark numbers obtained on the Power-based node and compare them with equivalent tests from several leadership class systems. The tested node provided the fastest single-node CGYRO runtimes we’ve measured to date.

I. Sfiligoi, J. Candy, M. Kostuk

### A 64-GB Sort at 28 GB/s on a 4-GPU POWER9 Node for Uniformly-Distributed 16-Byte Records with 8-Byte Keys

Govinderaju et al. [1] have shown that a hybrid CPU-GPU system is cost-performance effective at sorting large datasets on a single node, but thus far large clusters used on sorting benchmarks have been limited by network and storage performance, and such clusters have remained CPU-only. With network and storage bandwidths improving more rapidly than CPU throughput performance, the cost effectiveness of CPU-GPU clusters for large sorts should be re-examined. As a first step, we evaluate sort performance on a single GPU-accelerated node with initial and final data residing in system memory. Access to main memory is limited to two reads and two writes, while executing the partitioning and sort in GPU memory. On a dual-socket IBM POWER9 system with four NVlink-attached NVIDIA V100 GPUs a single-node sort of 64 GB 8-byte key, 8-byte value records completes in under 2.3 s corresponding to a sort rate of over 28 GB/s. On a small (4-node) cluster with the same amount of data per node, the cluster sort completes in under 4.5 s. Sort performance is enabled by high system memory bandwidth, managing system-memory NUMA affinities, high CPU-GPU bandwidth, an efficient GPU-based partitioner, and an optimized GPU sort implementation. A cluster version of the algorithm benefits from minimizing copy operations by using RDMA. Matching the throughput of an optimized partitioner for our system would require a 50-100 GB/s network, which is feasible with a dual-socket POWER9 system.

Gordon C. Fossum, Ting Wang, H. Peter Hofstee

### Early Experience on Running OpenStaPLE on DAVIDE

In this contribution we measure the computing and energy performance of the recently developed DAVIDE HPC-cluster, a massively parallel machine based on IBM POWER CPUs and NVIDIA Pascal GPUs. As an application benchmark we use the OpenStaPLE Lattice QCD code, written using the OpenACC programming framework. Our code exploits the computing performance of GPUs through the use of OpenACC directives, and uses OpenMPI to manage the parallelism among several GPUs. We analyze the speed-up and the aggregate performance of the code, and try to identify possible bottlenecks that harm performance. Using the power monitoring tools available on DAVIDE, we also discuss some energy aspects, pointing out the best trade-offs between time-to-solution and energy-to-solution.

Claudio Bonati, Enrico Calore, Massimo D’Elia, Michele Mesiti, Francesco Negro, Sebastiano Fabio Schifano, Giorgio Silvi, Raffaele Tripiccione

### Porting and Benchmarking of BWAKIT Pipeline on OpenPOWER Architecture

Next Generation Sequencing (NGS) technology produces large volumes of genome data, which are processed using various open source bioinformatics tools. The configuration and compilation of some bioinformatics tools (e.g. BWAKIT, root) is a challenging activity in its own right, not to mention the need to perform more elaborate porting activities for these applications on some architectures (e.g. IBM Power). Best practices of application porting should ensure that (i) the semantics of the program or algorithm are not changed, (ii) the output generated from the original source code and from the modified source code (i.e., after porting) is the same even though the code is ported to different architectures, and (iii) the output is similar across different architectures after porting. Burrows-Wheeler Aligner (BWA) is the most popular genome mapping application used in the BWAKIT toolset. BWAKIT provides pre-compiled binaries for the x86_64 architecture and an end-to-end solution for genome mapping. In this paper, we show how to port various pre-built application binaries used in BWAKIT to the OpenPOWER architecture and execute the BWAKIT pipeline successfully. Additionally, we demonstrate the validity of the output on OpenPOWER and present benchmarking results of BWAKIT applications that indicate the suitability of the highly multithreaded OpenPOWER architecture for executing these applications.

Nagarajan Kathiresan, Rashid Al-Ali, Puthen Jithesh, Ganesan Narayanasamy, Zaid Al-Ars

### Improving Performance and Energy Efficiency on OpenPower Systems Using Scalable Hardware-Software Co-design

Exascale level of High Performance Computing (HPC) implies performance under stringent power constraints. Achieving power consumption targets for HPC systems requires hardware-software co-design to manage static and dynamic power consumption. We present extensions to the open source Global Extensible Open Power Manager (GEOPM) framework, which allows for rapid prototyping of various power and performance optimization strategies for exascale workloads. We have ported GEOPM to the OpenPower® architecture and have used our modifications to investigate performance and power consumption optimization strategies for real-world scientific applications.

Miloš Puzović, Vadim Elisseev, Kirk Jordan, James Mcdonagh, Alexander Harrison, Robert Sawko

### Porting DMRG++ Scientific Application to OpenPOWER

With rapidly changing microprocessor designs and architectural diversity (multi-cores, many-cores, accelerators) in next-generation HPC systems, scientific applications must adapt to the hardware to exploit the different types of parallelism and resources available in the architecture. To benefit from all the in-node hardware threads, it is important to use a single programming model to map and coordinate the available work to the different heterogeneous execution units in the node (e.g., latency-optimized multi-core hardware threads, bandwidth-optimized accelerators, etc.). Our goal is to show that we can manage the node complexity of these systems by using OpenMP for in-node parallelization, exploiting the different “programming styles” supported by OpenMP 4.5 to program CPU cores and accelerators. Finding a suitable programming style (e.g., SPMD style, multi-level tasks, accelerator programming, nested parallelism, or a combination of these) that uses the latest features of OpenMP to maximize performance and achieve performance portability across heterogeneous and homogeneous systems is still an open research problem. We developed a mini-application, Kronecker Product (KP), from the computational motif (sparse matrix algebra) of the original DMRG++ application to experiment with different OpenMP programming styles on an OpenPOWER architecture, and we present the results in this paper.

Arghya Chatterjee, Gonzalo Alvarez, Eduardo D’Azevedo, Wael Elwasif, Oscar Hernandez, Vivek Sarkar

### Job Management with mpi_jm

Access to Leadership computing is required for HPC applications that need a large fraction of compute nodes for a single computation, and also for use cases where a large volume of smaller tasks can only be completed in a competitive or reasonable time frame through use of these Leadership computing facilities. In the latter case, a robust and lightweight manager is ideal so that all these tasks can be computed in a machine-friendly way, notably with minimal use of mpirun or equivalent to launch the executables (simple bundling of tasks can over-tax the service nodes and crash the entire scheduler). Our library, mpi_jm, can manage such allocations, provided the requisite MPI functionality is available. mpi_jm is fault-tolerant against a modest number of down or non-communicative nodes, can begin executing work on smaller portions of a larger allocation before all nodes become available, can manage GPU-intensive and CPU-only work independently, and can overlay them peacefully on shared nodes. It is easily incorporated into existing MPI-capable executables, which can then run both independently and under mpi_jm management. It provides a flexible Python interface, unlocking many high-level libraries, while also tightly binding users’ executables to hardware.

Evan Berkowitz, Gustav Jansen, Kenneth McElvain, André Walker-Loud

### Compile-Time Library Call Detection Using CAASCADE and XALT

CAASCADE (Compiler-Assisted Application Source Code Analysis and DatabasE) is a tool that summarizes the use of parallel programming language features in application source code using compiler technology. This paper discusses the library detection capability within CAASCADE, which finds information about the usage of scientific libraries within the source code. The information that CAASCADE collects provides insights into the usage of library calls in an application. CAASCADE can classify the APIs by scientific library (e.g. LAPACK, BLAS, FFTW). It can also detect the context in which a library API is being invoked, for example within a serial or multi-threaded region. To collect this information, CAASCADE uses compiler plugins that summarize procedural information and uses Apache Spark to perform inter-procedural analysis to reconstruct call chains. In addition, we integrated CAASCADE with XALT to collect library information based on linkage and the modules installed on a system.

Jisheng Zhao, Oscar R. Hernandez, Reuben D. Budiardja, M. Graham Lopez, Vivek Sarkar, Jack C. Wells

### NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

High-performance computing increasingly relies on heterogeneous systems with specialized hardware accelerators to improve application performance. For example, NVIDIA’s CUDA programming system and general-purpose GPUs have emerged as a widespread accelerator in HPC systems. This trend has exacerbated challenges of data placement as accelerators often have fast local memories to fuel their computational demands, but slower interconnects to feed those memories. Crucially, real-world data-transfer performance is strongly influenced not just by the underlying hardware, but by the capabilities of the programming systems. Understanding how application performance is affected by the logical communication exposed through abstractions, as well as the underlying system topology, is crucial for developing high-performance applications and architectures. This report presents initial data-transfer microbenchmark results from two POWER-based systems obtained during work towards developing an automated system performance characterization tool.

Carl Pearson, I-Hsin Chung, Zehra Sura, Wen-Mei Hwu, Jinjun Xiong

### Sparse CSB_Coo Matrix-Vector and Matrix-Matrix Performance on Intel Xeon Architectures

The CSB_Coo sparse matrix format is especially useful in situations such as eigenvalue problems where efficient SPMV and transposed SPMV_T operations are required. One strategy to increase the arithmetic intensity of large scale parallel solvers is to use a blocked eigensolver such as LOBPCG and to operate on blocks of vectors to achieve greater performance. However, this solution is not always practical, as MPI communication may be higher, leading to inefficiencies, or the increased memory usage of dense vectors may be impractical. Additionally, the Lanczos algorithm is well tested in production and may be preferred in some situations. On modern architectures vectorization is key for obtaining good performance. In this paper we show the performance optimization and benefits of vectorization with AVX-512 Conflict Detection (CD) instructions in the case of a standard SPMV operation on a single vector. We also present a modified version of the CSB_Coo format which allows more efficient vector operations. We compare and analyze performance on Haswell, Xeon Phi (KNL and KNM) and Intel Xeon Scalable processors (Skylake).

Brandon Cook, Charlene Yang, Thorsten Kurth, Jack Deslippe

### Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD

In recent years, adaptive aggregation multi-grid (AAMG) methods have become the gold standard for solving the Dirac equation in Lattice QCD (LQCD) using Wilson-Clover fermions. These methods are able to overcome the critical slowing down as quark masses approach their physical values and are thus the go-to method for performing Lattice QCD calculations at realistic physical parameters. In this paper we discuss the optimization of a specific building block for implementing AAMG for Wilson-Clover fermions from LQCD, known as the coarse restrictor operator, on contemporary Intel processors featuring large SIMD widths and high thread counts. We will discuss in detail the efficient use of OpenMP and Intel vector intrinsics in our attempts to exploit fine grained parallelism on the coarsest levels. We present performance optimizations and discuss the ramifications for implementing a full AAMG stack on Intel Xeon Phi Knights Landing and Skylake processors.

Bálint Joó, Thorsten Kurth

### Distributed Training of Generative Adversarial Networks for Fast Detector Simulation

The simulation of the interaction of particles in High Energy Physics detectors is a compute-intensive task. Since some level of approximation is acceptable, it is possible to implement simplified fast simulation models that have the advantage of being less computationally intensive. Here we present a fast simulation based on Generative Adversarial Networks (GANs). The model is constructed from a generative network describing the detector response and a discriminative network, trained in an adversarial manner. The adversarial training process is compute-intensive, and the application of a distributed approach becomes particularly important. We present scaling results of a data-parallel approach to distributing GAN training across multiple nodes on TACC’s Stampede2. The efficiency achieved was above 94% when going from 1 to 128 Xeon Scalable Processor nodes. We report on the accuracy of the generated samples and on the scaling of time-to-solution. We demonstrate how HPC installations could be utilized to globally optimize this kind of model, leading to quicker research cycles and experimentation thanks to their large computation power and excellent connectivity.

Sofia Vallecorsa, Federico Carminati, Gulrukh Khattak, Damian Podareanu, Valeriu Codreanu, Vikram Saletore, Hans Pabst

### Cache-Aware Roofline Model and Medical Image Processing Optimizations in GPUs

When optimizing or porting applications to new architectures, a preliminary characterization is necessary to exploit the maximum computing power of the employed devices. Profiling tools are available for numerous architectures and programming models, making it easier to spot possible bottlenecks. However, for a better interpretation of the collected results, current profilers rely on insightful performance models. In this paper, we describe the Cache Aware Roofline Model (CARM) and tools for its generation to enable the performance characterization of GPU architectures and workloads. We use CARM to characterize two kernels that are part of a 3D iterative reconstruction application for Computed Tomography (CT). These two kernels take most of the execution time of the whole method and are therefore suitable for a deeper analysis. By exploring the model and the methodology proposed, the overall performance of the kernels has been improved by up to a factor of two compared to the previous implementations.

Estefania Serrano, Aleksandar Ilic, Leonel Sousa, Javier Garcia-Blas, Jesus Carretero

### How Pre-multicore Methods and Algorithms Perform in Multicore Era

Many classical methods and algorithms developed when single-core CPUs dominated the parallel computing landscape are still widely used in the changed multicore world. Two prominent examples are load balancing, which has been one of the main techniques for minimizing the computation time of parallel applications since the beginning of parallel computing, and model-based power/energy measurement techniques using performance events. In this paper, we show that in the multicore era load balancing is no longer synonymous with optimization, and we present recent methods and algorithms for optimizing parallel applications for performance and energy on modern HPC platforms, which do not rely on load balancing and often return imbalanced but optimal solutions. We also show that some fundamental assumptions about performance events, which have to be true for the model-based power/energy measurement tools to be accurate, are increasingly difficult to satisfy as the number of CPU cores increases. Therefore, energy-aware computing methods relying on these tools will be increasingly difficult to verify.

### Impact of Approximate Memory Data Allocation on a H.264 Software Video Encoder

This paper analyzes the tolerance of an H.264 software video encoder to errors in its data, proposes a strategy for selecting data structures for approximate memory allocation, and reports the impact on output video quality. Applications that tolerate errors in their data structures are known as ETAs (Error Tolerant Applications) and play an important part in driving interest in approximate computing research. We centered our study on H.264 video encoding, a video compression format developed for use in high definition systems and today one of the most widespread video compression standards, used for broadcast, consumer and mobile applications. While the data fault resilience of H.264 has already been studied considering unwanted and random faults due to unreliable hardware platforms, an analysis considering controlled hardware faults and the corresponding energy-quality tradeoff has never been proposed.

Giulia Stazi, Lorenzo Adani, Antonio Mastrandrea, Mauro Olivieri, Francesco Menichelli

### Residual Replacement in Mixed-Precision Iterative Refinement for Sparse Linear Systems

We investigate the solution of sparse linear systems via iterative methods based on Krylov subspaces. Concretely, we combine the use of extended precision in the outer iterative refinement with a reduced precision in the inner Conjugate Gradient solver. This method is additionally enhanced with different residual replacement strategies that aim to avoid the pitfalls due to the divergence between the actual residual and the recurrence formula for this parameter computed during the iteration. Our experiments using a significant part of the SuiteSparse Matrix Collection illustrate the potential benefits of this technique from the point of view, for example, of energy and performance.

Hartwig Anzt, Goran Flegar, Vedran Novaković, Enrique S. Quintana-Ortí, Andrés E. Tomás
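
The combination of a higher-precision outer refinement loop with a reduced-precision inner Conjugate Gradient solver, as described in the abstract above, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the outer loop here uses ordinary binary64 rather than extended precision, the inner solver only emulates binary32 by rounding each vector update, the function names (`f32`, `cg_single`, `refine`) and the small test matrix are hypothetical, and the paper's residual replacement strategies are not reproduced.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matvec(A, x):
    """Dense matrix-vector product on lists of lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def cg_single(A, b, iters):
    """Conjugate Gradient with every vector update rounded to binary32."""
    x = [0.0] * len(b)
    r = [f32(v) for v in b]
    p = r[:]
    rs = f32(sum(v * v for v in r))
    for _ in range(iters):
        Ap = [f32(v) for v in matvec(A, p)]
        pAp = f32(sum(u * v for u, v in zip(p, Ap)))
        if rs == 0.0 or pAp <= 0.0:
            break  # converged, or no usable search direction left
        alpha = f32(rs / pAp)
        x = [f32(xi + alpha * pi) for xi, pi in zip(x, p)]
        r = [f32(ri - alpha * ai) for ri, ai in zip(r, Ap)]
        rs_new = f32(sum(v * v for v in r))
        beta = f32(rs_new / rs)
        p = [f32(ri + beta * pi) for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def refine(A, b, tol=1e-12, max_outer=20):
    """Outer iterative refinement in binary64 around the low-precision CG."""
    x = [0.0] * len(b)
    b_norm = math.sqrt(sum(v * v for v in b))
    rel = 1.0
    for _ in range(max_outer):
        # True residual, recomputed explicitly in the higher precision.
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        rel = math.sqrt(sum(v * v for v in r)) / b_norm
        if rel < tol:
            break
        scale = max(abs(v) for v in r)  # keep the inner problem well scaled
        d = cg_single(A, [v / scale for v in r], iters=len(b))
        x = [xi + scale * di for xi, di in zip(x, d)]
    return x, rel


# 1-D Laplacian (symmetric positive definite) as a small test matrix.
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x, rel = refine(A, b)
print(rel)
```

Each outer step corrects the solution with an inexpensive low-precision solve, while the explicitly recomputed residual keeps the final accuracy tied to the higher precision.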

### Training Deep Neural Networks with Low Precision Input Data: A Hurricane Prediction Case Study

Training deep neural networks requires huge amounts of data. The next generation of intelligent systems will generate and utilise massive amounts of data which will be transferred along machine learning workflows. We study the effect of reducing the precision of this data at early stages of the workflow (i.e. input) on both prediction accuracy and learning behaviour of deep neural networks. We show that high precision data can be transformed to low precision before feeding it to a neural network model with an insignificant loss of accuracy. As such, a high precision representation of input data is not entirely necessary for some applications. The findings of this study pave the way for the application of deep learning in areas where acquiring high precision data is difficult due to both memory and computational power constraints. We further use a hurricane prediction case study where we predict the monthly number of hurricanes on the Atlantic Ocean using deep neural networks. We train a deep neural network model that predicts the number of hurricanes, first by using high precision input data and then by using low precision data. This leads to a drop in prediction accuracy of less than 2%.

Albert Kahira, Leonardo Bautista Gomez, Rosa M. Badia
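
To make the idea of reducing input precision before training concrete, the following sketch rounds a few input values to IEEE 754 half precision and measures the relative error this introduces. The abstract above does not say which low-precision format was used, so binary16 is an assumption here, and the feature values and function names are hypothetical rather than taken from the paper.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize(samples):
    """Apply the precision reduction to every input value."""
    return [to_half(v) for v in samples]


def max_relative_error(original, reduced):
    """Largest relative deviation introduced by the precision reduction."""
    return max(abs(o - r) / abs(o) for o, r in zip(original, reduced))


# Hypothetical input features (not data from the paper).
features = [0.1, 3.14159, 27.5, 1013.25]
low = quantize(features)
print(low, max_relative_error(features, low))
```

For values inside the normal binary16 range, round-to-nearest keeps the relative error at or below 2^-11, which gives a rough sense of how much information the network input loses.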

### A Transparent View on Approximate Computing Methods for Tuning Applications

Approximation-tolerant applications give a system designer the possibility to improve traditional design values by slightly decreasing the quality of result. Approximate computing methods introduced for various system layers present the right tools to exploit this potential. However, finding a suitable tuning for a set of methods during design or run time according to the constraints and the system state is tough. Therefore, this paper presents an approach that leads to a transparent view on different approximation methods. This transparent and abstract view can be exploited by tuning approaches to find suitable parameter settings for the current purpose. Furthermore, the presented approach takes multiple objectives and conventional methods, which influence traditional design values, into account. Besides this novel representation approach, this paper introduces a first tuning approach exploiting the presented approach.

Michael Bromberger, Wolfgang Karl

### Exploring the Effects of Code Optimizations on CPU Frequency Margins

Chip manufacturers introduce redundancy at various levels of CPU design to guarantee correct operation even for worst-case combinations of non-idealities in process variation and system operating conditions. This redundancy is implemented partly in the form of voltage/frequency margins. However, for a wide range of real-world execution scenarios, these margins are excessive and translate to increased power and energy consumption. Among the various factors that affect the degree to which these margins are actually needed to avoid errors during program execution, the impact of compiler and source code optimizations has not been explored yet. In this work, we study the effect of such optimizations on the frequency margins and the energy efficiency of applications on the ARM Cortex-A53 processor.

Konstantinos Parasyris, Nikolaos Bellas, Christos D. Antonopoulos, Spyros Lalis

### Taking Gradients Through Experiments: LSTMs and Memory Proximal Policy Optimization for Black-Box Quantum Control

In this work we introduce a general method to solve quantum control tasks as an interesting reinforcement learning problem not yet discussed in the machine learning community. We analyze the structure of the reinforcement learning problems typically arising in quantum physics and argue that agents parameterized by long short-term memory (LSTM) networks trained via stochastic policy gradients yield a versatile method for solving them. In this context we introduce a variant of the proximal policy optimization (PPO) algorithm called memory proximal policy optimization (MPPO), which is based on the previous analysis. We argue that our method can by design be easily combined with numerical simulations as well as real experiments providing the reward signal. We demonstrate how the method can incorporate physical domain knowledge and present results of numerical experiments showing that it achieves state-of-the-art performance for several learning tasks in quantum control with discrete and continuous control parameters.

Moritz August, José Miguel Hernández-Lobato

### Towards Prediction of Turbulent Flows at High Reynolds Numbers Using High Performance Computing Data and Deep Learning

In this paper, deep learning (DL) methods are evaluated in the context of turbulent flows. Various generative adversarial networks (GANs) are discussed with respect to their suitability for understanding and modeling turbulence. Wasserstein GANs (WGANs) are then chosen to generate small-scale turbulence. Highly resolved direct numerical simulation (DNS) turbulent data is used for training the WGANs and the effect of network parameters, such as learning rate and loss function, is studied. Qualitatively good agreement between DNS input data and generated turbulent structures is shown. A quantitative statistical assessment of the predicted turbulent fields is performed.

Mathis Bode, Michael Gauding, Jens Henrik Göbbert, Baohao Liao, Jenia Jitsev, Heinz Pitsch

### Using a Graph Visualization Tool for Parallel Program Dynamic Visualization and Communication Analysis

Parallel program visualization and performance analysis tools have a high cost of development. As a consequence, many of these tools are proprietary, which makes their adoption by the general community difficult. This work introduces the use of general purpose open software for visualization and characterization of parallel programs. In particular, the use of an open graph visualization tool is presented as a case study for the dynamic communication characterization of a NAS parallel benchmark. The results show that a general purpose open graph tool could be used to analyze some important aspects related to the communication of parallel message passing programs.

Denise Stringhini, Pedro Spoljaric Gomes, Alvaro Fazenda

Shared virtual memory simplifies heterogeneous platform programming by enabling sharing of memory address pointers between heterogeneous devices in the platform. The most advanced implementations present a coherent view of memory to the programmer over the whole virtual address space of the process. From the point of view of data accesses, this System SVM (SSVM) enables the same programming paradigm in heterogeneous platforms as found in homogeneous platforms. C++ revision 17 adds its first features for explicit parallelism through its “Parallel Standard Template Library” (PSTL). This paper discusses the technical issues in offloading PSTL on heterogeneous platforms supporting SSVM and presents a working GCC-based proof-of-concept implementation. Initial benchmarking of the implementation on an AMD Carrizo platform shows speedups from 1.28X to 12.78X in comparison to host-only sequential STL execution.

Pekka Jääskeläinen, John Glossner, Martin Jambor, Aleksi Tervo, Matti Rintala

### Lessons Learned from a Decade of Providing Interactive, On-Demand High Performance Computing to Scientists and Engineers

For decades, the use of HPC systems was limited to those in the physical sciences who had mastered their domain in conjunction with a deep understanding of HPC architectures and algorithms. During these same decades, consumer computing device advances produced tablets and smartphones that allow millions of children to interactively develop and share code projects across the globe. As the HPC community faces the challenges associated with guiding researchers from disciplines using high productivity interactive tools to effective use of HPC systems, it seems appropriate to revisit the assumptions surrounding the necessary skills required for access to large computational systems. For over a decade, MIT Lincoln Laboratory has been supporting interactive, on-demand high performance computing by seamlessly integrating familiar high productivity tools to provide users with an increased number of design turns, rapid prototyping capability, and faster time to insight. In this paper, we discuss the lessons learned while supporting interactive, on-demand high performance computing from the perspectives of the users and the team supporting the users and the system. Building on these lessons, we present an overview of current needs and the technical solutions we are building to lower the barrier to entry for new users from the humanities, social, and biological sciences.

Julia Mullen, Albert Reuther, William Arcand, Bill Bergeron, David Bestor, Chansup Byun, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Lauren Milechin, Andrew Prout, Antonio Rosa, Siddharth Samsi, Charles Yee, Jeremy Kepner

### Enabling Interactive Supercomputing at JSC Lessons Learned

Research and analysis of large amounts of data from scientific simulations, in-situ visualization, and application control are convincing scenarios for interactive supercomputing. The open-source software Jupyter (or JupyterLab) is a tool that has already been used successfully in many scientific disciplines. With its open and flexible web-based design, Jupyter is ideal for combining a wide variety of workflows and programming methods in a single interface. The multi-user capability of Jupyter via JupyterHub makes it well suited for scientific applications at supercomputing centers. It combines the workspace that is local to the user and the corresponding workspace on the HPC systems. In order to meet the requirements for more interactivity in supercomputing and to open up new possibilities in HPC, a simple and direct web access for starting and connecting to login or compute nodes with Jupyter or JupyterLab at Jülich Supercomputing Centre (JSC) is presented. To corroborate the flexibility of the new method, the motivation, applications, details and challenges of enabling interactive supercomputing, as well as goals and prospective future work, will be discussed.

Jens Henrik Göbbert, Tim Kreuzer, Alice Grosch, Andreas Lintermann, Morris Riedel

### Interactive Distributed Deep Learning with Jupyter Notebooks

Deep learning researchers are increasingly using Jupyter notebooks to implement interactive, reproducible workflows with embedded visualization, steering and documentation. Such solutions are typically deployed on small-scale (e.g. single server) computing systems. However, as the sizes and complexities of datasets and associated neural network models increase, high-performance distributed systems become important for training and evaluating models in a feasible amount of time. In this paper we describe our vision for Jupyter notebook solutions to deploy deep learning workloads onto high-performance computing systems. We demonstrate the effectiveness of notebooks for distributed training and hyper-parameter optimization of deep neural networks with efficient, scalable backends.

Steve Farrell, Aaron Vose, Oliver Evans, Matthew Henderson, Shreyas Cholia, Fernando Pérez, Wahid Bhimji, Shane Canon, Rollin Thomas, Prabhat

### Performance Portability of Earth System Models with User-Controlled GGDML Code Translation

The increasing need for performance in earth system modeling and other scientific domains pushes computing technologies in diverse architectural directions. The development of models requires technical expertise and skill in using tools that are able to exploit the hardware capabilities. The heterogeneity of architectures complicates the development and the maintainability of the models. To improve the software development process of earth system models, we provide an approach that simplifies code maintainability by fostering separation of concerns while providing performance portability. We propose the use of high-level language extensions that reflect scientific concepts. Scientists can use the programming language of their own choice to develop models and can optionally use the language extensions wherever they need them. The code translation is driven by configurations that are separated from the model source code. These configurations are prepared by scientific programmers to optimally use the machine’s features. The main contribution of this paper is the demonstration of a user-controlled source-to-source translation technique for earth system models that are written with higher-level semantics. We discuss a flexible code translation technique that is driven by the users through a configuration input that is prepared especially to transform the code, and we use this technique to produce OpenMP or OpenACC enabled codes in addition to MPI to support multi-node configurations.

Nabeeh Jum’ah, Julian Kunkel

### Evaluating Performance Portability of Accelerator Programming Models using SPEC ACCEL 1.2 Benchmarks

As heterogeneous architectures are becoming mainstream for HPC systems, application programmers are looking for programming model implementations that offer both performance and portability across platforms. Two directive-based programming models for accelerator programming that aim at doing this are OpenMP 4/4.5 and OpenACC. Many users want to know the difference between these two programming models, the state of their implementations, and how to use them, and want to evaluate how suitable they are for their applications. The Standard Performance Evaluation Corporation (SPEC) High Performance Group (HPG) recently released the SPEC ACCEL 1.2 benchmark suite to help evaluate OpenCL, OpenMP 4.5 and OpenACC on different platforms. In this paper we present preliminary results that evaluate OpenMP 4.5 and OpenACC on a variety of accelerator-based systems: POWER9 with NVIDIA V100 GPUs (Summit), Intel Xeon Phi 7230 (Percival), and AMD Bulldozer Opteron with NVIDIA K20x (Titan). Comparing these benchmarks on different systems gives us insight into the support for OpenMP and OpenACC, and their execution times provide insight into the quality of the implementations provided by different vendors. We also compare the best of OpenMP and OpenACC to see whether a particular programming model favors a particular type of benchmark kernel.

Swen Boehm, Swaroop Pophale, Verónica G. Vergara Larrea, Oscar Hernandez

### A Beginner’s Guide to Estimating and Improving Performance Portability

Given the increasing diversity of multi- and many-core processors, portability is a desirable feature of applications designed and implemented for such platforms. Portability is unanimously seen as a productivity enabler, but it is also considered a major performance blocker. Thus, performance portability has emerged as the property of an application to preserve similar form and similar performance on a set of platforms; a first metric, based on extensive evaluation, has been proposed to quantify performance portability for a given application on a set of given platforms. In this work, we explore the challenges and limitations of this performance portability metric (PPM) on two levels. We first use 5 OpenACC applications and 3 platforms, and we demonstrate how to compute and interpret PPM in this context. Our results indicate specific challenges in parameter selection and results interpretation. Second, we use controlled experiments to assess the impact of platform-specific optimizations on both performance and performance portability. Our results illustrate, for our 5 OpenACC applications, a clear tension between performance improvement and performance portability improvement.

Henk Dreuning, Roel Heirman, Ana Lucia Varbanescu
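
As a rough illustration of how such a metric can be computed, the sketch below assumes that the PPM referred to in the abstract above is the harmonic-mean definition proposed by Pennycook, Sewall and Lee: the harmonic mean of an application's efficiency on each platform in the set, and zero if the application does not run on one of them. The function name, the platform labels, and the efficiency values are hypothetical.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies.

    `efficiencies` maps a platform name to an efficiency in (0, 1],
    or to None when the application does not run on that platform.
    Returns 0.0 if any platform in the set is unsupported.
    """
    if not efficiencies:
        raise ValueError("the platform set must not be empty")
    values = list(efficiencies.values())
    if any(e is None or e <= 0.0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


# Hypothetical efficiencies of one application on three platforms.
measured = {"cpu": 0.5, "gpu": 0.25, "many_core": 0.5}
print(performance_portability(measured))

# An unsupported platform drives the metric to zero.
print(performance_portability({"cpu": 0.5, "gpu": None}))
```

Because the harmonic mean is dominated by the smallest term, a single poorly performing platform pulls the score down sharply, which is one reason the choice of platform set and efficiency definition matters when interpreting the result.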

### Profiling and Debugging Support for the Kokkos Programming Model

Supercomputing hardware is undergoing a period of significant change. In order to cope with the rapid pace of hardware and, in many cases, programming model innovation, we have developed the Kokkos Programming Model, a C++-based abstraction that permits performance portability across diverse architectures. Our experience has shown that the abstractions developed can significantly frustrate debugging and profiling activities because they break expected code proximity and layout assumptions. In this paper we present the Kokkos Profiling interface, a lightweight suite of hooks to which debugging and profiling tools can attach to gain deep insights into the execution and data structure behaviors of parallel programs written to the Kokkos interface.

Simon D. Hammond, Christian R. Trott, Daniel Ibanez, Daniel Sunderland

### Backmatter
