

High Performance Computing

33rd International Conference, ISC High Performance 2018, Frankfurt, Germany, June 24-28, 2018, Proceedings


About this book

This book constitutes the refereed proceedings of the 33rd International Conference, ISC High Performance 2018, held in Frankfurt, Germany, in June 2018.

The 20 revised full papers presented in this book were carefully reviewed and selected from 81 submissions. The papers cover the following topics: Resource Management and Energy Efficiency; Performance Analysis and Tools; Exascale Networks; Parallel Algorithms.

Table of Contents

Frontmatter

Resource Management and Energy Efficiency

Frontmatter
Heterogeneity-Aware Resource Allocation in HPC Systems
Abstract
In their march towards exascale performance, HPC systems are becoming increasingly heterogeneous in an effort to keep power consumption at bay. Exploiting accelerators such as GPUs and MICs together with traditional processors to their fullest requires heterogeneous HPC systems to employ intelligent job dispatchers that go beyond the capabilities of those developed for homogeneous systems. In this paper, we propose three new heterogeneity-aware resource allocation algorithms suitable for building job dispatchers for any HPC system. We use real workload traces extracted from the Eurora HPC system to analyze the performance of our allocators when they are coupled with different schedulers. Our experimental results show that significant improvements can be obtained in job response times and system throughput over solutions developed for homogeneous systems. Our study also helps to characterize the operating conditions in which heterogeneity-aware resource allocation becomes crucial for heterogeneous HPC systems.
Alessio Netti, Cristian Galleguillos, Zeynep Kiziltan, Alina Sîrbu, Ozalp Babaoglu
On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors
Abstract
This paper presents refinements to the execution-cache-memory (ECM) performance model and a previously published power model for multicore processors. The combination of both enables a very accurate prediction of the performance and energy consumption of contemporary multicore processors as a function of relevant parameters such as the number of active cores and the core and Uncore frequencies. Model validation is performed on Intel Sandy Bridge-EP, Broadwell-EP, and AMD Epyc processors. Production-related variations in chip quality are demonstrated through a statistical analysis of the fit parameters obtained on one hundred Broadwell-EP CPUs of the same model. Insights from the models are used to explain the performance- and energy-related behavior of the processors for scalable as well as saturating (i.e., memory-bound) codes. In the process we demonstrate the models' capability to identify optimal operating points with respect to highest performance, lowest energy-to-solution, and lowest energy-delay product, and we identify a set of best practices for energy-efficient execution.
Johannes Hofmann, Georg Hager, Dietmar Fey
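
For orientation, the two model families combined here can be sketched as follows (notation assumed; this is not the paper's exact formulation). The ECM model composes runtime per unit of work from an overlapping in-core part and a non-overlapping part plus data transfers, while a polynomial power model in the core frequency \(f\) with \(n\) active cores yields energy to solution:

\[ T_{\mathrm{ECM}} \approx \max\bigl(T_{\mathrm{OL}},\; T_{\mathrm{nOL}} + T_{\mathrm{data}}\bigr), \qquad P(n,f) \approx P_{\mathrm{static}} + n\,(p_0 + p_1 f + p_2 f^2), \qquad E = P \cdot T \]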
Bayesian Optimization of HPC Systems for Energy Efficiency
Abstract
Energy efficiency is a crucial factor in developing large supercomputers and cost-effective datacenters. However, tuning a system for energy efficiency is difficult because power and performance are conflicting demands. We applied Bayesian optimization (BO) to tune a graphics processing unit (GPU) cluster system for the benchmark used in the Green500 list, a popular energy-efficiency ranking of supercomputers. The resulting benchmark score enabled our system, named “kukai”, to earn second place in the Green500 list in June 2017, showing that BO is a useful tool. By determining the search space with minimal knowledge and preliminary experiments beforehand, BO could automatically find a sufficiently good configuration. Thus, BO could eliminate laborious manual tuning work and reduce the occupancy time of the system for benchmarking. Because BO is a general-purpose method, it may also be useful for tuning practical applications beyond the Green500 benchmark.
Takashi Miyazaki, Issei Sato, Nobuyuki Shimizu
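
As a rough illustration of the tuning loop (a minimal sketch, not the authors' setup: the objective green500_score and the one-dimensional search space are placeholders), a Gaussian-process surrogate with an expected-improvement acquisition can drive the search:

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def green500_score(x):
        # Placeholder objective (hypothetical): peak efficiency at an
        # interior operating point, e.g. a normalized GPU clock.
        return -(x - 0.6) ** 2 + 0.05 * np.sin(20 * x)

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, (3, 1))                 # initial random configs
    y = np.array([green500_score(v[0]) for v in X])
    grid = np.linspace(0, 1, 200).reshape(-1, 1)  # candidate configurations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

    for _ in range(15):                           # fit, acquire, evaluate
        gp.fit(X, y)
        mu, sigma = gp.predict(grid, return_std=True)
        imp = mu - y.max()
        z = imp / np.maximum(sigma, 1e-9)
        ei = imp * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
        x_next = grid[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, green500_score(x_next[0]))

    print("best config:", X[np.argmax(y), 0], "score:", y.max())

In practice the search space would span several hardware and benchmark knobs at once, and each evaluation would be an actual benchmark run rather than a closed-form function.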
DTF: An I/O Arbitration Framework for Multi-component Data Processing Workflows
Abstract
Multi-component workflows, where one component performs a particular transformation on the data and passes it on to the next component, are a common way of performing complex computations. Using components as building blocks, we can apply sophisticated data processing algorithms to large volumes of data. Because the components may be developed independently, they often use file I/O and the parallel file system to pass data. However, as the data volume increases, file I/O quickly becomes the bottleneck in such workflows. In this work, we propose an I/O arbitration framework called DTF that alleviates this problem by silently replacing file I/O with direct data transfer between the components. DTF treats file I/O calls as I/O requests and performs I/O request matching to carry out the data movement. Currently, the framework works with PnetCDF-based multi-component workflows. It requires minimal modifications to applications and allows the user to easily control the I/O flow via the framework's configuration file.
Tatiana V. Martsinkevich, Balazs Gerofi, Guo-Yuan Lien, Seiya Nishizawa, Wei-keng Liao, Takemasa Miyoshi, Hirofumi Tomita, Yutaka Ishikawa, Alok Choudhary
Classifying Jobs and Predicting Applications in HPC Systems
Abstract
Next-generation supercomputers are expected to consume tens of MW of electric power, and their consumption is expected to fluctuate instantaneously between several MW and tens of MW during execution. Such fluctuations can cause voltage drops in regional power grids and affect the operation of chillers and generators in the computer's facility. Predicting these fluctuations in advance can aid the safe operation of both the power grid and the facility. Because abrupt fluctuations and a high average of consumed power are application-specific features, it is important to identify an application before job execution. This paper provides a methodology for classifying executed jobs into applications. With this method, various statistics for each application, such as the number of executions, runtime, resource usage, and power consumption, can be examined. To estimate the power consumed by job execution, we propose a method to predict application characteristics from submitted job scripts. We demonstrate that 328 kinds of applications are executed in 273,121 jobs and that the application can be predicted with an accuracy of approximately 92%.
Keiji Yamamoto, Yuichi Tsujita, Atsuya Uno
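
A toy sketch of the prediction step (hypothetical scripts, labels, and features; the paper's actual pipeline differs) could classify job scripts by their text:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    # Hypothetical job scripts paired with known application labels.
    scripts = [
        "#!/bin/bash\nmpirun ./lammps -in in.melt",
        "#!/bin/bash\nmpirun ./vasp_std",
        "#!/bin/bash\nmpirun ./lammps -in in.shear",
        "#!/bin/bash\nmpirun ./vasp_gam",
    ]
    apps = ["lammps", "vasp", "lammps", "vasp"]

    model = make_pipeline(
        TfidfVectorizer(token_pattern=r"[\w./-]+"),  # keep paths/flags as tokens
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    model.fit(scripts, apps)
    print(model.predict(["#!/bin/bash\nmpirun ./vasp_std 64"]))  # -> ['vasp']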

Performance Analysis and Tools

Frontmatter
A Survey of Programming Tools for D-Wave Quantum-Annealing Processors
Abstract
The rapid growth in the realized performance of D-Wave Systems' annealing-based quantum processing units (QPUs) has sparked a surge in tools development to deliver the anticipated performance to application developers. In this survey we describe the tools that are available, their goals (e.g., performance or ease of use), the programming abstractions they expose, and their use for application development. The existing tools confirm the need for interfaces at a variety of points on the continuum between complexity and simplicity in using the QPU. Most of the current tools abstract the hardware's native topology but generally do not use existing interfaces that are familiar to typical programmers. To date, only a small number of applications have been implemented for QPUs. Our survey finds that tools can provide great leverage to enable more applications, as long as they expose the appropriate abstractions and deliver the anticipated performance.
Scott Pakin, Steven P. Reinhardt
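
For context, the standard objective such tools ultimately target: the QPU minimizes an Ising energy over spins \(s_i \in \{-1,+1\}\) with programmable biases \(h_i\) and couplings \(J_{ij}\), or equivalently a QUBO over binary variables:

\[ E(\mathbf{s}) = \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j, \qquad \min_{\mathbf{x}\in\{0,1\}^n} \mathbf{x}^{\mathsf{T}} Q\,\mathbf{x} \]

The surveyed tools differ mainly in how far they hide this formulation, and the hardware's native topology, from the programmer.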
Compiler-Assisted Source-to-Source Skeletonization of Application Models for System Simulation
Abstract
Performance modeling of networks through simulation requires application endpoint models that inject traffic into the simulation. Endpoint models for system-scale studies today consist mainly of post-mortem trace replay, but these off-line simulations may lack flexibility and scalability. On-line simulations instead run so-called skeleton applications: reduced versions of an application that generate traffic the same as or similar to that of the full application. Skeleton apps have advantages in flexibility and scalability, but they often must be custom-written for the simulator itself. Auto-skeletonization of existing application source code via compiler tools would provide endpoint models with minimal development effort. Such source-to-source transformations have been only narrowly explored. We introduce a pragma language and a corresponding Clang-driven source-to-source compiler that performs auto-skeletonization based on the provided pragma annotations. We describe the compiler toolchain, validate the generated skeletons, and show scalability of the generated simulation models beyond 100,000 endpoints for example MPI applications. Overall, we assert that our proposed auto-skeletonization approach and the flexible skeletons it produces can be an important tool in realizing balanced exascale interconnect designs.
Jeremiah J. Wilke, Joseph P. Kenny, Samuel Knight, Sebastien Rumley
Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning
Abstract
Modern distributed computing frameworks for cloud computing and high performance computing typically accelerate job performance by dividing a large job into small tasks for execution parallelism. Some tasks, however, may run far behind others, which jeopardizes the job completion time. In this paper, we present Zeno, a novel system which automatically identifies and diagnoses stragglers for jobs using machine learning methods. First, the system identifies stragglers with an unsupervised clustering method which groups the tasks based on their execution time. It then uses a supervised rule-learning algorithm to learn diagnosis rules inferring the stragglers from their resource assignment and usage data. Zeno is evaluated on traces from Google's Borg system and Alibaba's Fuxi system. The results demonstrate that our system is able to generate simple and easy-to-read rules that offer both valuable insights and decent performance in predicting stragglers.
Huanxing Shen, Cong Li
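
Zeno's two-stage idea can be illustrated on synthetic data (a minimal sketch; the actual system, features, and algorithms differ): cluster task durations to flag the slow group, then fit an interpretable model over resource features to explain it:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(1)
    durations = np.concatenate([rng.normal(10, 1, 95), rng.normal(40, 5, 5)])
    cpu_wait = np.concatenate([rng.normal(0.1, 0.05, 95), rng.normal(0.8, 0.1, 5)])

    # Stage 1: unsupervised grouping of execution times; the slowest
    # cluster is labeled as the stragglers.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
        durations.reshape(-1, 1))
    slow = labels == np.argmax([durations[labels == k].mean() for k in (0, 1)])

    # Stage 2: learn human-readable diagnosis rules from resource usage.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(cpu_wait.reshape(-1, 1), slow)
    print(export_text(tree, feature_names=["cpu_wait_ratio"]))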
Applicability of the ECM Performance Model to Explicit ODE Methods on Current Multi-core Processors
Abstract
Autotuning techniques are a promising approach to support the performance portability of scientific applications brought to a new HPC system. Ideally, these approaches can derive an efficient implementation for a specific HPC system by applying suitable program transformations. Often, a large number of implementation variants results, and the most efficient of these variants must be selected. In this article, we investigate performance modelling and prediction techniques that can support this selection process and may significantly reduce the selection effort compared to extensive runtime tests. We apply the execution-cache-memory (ECM) performance model to numerical solution methods for ordinary differential equations (ODEs). In particular, we consider the question of whether it is possible to obtain a performance prediction for the resulting implementation variants to support the variant selection. We investigate the accuracy of the prediction for different ODEs and different hardware platforms and show that the prediction is able to reliably select a set of fast variants and, thus, to limit the search space for possible later empirical tuning.
Johannes Seiferth, Christie Alappat, Matthias Korch, Thomas Rauber
Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems
Abstract
Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of application parameters on performance to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models are better at prediction when compared with application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert Latham, Robert Ross, Shane Snyder, Stefan M. Wild
Performance Optimization and Evaluation of Scalable Optoelectronics Application on Large Scale KNL Cluster
Abstract
“ARTED” is an advanced scientific code for electron dynamics simulation that has been ported to various large-scale parallel systems, including the “K” Computer, formerly the world's fastest supercomputer, and many other MPP and cluster systems.

In this paper, we describe ARTED's code optimization and performance evaluation on a large-scale cluster with Intel's latest many-core processor, the KNL (Knights Landing), building on past research on porting ARTED to the KNC (Knights Corner) coprocessor. The dominant computation has been thoroughly optimized for KNL, with detailed attention to memory access, vectorization for the AVX-512 instruction set, and cache utilization. For further tuning, we investigated various KNL-specific techniques such as combining MCDRAM/DDR4 memories and parallel vector summation.

After detailed performance tuning on each core, achieving up to 25% of theoretical peak in the kernel part with 3-D stencil computation, we evaluated the application performance on the full system (25 PFLOPS of theoretical peak) of the KNL cluster “Oakforest-PACS”, the largest KNL-based cluster in the world, using the Intel Omni-Path Architecture. It shows excellent weak scaling, with a dominant Hamiltonian performance of up to 4 PFLOPS (16% efficiency of the system) in double precision irrespective of simulation size, as well as reasonable strong scaling on material simulations requiring a high degree of parallelism.
Yuta Hirokawa, Taisuke Boku, Mitsuharu Uemoto, Shunsuke A. Sato, Kazuhiro Yabana
A Novel Multi-level Integrated Roofline Model Approach for Performance Characterization
Abstract
With energy-efficient architectures, including accelerators and many-core processors, gaining traction, application developers face the challenge of optimizing their applications for multiple hardware features including many-core parallelism, wide processing vector-units and on-chip high-bandwidth memory. In this paper, we discuss the development and utilization of a new application performance tool based on an extension of the classical roofline-model for simultaneously profiling multiple levels in the cache-memory hierarchy. This tool presents a powerful visual aid for the developer and can be used to frame the many-dimensional optimization problem in a tractable way. We show case studies of real scientific applications that have gained insights from the Integrated Roofline Model.
Tuomas Koskela, Zakhar Matveev, Charlene Yang, Adetokunbo Adedoyin, Roman Belenov, Philippe Thierry, Zhengji Zhao, Rahulkumar Gayatri, Hongzhang Shan, Leonid Oliker, Jack Deslippe, Ron Green, Samuel Williams
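
The underlying bound is the classical roofline, which the integrated model applies simultaneously at each level \(\ell\) of the cache-memory hierarchy (notation assumed): attainable performance is capped either by peak compute or by the product of arithmetic intensity and bandwidth measured at that level,

\[ P_\ell = \min\bigl(P_{\mathrm{peak}},\; I_\ell \cdot b_\ell\bigr) \]

so a kernel sitting far below every roof signals an optimization opportunity, and the binding level indicates where to look.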
Hardware Performance Variation: A Comparative Study Using Lightweight Kernels
Abstract
Imbalance among components of large-scale parallel simulations can adversely affect overall application performance. Software-induced imbalance has been extensively studied in the past; however, there is growing interest in characterizing and understanding another source of variability: the one induced by the hardware itself. This is particularly interesting given the growing diversity of hardware platforms deployed in high-performance computing (HPC) and the increasing complexity of computer architectures in general. Nevertheless, characterizing hardware performance variability is challenging because one needs to ensure a tightly controlled software environment.
In this paper, we propose to use lightweight operating system kernels to provide a high-precision characterization of various aspects of hardware performance variability. Towards this end, we have developed an extensible benchmarking framework and characterized multiple compute platforms (e.g., Intel x86, Cavium ARM64, Fujitsu SPARC64, IBM Power) running on top of lightweight kernel operating systems. Our initial findings show up to six orders of magnitude difference in relative variation among CPU cores across different platforms.
Hannes Weisbach, Balazs Gerofi, Brian Kocoloski, Hermann Härtig, Yutaka Ishikawa
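
A toy user-space version of such a measurement is sketched below (illustrative only; the paper's framework runs on lightweight kernels precisely because a general-purpose OS adds noise that this snippet cannot exclude):

    import time, statistics

    def workload():
        # Fixed deterministic work whose runtime should, ideally, not vary.
        s = 0
        for i in range(200_000):
            s += i * i
        return s

    samples = []
    for _ in range(100):
        t0 = time.perf_counter_ns()
        workload()
        samples.append(time.perf_counter_ns() - t0)

    mean = statistics.mean(samples)
    print("relative variation: %.4f" % (statistics.stdev(samples) / mean))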

Exascale Networks

Frontmatter
The Pitfalls of Provisioning Exascale Networks: A Trace Replay Analysis for Understanding Communication Performance
Abstract
Data movement is considered the main performance concern for exascale, including both on-node memory and off-node network communication. Indeed, many application traces show significant time spent in MPI calls, potentially indicating that faster networks must be provisioned for scalability. However, equating MPI times with network communication delays ignores synchronization delays and software overheads independent of network hardware. Using point-to-point protocol details, we explore the decomposition of MPI time into communication, synchronization and software stack components using architecture simulation. Detailed validation using Bayesian inference is used to identify the sensitivity of performance to specific latency/bandwidth parameters for different network protocols and to quantify associated uncertainties. The inference combined with trace replay shows that synchronization and MPI software stack overhead are at least as important as the network itself in determining time spent in communication routines.
Joseph P. Kenny, Khachik Sargsyan, Samuel Knight, George Michelogiannakis, Jeremiah J. Wilke
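
The decomposition studied can be summarized as (component names assumed here)

\[ T_{\mathrm{MPI}} = T_{\mathrm{network}} + T_{\mathrm{sync}} + T_{\mathrm{stack}}, \]

and the paper's point is that attributing all of \(T_{\mathrm{MPI}}\) to \(T_{\mathrm{network}}\) overstates the benefit of provisioning faster links.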
Megafly: A Topology for Exascale Systems
Abstract
In this paper we explore network topologies suitable for future exascale systems that need to support over fifty thousand endpoints. With the increased necessity to use optics at higher link speeds, some of the more traditional topologies, such as Tori and Fat-Trees, become prohibitively expensive at such large scale. We identify two cost-efficient hierarchical topologies: one a canonical Dragonfly, and one a variant of the Dragonfly topology that we call Megafly. Megafly is an indirect hierarchical topology with high path diversity, flexible tapering options, and an abundance of possible system design points. We describe and analyze the Megafly topology to understand its key features and advantages compared to the Dragonfly. Additionally, we define a Megafly tapering scheme that enables a good balance of system performance versus cost. Our evaluation shows that the Megafly topology achieves equal or better throughput than the Dragonfly on a variety of traffic patterns, while requiring only half the virtual channels for deadlock-free routing. Megafly also provides better fairness, as shown in the evaluation of synchronizing traffic patterns such as neighbor exchanges. We also showcase the design flexibility and cost vs. performance trade-offs of Megafly in a mini case study that illustrates the challenges of building a high-performance fabric topology.
Mario Flajslik, Eric Borch, Mike A. Parker
Packetization of Shared-Memory Traces for Message Passing Oriented NoC Simulation
Abstract
Several benchmark suites, which provide a wide spectrum of applications in relevant domains, have been proposed and widely used in the computer architecture community. In the majority of them, a shared-memory communication model is assumed for communication among the tasks/threads of an application, and most works in the context of Network-on-Chip (NoC) architectures use these benchmarks as the basis for their experiments. NoC architectures, however, also enable message-passing communication, which the applications in current benchmark suites do not exploit. In this paper, we propose a technique for converting the trace of memory references generated by the execution of a shared-memory multi-threaded program into the trace of communication messages that would be obtained had the same program been designed to use message passing. The proposed technique is applied to a set of representative benchmarks from the SPLASH-2 and PARSEC benchmark suites.
Vincenzo Catania, Salvatore Monteleone, Maurizio Palesi, Davide Patti
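
The core of such a conversion can be sketched in a few lines (trace format and message policy are simplified guesses, not the paper's algorithm): a read by one thread of an address last written by another thread implies a producer-to-consumer message:

    last_writer = {}          # address -> thread that last wrote it
    messages = []             # (src_thread, dst_thread, address)

    trace = [                 # (thread, op, address): toy trace
        (0, "W", 0x100), (1, "R", 0x100), (1, "W", 0x200), (0, "R", 0x200),
    ]

    for thread, op, addr in trace:
        if op == "W":
            last_writer[addr] = thread
        elif op == "R":
            src = last_writer.get(addr)
            if src is not None and src != thread:
                messages.append((src, thread, addr))   # producer -> consumer

    print(messages)  # [(0, 1, 256), (1, 0, 512)]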

Parallel Algorithms

Frontmatter
Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs
Abstract
Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi “Knights Landing” and Nvidia “Pascal”/“Volta.” After a thorough performance analysis of the single-device implementations using the roofline model we propose two optimizations: (1) Subspace blocking is applied for improved performance and data access efficiency. We also show that it allows transparently handling problems much larger than the high-bandwidth memory without significant performance penalties. (2) Pipelining of communication and computation phases of successive subspaces is implemented to hide communication costs without extra memory traffic. As an application scenario we perform filter diagonalization studies for topological quantum matter. Performance numbers on up to 2048 nodes of the Oakforest-PACS and Piz Daint supercomputers are presented, achieving beyond 500 Tflop/s for computing \(10^2\) inner eigenvalues of sparse matrices of dimension \(4\cdot 10^9\).
Moritz Kreutzer, Dominik Ernst, Alan R. Bishop, Holger Fehske, Georg Hager, Kengo Nakajima, Gerhard Wellein
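
The computational core is the Chebyshev three-term recurrence applied to a block of vectors; a minimal sketch follows (the spectral scaling and filter coefficients are crude placeholders for the real window-dependent values):

    import numpy as np
    import scipy.sparse as sp

    n, nb, degree = 1000, 8, 50
    H = sp.random(n, n, density=1e-3, random_state=0)
    H = 0.5 * (H + H.T)                      # symmetrize
    H = H / abs(H).sum(axis=1).max()         # crude scaling: spectrum in [-1, 1]
    X = np.linalg.qr(np.random.default_rng(0).normal(size=(n, nb)))[0]

    c = np.ones(degree + 1)                  # placeholder filter coefficients
    t_prev, t_cur = X, H @ X                 # T_0(H)X = X,  T_1(H)X = H X
    Y = c[0] * t_prev + c[1] * t_cur
    for k in range(2, degree + 1):
        t_prev, t_cur = t_cur, 2.0 * (H @ t_cur) - t_prev  # T_k = 2 H T_{k-1} - T_{k-2}
        Y += c[k] * t_cur                    # accumulate the filtered block p(H)X

The block structure is what enables the paper's subspace blocking and the pipelining of communication with computation across successive subspaces.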
Combining HTM with RCU to Speed Up Graph Coloring on Multicore Platforms
Abstract
Graph algorithms are hard to parallelize, as they exhibit varying degrees of parallelism and perform irregular memory accesses. Graph coloring is a well-studied problem that assigns colors to the vertices of a graph such that no two adjacent vertices have the same color; a large number of applications require such a coloring with few colors in near-linear time. In this work, we propose a simple and fast parallel graph coloring algorithm, well suited for shared memory architectures. Our algorithm employs Hardware Transactional Memory (HTM) to detect coloring inconsistencies between adjacent vertices, and exploits Read-Copy-Update (RCU) to enable high performance and ensure correctness.
We evaluate our algorithm on an Intel Haswell server using large-scale synthetic and real-world graphs, chosen to vary in terms of density and structure. With 14 threads, we achieved a geometric-mean speedup of 4.35 and a maximum speedup of 11.44.
Christina Giannoula, Georgios Goumas, Nectarios Koziris
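
The speculate-then-repair pattern being accelerated can be illustrated sequentially (HTM conflict detection and RCU-protected reads are what make the concurrent version of this loop safe and fast; this sketch shows only the coloring logic):

    def greedy_color(adj):
        color = {}
        for v in adj:                          # in parallel this step is speculative
            used = {color[u] for u in adj[v] if u in color}
            color[v] = next(c for c in range(len(adj)) if c not in used)
        return color                           # conflicts would then be re-colored

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0]}
    print(greedy_color(adj))                   # {0: 0, 1: 1, 2: 2, 3: 1}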
Distributed Deep Reinforcement Learning: Learn How to Play Atari Games in 21 minutes
Abstract
We present a study in Distributed Deep Reinforcement Learning (DDRL) focused on the scalability of a state-of-the-art deep reinforcement learning algorithm known as Batch Asynchronous Advantage Actor-Critic (BA3C). We show that using the Adam optimization algorithm with a batch size of up to 2048 is a viable choice for carrying out large-scale machine learning computations. This, combined with careful reexamination of the optimizer's hyperparameters, using synchronous training at the node level (while keeping the local, single-node part of the algorithm asynchronous) and minimizing the model's memory footprint, allowed us to achieve linear scaling for up to 64 CPU nodes. This corresponds to a training time of 21 minutes on 768 CPU cores, as opposed to the roughly 10 hours required by a baseline single-node implementation on 24 cores.
Igor Adamski, Robert Adamski, Tomasz Grel, Adam Jędrych, Kamil Kaczmarek, Henryk Michalewski
TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism
Abstract
As chip multi-processors (CMPs) are becoming more and more complex, software solutions such as parallel programming models are attracting a lot of attention. Task-based parallel programming models offer an appealing approach to utilize complex CMPs. However, the increasing number of cores on modern CMPs is pushing research towards the use of fine grained parallelism. Task-based programming models need to be able to handle such workloads and offer performance and scalability. Using specialized hardware for boosting performance of task-based programming models is a common practice in the research community.
Our paper makes the observation that task creation becomes a bottleneck when executing fine-grained parallel applications with many task-based programming models. As the number of cores increases, the time spent generating the application's tasks becomes more critical to the overall execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities. It then offloads this part of the runtime to specialized hardware. We derive the requirements this hardware must meet in order to boost the execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric multicore systems, we obtain performance improvements of up to 15\(\times\), averaging 3.1\(\times\) over the baseline.
Kallia Chronaki, Marc Casas, Miquel Moreto, Jaume Bosch, Rosa M. Badia
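
The decoupling idea can be mimicked in software (purely conceptual; in TaskGenX the creator role is played by the proposed hardware unit, not a thread): task descriptors are materialized off the critical path while workers consume them concurrently:

    import queue, threading

    tasks = queue.Queue(maxsize=1024)

    def creator(n):
        for i in range(n):                   # task creation runs concurrently
            tasks.put(("work", i))           # with execution, not inline
        for _ in range(4):
            tasks.put(None)                  # poison pills: one per worker

    def worker():
        while (t := tasks.get()) is not None:
            pass                             # execute fine-grained task t here

    threads = [threading.Thread(target=creator, args=(10000,))]
    threads += [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()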
Backmatter
Metadata

Title: High Performance Computing
Editors: Rio Yokota, Michèle Weiland, David Keyes, Carsten Trinitis
Copyright Year: 2018
Electronic ISBN: 978-3-319-92040-5
Print ISBN: 978-3-319-92039-9
DOI: https://doi.org/10.1007/978-3-319-92040-5