
2019 | Book

High Performance Computing

34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings

Edited by: Dr. Michèle Weiland, Dr. Guido Juckeland, Carsten Trinitis, Ponnuswamy Sadayappan

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 34th International Conference on High Performance Computing, ISC High Performance 2019, held in Frankfurt/Main, Germany, in June 2019.

The 17 revised full papers presented were carefully reviewed and selected from 70 submissions. The papers cover a broad range of topics such as next-generation high performance components; exascale systems; extreme-scale applications; HPC and advanced environmental engineering projects; parallel ray tracing - visualization at its best; blockchain technology and cryptocurrency; parallel processing in life science; quantum computers/computing; what's new with cloud computing for HPC; parallel programming models for extreme-scale computing; workflow management; machine learning and big data analytics; and deep learning and HPC.

Table of Contents

Frontmatter

Architectures, Networks and Infrastructure

Frontmatter
Evaluating Quality of Service Traffic Classes on the Megafly Network
Abstract
An emerging trend in High Performance Computing (HPC) systems that use hierarchical topologies (such as dragonfly) is that the applications are increasingly exhibiting high run-to-run performance variability. This poses a significant challenge for application developers, job schedulers, and system maintainers. One approach to address the performance variability is to use newly proposed network topologies such as megafly (or dragonfly+) that offer increased path diversity compared to a traditional fully connected dragonfly. Yet another approach is to use quality of service (QoS) traffic classes that ensure bandwidth guarantees. In this work, we select HPC application workloads that have exhibited performance variability on current 2-D dragonfly systems. We evaluate the baseline performance expectations of these workloads on megafly and 1-D dragonfly network models with comparable network configurations. Our results show that the megafly network, despite using fewer virtual channels (VCs) for deadlock avoidance than a dragonfly, performs as well as a fully connected 1-D dragonfly network. We then exploit the fact that megafly networks require fewer VCs to incorporate QoS traffic classes. We use bandwidth capping and traffic differentiation techniques to introduce multiple traffic classes in megafly networks. In some cases, our results show that QoS can completely mitigate application performance variability while causing minimal slowdown to the background network traffic.
Misbah Mubarak, Neil McGlohon, Malek Musleh, Eric Borch, Robert B. Ross, Ram Huggahalli, Sudheer Chunduri, Scott Parker, Christopher D. Carothers, Kalyan Kumaran

Artificial Intelligence and Machine Learning

Frontmatter
Densifying Assumed-Sparse Tensors
Improving Memory Efficiency and MPI Collective Performance During Tensor Accumulation for Parallelized Training of Neural Machine Translation Models
Abstract
Neural machine translation - using neural networks to translate human language - is an area of active research exploring new neuron types and network topologies with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches, such as the multi-head attention-based transformer, require very large translation corpora and many epochs to produce models of reasonable quality. Recent attempts to parallelize the official TensorFlow “Transformer” model across multiple nodes have hit roadblocks due to excessive memory use and resulting out-of-memory errors when performing MPI collectives.
This paper describes modifications made to the Horovod MPI-based distributed training framework to reduce memory usage for transformer models by converting assumed-sparse tensors to dense tensors, and subsequently replacing sparse gradient gather with dense gradient reduction. The result is a dramatic increase in scale-out capability, with CPU-only scaling tests achieving 91% weak scaling efficiency up to 1200 MPI processes (300 nodes), and up to 65% strong scaling efficiency up to 400 MPI processes (200 nodes) using the Stampede2 supercomputer.
Derya Cavdar, Valeriu Codreanu, Can Karakus, John A. Lockman III, Damian Podareanu, Vikram Saletore, Alexander Sergeev, Don D. Smith II, Victor Suthichai, Quy Ta, Srinivas Varadharajan, Lucas A. Wilson, Rengan Xu, Pei Yang
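The densification described in this abstract corresponds to an option that recent Horovod releases expose on their distributed optimizer. The sketch below (Python/TensorFlow) shows where such a conversion is switched on for a toy model with an embedding layer, whose gradients would otherwise be exchanged as sparse tensors; the `sparse_as_dense` flag and the model itself are assumptions for illustration, not the authors' exact code.

```python
# Sketch: forcing dense gradient reduction in a Horovod/TensorFlow setup.
# The sparse_as_dense flag (assumed available in recent Horovod releases)
# converts IndexedSlices gradients to dense tensors so that allreduce
# replaces the more memory-hungry allgather path.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=32000, output_dim=512),  # sparse grads
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32000, activation="softmax"),
])

opt = tf.keras.optimizers.Adam(1e-3 * hvd.size())
# Wrap the optimizer; sparse_as_dense densifies embedding gradients before
# the collective, trading a one-off conversion for a cheaper reduction.
opt = hvd.DistributedOptimizer(opt, sparse_as_dense=True)

model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)
```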
Learning Neural Representations for Predicting GPU Performance
Abstract
Graphics processing units (GPUs) have become a primary source of heterogeneity in today’s computing systems. With the rapid increase in the number and types of GPUs available, finding the best hardware accelerator for each application is a challenge. Moreover, it is time-consuming and tedious to execute every application on every GPU system to learn the correlation between application properties and hardware characteristics. To address this problem, we extend our previously proposed collaborative-filtering-based modeling technique to build an analytical model which can predict the performance of applications across different GPU systems. Our model learns representations, or embeddings (dense vectors of latent features), for applications and systems and uses them to characterize the performance of various GPU-accelerated applications. We improve on the state-of-the-art matrix-factorization-based collaborative filtering approach by building a multi-layer perceptron. In addition to increased accuracy in predicting application performance, we can use this model to simultaneously predict multiple metrics such as rates of memory access operations. We evaluate our approach on a set of 30 well-known micro-applications and seven Nvidia GPUs. As a result, we can predict the expected instructions-per-second value with 90.6% accuracy on average.
Shweta Salaria, Aleksandr Drozd, Artur Podobas, Satoshi Matsuoka
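A minimal sketch of the neural collaborative-filtering idea described above, written in PyTorch: one embedding per application, one per GPU system, and a multi-layer perceptron mapping the concatenated embeddings to a performance metric. Layer widths and the predicted quantity are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class PerfPredictor(nn.Module):
    def __init__(self, n_apps=30, n_gpus=7, dim=16):
        super().__init__()
        self.app_emb = nn.Embedding(n_apps, dim)   # latent features per application
        self.gpu_emb = nn.Embedding(n_gpus, dim)   # latent features per GPU system
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),                      # e.g. predicted log(instructions/s)
        )

    def forward(self, app_idx, gpu_idx):
        x = torch.cat([self.app_emb(app_idx), self.gpu_emb(gpu_idx)], dim=-1)
        return self.mlp(x).squeeze(-1)

model = PerfPredictor()
pred = model(torch.tensor([0, 3]), torch.tensor([1, 5]))  # two (app, GPU) pairs
```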

Data, Storage and Visualization

Frontmatter
SLOPE: Structural Locality-Aware Programming Model for Composing Array Data Analysis
Abstract
MapReduce brought on the Big Data revolution. However, its impact on scientific data analyses has been limited because of fundamental limitations in its data and programming models. Scientific data is typically stored as multidimensional arrays, while MapReduce is based on key-value (KV) pairs. Applying MapReduce to analyze array-based scientific data requires a conversion of arrays to KV pairs. This conversion incurs a large storage overhead and loses structural information embedded in the array. For example, analysis operations, such as convolution, are defined on the neighbors of an array element. Accessing these neighbors is straightforward using array indexes, but requires complex and expensive operations like self-join in the KV data model. In this work, we introduce a novel ‘structural locality’-aware programming model (SLOPE) to compose data analysis directly on multidimensional arrays. We also develop a parallel execution engine for SLOPE to transparently partition the data, to cache intermediate results, to support in-place modification, and to recover from failures. Our evaluations with real applications show that SLOPE is over ninety thousand times faster than Apache Spark and is 38% faster than TensorFlow.
Bin Dong, Kesheng Wu, Suren Byna, Houjun Tang
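To illustrate the contrast the abstract draws, the following Python sketch applies a neighborhood operator directly on a multidimensional array via indexing, the access pattern that a key-value model would have to reconstruct with expensive joins. The operator and halo width are illustrative assumptions; this is not the SLOPE API.

```python
import numpy as np

def apply_stencil(arr, op, halo=1):
    """Apply `op` to each interior cell's (2*halo+1)x(2*halo+1) neighborhood."""
    out = np.zeros_like(arr)
    for i in range(halo, arr.shape[0] - halo):
        for j in range(halo, arr.shape[1] - halo):
            # Structural locality: neighbors are reached by plain array indexing.
            neighborhood = arr[i - halo:i + halo + 1, j - halo:j + halo + 1]
            out[i, j] = op(neighborhood)
    return out

data = np.random.rand(64, 64)
smoothed = apply_stencil(data, np.mean)   # 3x3 box filter via direct indexing
```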
A Near-Data Processing Server Architecture and Its Impact on Data Center Applications
Abstract
Existing near-data processing (NDP) techniques have demonstrated their strength for some specific data-intensive applications. However, they might be inadequate for a data center server, which normally needs to perform a diverse range of applications from data-intensive to compute-intensive. How to develop a versatile NDP-powered server to support various data center applications remains an open question. Further, a good understanding of the impact of NDP on data center applications is still missing. For example, can a compute-intensive application also benefit from NDP? Which type of NDP engine is a better choice, an FPGA-based engine or an ARM-based engine? To address these issues, we first propose a new NDP server architecture that tightly couples each SSD with a dedicated NDP engine to fully exploit the data transfer bandwidth of an SSD array. Based on the architecture, two NDP servers ANS (ARM-based NDP Server) and FNS (FPGA-based NDP Server) are introduced. Next, we implement a single-engine prototype for each of them. Finally, we measure performance, energy efficiency, and cost/performance ratio of six typical data center applications running on the two prototypes. Some new findings have been observed.
Xiaojia Song, Tao Xie, Stephen Fischer
Comparing the Efficiency of In Situ Visualization Paradigms at Scale
Abstract
This work compares the two major paradigms for doing in situ visualization: in-line, where the simulation and visualization share the same resources, and in-transit, where simulation and visualization are given dedicated resources. Our runs vary many parameters, including simulation cycle time, visualization frequency, and dedicated resources, to study how tradeoffs change over configuration. In particular, we consider simulations as large as 1,024 nodes (16,384 cores) and dedicated visualization resources with as many as 512 nodes (8,192 cores). We draw conclusions about when each paradigm is superior, such as in-line being superior when the simulation cycle time is very fast. Surprisingly, we also find that in-transit can minimize the total resources consumed for some configurations, since it can cause the visualization routines to require fewer overall resources when they run at lower concurrency. For example, one of our scenarios finds that allocating 25% more resources for visualization allows the simulation to run 61% faster than its in-line comparator. Finally, we explore various models for quantifying the cost for each paradigm, and consider transition points when one paradigm is superior to the other. Our contributions inform design decisions for simulation scientists when performing in situ visualization.
James Kress, Matthew Larsen, Jong Choi, Mark Kim, Matthew Wolf, Norbert Podhorszki, Scott Klasky, Hank Childs, David Pugmire
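A toy cost model along the lines discussed in the abstract, comparing node-time consumed by in-line and in-transit visualization; all parameters and formulas are illustrative assumptions rather than the paper's measurements or models.

```python
def inline_cost(sim_nodes, n_cycles, t_sim, t_vis):
    # Simulation and visualization alternate on the same nodes.
    return sim_nodes * n_cycles * (t_sim + t_vis)

def intransit_cost(sim_nodes, vis_nodes, n_cycles, t_sim, t_transfer, t_vis):
    # Dedicated visualization nodes overlap with the next simulation cycle;
    # if visualization cannot keep up, the simulation stalls waiting for it.
    cycle_time = t_transfer + max(t_sim, t_vis)
    return (sim_nodes + vis_nodes) * n_cycles * cycle_time

print(inline_cost(1024, 100, 10.0, 4.0))               # node-seconds, in-line
print(intransit_cost(1024, 256, 100, 10.0, 0.5, 4.0))  # node-seconds, in-transit
```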

Emerging Technologies

Frontmatter
Layout-Aware Embedding for Quantum Annealing Processors
Abstract
Due to the physical limit in connectivity between qubits in Quantum Annealing Processors (QAPs), when sampling from a problem formulated as an Ising graph model, it is necessary to embed the problem onto the physical lattice of qubits. A valid mapping of the problem nodes into qubits often requires qubit chains to ensure connectivity.
We introduce the concept of layout-awareness for embedding, wherein information about the layout of the input and target graphs is used to guide the allocation of qubits to each problem node. We then evaluate the consequent impact on the sampling distribution obtained from D-Wave’s QAP, and provide a set of tools to assist developers in targeting QAP architectures using layout-awareness. We quantify the impact of a layout-agnostic and a layout-aware embedding algorithm on (a) the success rate and time of finding valid embeddings, (b) the metrics of the resulting chains and interactions, and (c) the energy profile of the annealing samples. The latter results are obtained by running experiments on a D-Wave Quantum Annealer, and are directly related to the ability of the device to solve complex problems.
Our technique effectively reduces the search space, which improves the time and success rate of the embedding algorithm and/or finds mappings that result in lower energy samples from the QAP. Together, these contributions are an important step towards an understanding of how near-future Computer-Aided Design (CAD) tools can work in concert with quantum computing technologies to solve previously intractable problems.
Jose P. Pinilla, Steven J. E. Wilton
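For readers unfamiliar with minor embedding, the Python sketch below shows a layout-agnostic baseline using the open-source minorminer and dwave_networkx packages (assumed installed); the chain-length metrics printed at the end are the kind of quantities a layout-aware embedder aims to improve. This is not the authors' tool.

```python
import networkx as nx
import dwave_networkx as dnx
import minorminer

problem = nx.random_regular_graph(3, 40, seed=1)   # logical Ising problem graph
target = dnx.chimera_graph(16)                     # physical qubit lattice

# Layout-agnostic embedding: each problem node maps to a chain of physical qubits.
embedding = minorminer.find_embedding(problem.edges, target.edges)

chain_lengths = [len(chain) for chain in embedding.values()]
print(max(chain_lengths), sum(chain_lengths))      # chain metrics to compare
```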

HPC Algorithms

Frontmatter
Toward Efficient Architecture-Independent Algorithms for Dynamic Programs
Abstract
We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.
Two-way recursive divide-and-conquer algorithms are known for solving dynamic programming (DP) problems on shared-memory multicore machines. In this paper, we show how to extend them to run efficiently also on manycore GPUs and distributed-memory machines.
Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive r-way divide and conquer, where r (≥ 2) varies based on the current depth of the recursion. Our distributed-memory algorithms are also based on multi-way recursive divide and conquer that extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds.
We also report empirical results for our GPU and distributed-memory algorithms.
Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, Rezaul Chowdhury
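The grid-based recursive divide-and-conquer paradigm can be illustrated on a small dynamic program. The Python sketch below fills an LCS table by recursing over quadrants in dependency order; it demonstrates the decomposition only and is not the paper's algorithm.

```python
import numpy as np

def fill_cell(T, a, b, i, j):
    if a[i - 1] == b[j - 1]:
        T[i, j] = T[i - 1, j - 1] + 1
    else:
        T[i, j] = max(T[i - 1, j], T[i, j - 1])

def solve(T, a, b, i0, i1, j0, j1, base=8):
    # Fill T[i0:i1, j0:j1] recursively; each cell depends only on cells above/left,
    # so the quadrant order top-left, {top-right, bottom-left}, bottom-right is valid.
    if i1 - i0 <= base or j1 - j0 <= base:
        for i in range(i0, i1):
            for j in range(j0, j1):
                fill_cell(T, a, b, i, j)
        return
    im, jm = (i0 + i1) // 2, (j0 + j1) // 2
    solve(T, a, b, i0, im, j0, jm, base)   # top-left first
    solve(T, a, b, i0, im, jm, j1, base)   # top-right and bottom-left
    solve(T, a, b, im, i1, j0, jm, base)   # (these two could run in parallel)
    solve(T, a, b, im, i1, jm, j1, base)   # bottom-right last

a, b = "dynamicprogramming", "divideandconquer"
T = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
solve(T, a, b, 1, len(a) + 1, 1, len(b) + 1)
print(T[len(a), len(b)])   # LCS length
```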

HPC Applications

Frontmatter
Petaflop Seismic Simulations in the Public Cloud
Abstract
During the last decade cloud services and infrastructure as a service became a popular solution for diverse applications. Additionally, hardware support for virtualization closed performance gaps, compared to on-premises, bare-metal systems. This development is driven by offloaded hypervisors and full CPU virtualization. Today’s cloud service providers, such as Amazon or Google, offer the ability to assemble application-tailored clusters to maximize performance. However, from an interconnect point of view, one has to tackle a 4–5× slowdown in terms of bandwidth and 25× in terms of latency, compared to the latest high-speed and low-latency interconnects. Taking into account the high per-node and accelerator-driven performance of latest supercomputers, we observe that the network-bandwidth performance of recent cloud offerings is within 2× of large supercomputers. In order to address these challenges, we present a comprehensive application-centric approach for high-order seismic simulations utilizing the ADER discontinuous Galerkin finite element method, which exhibits excellent communication characteristics. This covers the tuning of the operating system, normally not possible on supercomputers, micro-benchmarking, and finally, the efficient execution of our solver in the public cloud. Due to this performance-oriented end-to-end workflow, we were able to achieve 1.09 PFLOPS on 768 AWS c5.18xlarge instances, offering 27,648 cores with 5 PFLOPS of theoretical computational power. This corresponds to an achieved peak efficiency of over 20% and a close-to 90% parallel efficiency in a weak scaling setup. In terms of strong scalability, we were able to strong-scale a science scenario from 2 to 64 instances with 60% parallel efficiency. This work is, to the best of our knowledge, the first of its kind at such a large scale.
Alexander Breuer, Yifeng Cui, Alexander Heinecke
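The efficiency figures quoted above can be checked with simple arithmetic (numbers taken from the abstract; the strong-scaling helper is generic):

```python
achieved_pflops = 1.09
theoretical_pflops = 5.0            # 768 c5.18xlarge instances, 27,648 cores
peak_efficiency = achieved_pflops / theoretical_pflops
print(f"peak efficiency: {peak_efficiency:.1%}")   # ~21.8%, i.e. "over 20%"

def strong_scaling_efficiency(t_base, t_scaled, n_base, n_scaled):
    # Parallel efficiency for a fixed problem size: ideal speedup / actual speedup.
    return t_base / (t_scaled * (n_scaled / n_base))
```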
MaLTESE: Large-Scale Simulation-Driven Machine Learning for Transient Driving Cycles
Abstract
Optimal engine operation during a transient driving cycle is the key to achieving greater fuel economy, engine efficiency, and reduced emissions. In order to achieve continuously optimal engine operation, engine calibration methods use a combination of static correlations obtained from dynamometer tests for steady-state operating points and road and/or track performance data. As the parameter space of control variables, design variable constraints, and objective functions increases, the cost and duration for optimal calibration become prohibitively large. In order to reduce the number of dynamometer tests required for calibrating modern engines, a large-scale simulation-driven machine learning approach is presented in this work. A parallel, fast, robust, physics-based reduced-order engine simulator is used to obtain performance and emission characteristics of engines over a wide range of control parameters under various transient driving conditions (drive cycles). We scale the simulation up to 3,906 nodes of the Theta supercomputer at the Argonne Leadership Computing Facility to generate data required to train a machine learning model. The trained model is then used to predict various engine parameters of interest, and the results are compared with those predicted by the engine simulator. Our results show that a deep-neural-network-based surrogate model achieves high accuracy: Pearson product-moment correlation values larger than 0.99 and mean absolute percentage error within 1.07% for various engine parameters such as exhaust temperature, exhaust pressure, nitric oxide, and engine torque. Once trained, the deep-neural-network-based surrogate model is fast for inference: it requires about 16 µs for predicting the engine performance and emissions for a single design configuration compared with about 0.5 s per configuration with the engine simulator. Moreover, we demonstrate that transfer learning and retraining can be leveraged to incrementally retrain the surrogate model to cope with new configurations that fall outside the training data space.
Shashi M. Aithal, Prasanna Balaprakash
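The two accuracy metrics quoted in the abstract can be computed as follows (NumPy; the sample arrays are placeholders, not the paper's data):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def pearson(y_true, y_pred):
    """Pearson product-moment correlation coefficient."""
    return np.corrcoef(y_true, y_pred)[0, 1]

y_sim = np.array([612.0, 655.0, 701.0, 740.0])   # e.g. exhaust temperature from the simulator
y_nn  = np.array([608.0, 660.0, 697.0, 744.0])   # surrogate-model predictions
print(mape(y_sim, y_nn), pearson(y_sim, y_nn))
```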

Performance Modeling and Measurement

Frontmatter
PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems
Abstract
In high-performance computing, many performance problems are caused by the memory system. Because such performance bugs are hard to identify, analysis tools play an important role in performance optimization. Today’s processors offer feature-rich performance monitoring units with support for instruction sampling, but existing tools only partially use this data. Previously, performance counters were used to measure memory bandwidth, but attributing high bandwidth usage to source code has been difficult and imprecise. We introduce a novel method for identifying performance-degrading bandwidth usage and attributing it to specific objects and source code lines. This paper also introduces a new method for false sharing detection. It can differentiate false from true sharing and identify the objects and source code lines where accesses to falsely shared objects occur, uncovering false sharing that has been overlooked by previous tools. PerfMemPlus automatically reports those issues by using instruction sampling data captured with a single profiling run. This simplifies the tedious search for the location of performance problems in complex code. The tool design is simple, it supports many existing and upcoming processors, and the recorded data can easily be used in future research. We show that PerfMemPlus can automatically report performance problems without producing false positives. Additionally, we present case studies that show how PerfMemPlus can pinpoint memory performance problems in the PARSEC benchmarks and machine learning applications.
Christian Helm, Kenjiro Taura
GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications
Abstract
We present GPUMixer, a tool to perform mixed-precision floating-point tuning on scientific GPU applications. While precision tuning techniques are available, they are designed for serial programs and are accuracy-driven, i.e., they consider configurations that satisfy accuracy constraints, but these configurations may degrade performance. GPUMixer, in contrast, presents a performance-driven approach for tuning. We introduce a novel static analysis that finds Fast Imprecise Sets (FISets), sets of operations on low precision that minimize type conversions, which often yield performance speedups. To estimate the relative error introduced by GPU mixed-precision, we propose shadow computations analysis for GPUs, the first of this class for multi-threaded applications. GPUMixer obtains performance improvements of up to 46.4% of the ideal speedup in comparison to only 20.7% found by state-of-the-art methods.
Ignacio Laguna, Paul C. Wood, Ranvijay Singh, Saurabh Bagchi
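A small, generic illustration of why mixed-precision tuning needs an error estimate: the same reduction evaluated in float32 and float64, with the higher-precision result acting as the reference. This is not GPUMixer's shadow-computation analysis, just the underlying idea of comparing against a higher-precision "shadow".

```python
import numpy as np

x64 = np.random.default_rng(0).random(1_000_000, dtype=np.float64)
x32 = x64.astype(np.float32)

ref = x64.sum()                               # higher-precision reference result
low = float(x32.sum(dtype=np.float32))        # low-precision result
rel_err = abs(ref - low) / abs(ref)
print(f"relative error of float32 sum: {rel_err:.2e}")
```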
Performance Exploration Through Optimistic Static Program Annotations
Abstract
Compilers are limited by the static information directly or indirectly encoded in the program. Low-level languages, such as C/C++, are considered problematic as their weak type system and relaxed memory semantics allow for various, sometimes non-obvious, behaviors. Since compilers have to preserve the program semantics for all program executions, the existence of exceptional behavior can prevent optimizations that the developer would consider valid and might expect. Analyses to guarantee the absence of disruptive and unlikely situations are consequently an indispensable part of an optimizing compiler. However, such analyses have to be approximative and limited in scope, as global and exact solutions are infeasible for any non-trivial program.
In this paper, we present an automated tool to measure the effect missing static information has on the optimizations applied to a given program. The approach generates an optimistically optimized program version which, compared to the original, defines a performance gap that can be closed by better compiler analyses and selective static program annotations.
Our evaluation on six already tuned proxy applications for high-performance codes shows speedups of up to 20.6%. This clearly indicates that static uncertainty limits performance. At the same time, we observed that compilers are often unable to utilize additional static information. Manually annotating all correct static information is thus not only error-prone but also mostly redundant.
Johannes Doerfert, Brian Homerding, Hal Finkel

Programming Models and Systems Software

Frontmatter
End-to-End Resilience for HPC Applications
Abstract
A plethora of resilience techniques have been investigated to protect application kernels. If, however, such techniques are combined and they interact across kernels, new vulnerability windows are created. This work contributes the idea of end-to-end resilience by protecting windows of vulnerability between kernels guarded by different resilience techniques. It introduces the live vulnerability factor (LVF), a new metric that quantifies any lack of end-to-end protection for a given data structure. The work further promotes end-to-end application protection across kernels via a pragma-based specification for diverse resilience schemes with minimal programming effort. This lifts the data protection burden from application programmers allowing them to focus solely on algorithms and performance while resilience is specified and subsequently embedded into the code through the compiler/library and supported by the runtime system. In experiments with case studies and benchmarks, end-to-end resilience has an overhead over kernel-specific resilience of less than 3% on average and increases protection against bit flips by a factor of three to four.
Arash Rezaei, Harsh Khetawat, Onkar Patil, Frank Mueller, Paul Hargrove, Eric Roman
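A hedged sketch of the window-of-vulnerability idea: data is protected with a checksum as it leaves one kernel and verified immediately before the next kernel consumes it, so the hand-off between differently protected kernels is covered too. Function names are illustrative, not the paper's pragma-based interface.

```python
import hashlib
import numpy as np

def checksum(arr: np.ndarray) -> str:
    return hashlib.sha256(arr.tobytes()).hexdigest()

def kernel_a(n):
    data = np.arange(n, dtype=np.float64)
    return data, checksum(data)           # protect data as it leaves kernel A

def kernel_b(data, expected):
    if checksum(data) != expected:        # verify on entry to kernel B
        raise RuntimeError("silent data corruption detected between kernels")
    return data.sum()

data, sig = kernel_a(1024)
print(kernel_b(data, sig))
```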
Resilient Optimistic Termination Detection for the Async-Finish Model
Abstract
Driven by increasing core count and decreasing mean-time-to-failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to decompose a computation into nested task groups, each managed by a ‘finish’ that signals the termination of all tasks within the group.
For distributed termination detection, maintaining a consistent view of task state across multiple unreliable processes requires additional book-keeping when creating or completing tasks and finish-scopes. Runtime systems which perform this book-keeping pessimistically, i.e. synchronously with task state changes, add a high communication overhead compared to non-resilient protocols. In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model. By avoiding the communication of certain task and finish events, this protocol allows uncertainty about the global structure of the computation which can be resolved correctly at failure time, thereby reducing the overhead for failure-free execution.
Performance results using micro-benchmarks and the LULESH hydrodynamics proxy application show significant reductions in resilience overhead with optimistic finish compared to pessimistic finish. Our optimistic finish protocol is applicable to any task-based runtime system offering automatic termination detection for dynamic graphs of non-migratable tasks.
Sara S. Hamouda, Josh Milthorpe
Global Task Data-Dependencies in PGAS Applications
Abstract
Recent years have seen the emergence of two independent programming models challenging the traditional two-tier combination of message passing and thread-level work-sharing: partitioned global address space (PGAS) and task-based concurrency. In the PGAS programming model, synchronization and communication between processes are decoupled, providing significant potential for reducing communication overhead. At the same time, task-based programming makes it possible to exploit a large degree of shared-memory concurrency. The inherent lack of fine-grained synchronization in PGAS can be addressed through fine-grained task synchronization across process boundaries. In this work, we propose the use of task data dependencies describing the data-flow in the global address space to synchronize the execution of tasks created in parallel on multiple processes. We present a description of the global data dependencies, describe the necessary interactions between the distributed scheduler instances required to handle them, and discuss our implementation in the context of the DASH PGAS framework. We evaluate our approach using the Blocked Cholesky Factorization and the LULESH proxy app, demonstrating the feasibility and scalability of our approach.
Joseph Schuchart, José Gracia
Finepoints: Partitioned Multithreaded MPI Communication
Abstract
The MPI multithreading model has been historically difficult to optimize; the interface that it provides for threads was designed as a process-level interface. This model has led to implementations that treat function calls as critical regions and protect them with locks to avoid race conditions. We hypothesize that an interface designed specifically for threads can provide better performance than current approaches and even outperform single-threaded MPI.
In this paper, we describe a design for partitioned communication in MPI that we call finepoints. First, we assess the existing communication models for MPI two-sided communication and then introduce finepoints as a hybrid that combines the best features of each existing MPI communication model. In addition, “partitioned communication” created with finepoints leverages new network hardware features that cannot be exploited with current MPI point-to-point semantics, making this new approach both innovative and useful now and in the future.
To demonstrate the validity of our hypothesis, we implement a finepoints library and show improvements against a state-of-the-art multithreaded optimized Open MPI implementation on a Cray XC40 with an Aries network. Our experiments demonstrate up to a 12× reduction in wait time for completion of send operations. This new model is shown working on a nuclear reactor physics neutron-transport proxy-application, providing up to 26.1% improvement in communication time and up to 4.8% improvement in runtime over the best-performing MPI communication mode, single-threaded MPI.
Ryan E. Grant, Matthew G. F. Dosanjh, Michael J. Levenhagen, Ron Brightwell, Anthony Skjellum
Backmatter
Metadata
Title
High Performance Computing
Edited by
Dr. Michèle Weiland
Dr. Guido Juckeland
Carsten Trinitis
Ponnuswamy Sadayappan
Copyright year
2019
Electronic ISBN
978-3-030-20656-7
Print ISBN
978-3-030-20655-0
DOI
https://doi.org/10.1007/978-3-030-20656-7