
2014 | Book

Supercomputing

29th International Conference, ISC 2014, Leipzig, Germany, June 22-26, 2014. Proceedings

Edited by: Julian Martin Kunkel, Thomas Ludwig, Hans Werner Meuer

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About This Book

This book constitutes the refereed proceedings of the 29th International Supercomputing Conference, ISC 2014, held in Leipzig, Germany, in June 2014.

The 34 revised full papers presented in this volume were carefully reviewed and selected from 79 submissions. The papers cover the following topics: scalable applications with 50K+ cores; advances in algorithms; scientific libraries; programming models; architectures; performance models and analysis; automatic performance optimization; parallel I/O; and energy efficiency.

Table of Contents

Frontmatter

Regular Papers

Sustained Petascale Performance of Seismic Simulations with SeisSol on SuperMUC

Seismic simulations in realistic 3D Earth models require peta- or even exascale computing power to capture small-scale features of high relevance for scientific and industrial applications. In this paper, we present optimizations of SeisSol – a seismic wave propagation solver based on the Arbitrary high-order accurate DERivative Discontinuous Galerkin (ADER-DG) method on fully adaptive, unstructured tetrahedral meshes – to run simulations under production conditions at petascale performance. Improvements cover the entire simulation chain: from an enhanced ADER time integration via highly scalable routines for mesh input up to hardware-aware optimization of the innermost sparse-/dense-matrix kernels. Strong and weak scaling studies on the SuperMUC machine demonstrated up to 90% parallel efficiency and 45% floating point peak efficiency on 147k cores. For a simulation under production conditions (10^8 grid cells, 5·10^10 degrees of freedom, 5 seconds of simulated time), we achieved a sustained performance of 1.09 PFLOPS.

Alexander Breuer, Alexander Heinecke, Sebastian Rettenberger, Michael Bader, Alice-Agnes Gabriel, Christian Pelties
SNAP: Strong Scaling High Fidelity Molecular Dynamics Simulations on Leadership-Class Computing Platforms

The rapidly improving compute capability of contemporary processors and accelerators is providing the opportunity for significant increases in the accuracy and fidelity of scientific calculations. In this paper we present performance studies of a new molecular dynamics (MD) potential called SNAP.

The SNAP potential has shown great promise in accurately reproducing physics and chemistry not described by simpler potentials. We have developed new algorithms to exploit high single-node concurrency provided by three different classes of machine: the Titan GPU-based system operated by Oak Ridge National Laboratory, the combined Sequoia and Vulcan BlueGene/Q machines located at Lawrence Livermore National Laboratory, and the large-scale Intel Sandy Bridge system, Chama, located at Sandia.

Our analysis focuses on strong scaling experiments with approximately 246,000 atoms over the range of 1–122,880 nodes on Sequoia/Vulcan and 40–18,630 nodes on Titan. We compare these machines in terms of both simulation rate and power efficiency. We find that node performance correlates with power consumption across the range of machines, except in the case of extreme strong scaling, where more powerful compute nodes show greater efficiency.

This study is a unique assessment of a challenging, scientifically relevant calculation running on several of the world’s leading contemporary production supercomputing platforms.

Christian R. Trott, Simon D. Hammond, Aidan P. Thompson
Exascale Radio Astronomy: Can We Ride the Technology Wave?

The Square Kilometre Array (SKA) will be the most sensitive radio telescope in the world. This unprecedented sensitivity will be achieved by combining and analyzing signals from 262,144 antennas and 350 dishes at a raw data rate of petabits per second. The processing pipeline to create useful astronomical data will require exa-operations per second, at a very limited power budget. We analyze the compute, memory, and bandwidth requirements for the key algorithms used in the SKA. By studying their implementation on existing platforms, we show that most algorithms have properties that map inefficiently onto current hardware, such as a low compute-to-bandwidth ratio and complex arithmetic. In addition, we estimate the power breakdown on CPUs and GPUs, analyze the cache behavior on CPUs, and discuss possible improvements. This work is complemented with an analysis of supercomputer trends, which demonstrates that current efforts to use commercial off-the-shelf accelerators result in a two to three times smaller improvement in compute capabilities and power efficiency than custom-built machines. We conclude that waiting for new technology to arrive will not give us the instruments currently planned for 2018: one or two orders of magnitude better power efficiency and compute capability are required. Novel hardware and system architectures, matched to the needs and features of this unique project, must be developed.
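The compute-to-bandwidth argument can be made concrete with a roofline-style estimate; the sketch below illustrates the idea, with peak and bandwidth figures that are purely illustrative assumptions, not values from the paper:

```python
# Roofline-style sketch: attainable throughput is capped either by the peak
# compute rate or by memory bandwidth times arithmetic intensity (flop/byte).

def attainable_gflops(intensity, peak_gflops, bw_gbs):
    """Return the attainable GFLOP/s for a kernel of given arithmetic intensity."""
    return min(peak_gflops, intensity * bw_gbs)

# Assumed machine: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
# A streaming kernel at 0.5 flop/byte is bandwidth-bound ...
low = attainable_gflops(0.5, 1000.0, 100.0)    # 50 GFLOP/s
# ... while a compute-rich kernel at 20 flop/byte hits the compute roof.
high = attainable_gflops(20.0, 1000.0, 100.0)  # 1000 GFLOP/s
```

A low compute-to-bandwidth ratio, as reported for most SKA kernels, keeps a kernel on the sloped (bandwidth-limited) part of this model regardless of peak compute.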

Erik Vermij, Leandro Fiorin, Christoph Hagleitner, Koen Bertels
On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures

With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel’s Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area — structured grid codes — and investigated techniques for ensuring performance portability across a diverse range of different, high-end many-core architectures. We chose three codes to investigate: a 3D lattice Boltzmann code (D3Q19 BGK), the CloverLeaf hydrodynamics mini application from Sandia’s Mantevo benchmark suite, and ROTORSIM, a production-quality structured grid, multiblock, compressible finite-volume CFD code. We have developed OpenCL versions of these codes in order to provide cross-platform functional portability, and compared the performance of the OpenCL versions of these structured grid codes to optimized versions on each platform, including hybrid OpenMP/MPI/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Our results show that, contrary to conventional wisdom, using OpenCL it is possible to achieve a high degree of performance portability, at least for structured grid applications, using a set of straightforward techniques. The performance portable code in OpenCL is also highly competitive with the best performance using the native parallel programming models on each platform.
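The memory-access pattern shared by such structured grid codes can be illustrated with a minimal 5-point Jacobi sweep; this pure-Python sketch is a generic stand-in (not the authors' OpenCL kernels), with an arbitrary grid size and boundary condition:

```python
# One 5-point Jacobi relaxation sweep over the interior of a 2D grid:
# every interior cell is replaced by the average of its four neighbours,
# the nearest-neighbour access pattern typical of structured grid codes.

def jacobi_sweep(u):
    n = len(u)
    v = [row[:] for row in u]            # boundary cells stay fixed
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            v[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
    return v

n = 16
u = [[0.0] * n for _ in range(n)]
u[0] = [1.0] * n                         # fixed "hot" boundary on one edge
for _ in range(50):
    u = jacobi_sweep(u)
```

Production implementations express exactly this loop nest as an OpenCL or CUDA kernel, with one work-item per interior cell; the portability question the paper studies is how well that single expression performs across devices.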

Simon McIntosh-Smith, Michael Boulton, Dan Curran, James Price
Performance Predictions of Multilevel Communication Optimal LU and QR Factorizations on Hierarchical Platforms

In this paper we study the performance of two classical dense linear algebra algorithms, the LU and QR factorizations, on multilevel hierarchical platforms. We focus on the multilevel QR factorization and give a brief description of the multilevel LU factorization. We first introduce a performance model called Hierarchical Cluster Platform (Hcp), encapsulating the characteristics of such platforms. The focus is set on reducing the communication requirements of the studied algorithms at each level of the hierarchy. Lower bounds on communication are therefore extended with respect to the Hcp model. We then present a multilevel QR factorization algorithm tailored for those platforms, and provide a detailed performance analysis. We also provide a set of performance predictions showing the need for such hierarchical algorithms on large platforms.

Laura Grigori, Mathias Jacquelin, Amal Khabou
Hourglass: A Bandwidth-Driven Performance Model for Sorting Algorithms

We develop a bandwidth-driven performance model (referred to as Hourglass) for sorting algorithms. The model quantifies the dominant data movements inherent to sorting algorithms (e.g., accesses to/from large buffers and network communication) and estimates a lower-bound execution time. We validate the model with parallel radix sort and merge sort, as well as multi-node sample sort, on leadership-class high-performance IBM architectures.

The model helps to better understand the inherent bottlenecks in a sorting algorithm: users can leverage it to optimize software, redesign the algorithm, and/or analyze architectural what-if scenarios to explore innovative designs.
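The flavor of such a bandwidth-driven lower bound can be sketched for an out-of-place radix sort; the pass structure is standard, but the key sizes and bandwidth figure below are illustrative assumptions, not values from the paper:

```python
# Bandwidth-driven lower bound for an out-of-place LSD radix sort:
# each pass streams the full key array once in and once out, so the
# minimum time is total bytes moved divided by sustained bandwidth.

def radix_sort_lower_bound(n_keys, key_bytes, digit_bits, bandwidth_gbs):
    """Estimate a lower-bound sort time in seconds.

    n_keys        -- number of keys to sort
    key_bytes     -- size of one key in bytes
    digit_bits    -- radix digit width; passes = ceil(key_bits / digit_bits)
    bandwidth_gbs -- sustained memory bandwidth in GB/s (assumed, measured offline)
    """
    key_bits = key_bytes * 8
    passes = -(-key_bits // digit_bits)            # ceiling division
    bytes_moved = passes * 2 * n_keys * key_bytes  # one read + one write per pass
    return bytes_moved / (bandwidth_gbs * 1e9)

# 1 billion 8-byte keys, 8-bit digits (8 passes), 100 GB/s sustained bandwidth
t = radix_sort_lower_bound(10**9, 8, 8, 100)
```

Comparing a measured run against such a bound shows how far an implementation sits from the data-movement limit, which is the kind of what-if analysis the model enables.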

Doe Hyun Yoon, Fabrizio Petrini
Performance Analysis of Graph Algorithms on P7IH

The IBM Power 775 (P7IH) is the latest supercomputing system designed for high productivity and high performance. Its key innovation, a hub-chip-based network, gives it superior performance on traditional HPCC benchmarks. In this paper, we characterize in detail the bare network performance using a thin communication stack. Based on that, we present a systematic performance analysis of a data-intensive benchmark, Graph500's breadth-first search, on P7IH. We then provide insight into the overall interaction between hardware and software and present lessons learned about the key bottlenecks of both the architecture and data-intensive applications.

Xinyu Que, Fabio Checconi, Fabrizio Petrini
Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver

The last decade has seen rapid growth of single-chip multiprocessors (CMPs), which have been leveraging Moore's law to deliver high concurrency via increases in the number of cores and vector width. Modern CMPs execute from several hundred to several thousand concurrent operations, while their memory subsystems deliver from tens to hundreds of gigabytes per second of bandwidth.

Taking advantage of these parallel resources requires highly tuned parallel implementations of key computational kernels, which form the backbone of modern HPC. The sparse triangular solver is one such kernel and is the focus of this paper. It is widely used in several types of sparse linear solvers, and it is commonly considered challenging to parallelize and scale even on a moderate number of cores. This is because, compared to data-parallel operations such as sparse matrix-vector multiplication, the triangular solver typically has limited task-level parallelism and relies on fine-grain synchronization to exploit it.

This paper presents a synchronization sparsification technique that significantly reduces the overhead of synchronization in the sparse triangular solver and improves its scalability. We discover that a majority of task dependencies are redundant in the task dependency graphs used to model the flow of computation in the sparse triangular solver. We propose a fast and approximate sparsification algorithm that eliminates more than 90% of these dependencies, substantially reducing synchronization overhead. As a result, on a 12-core Intel® Xeon® processor, our approach improves the performance of the sparse triangular solver by 1.6x compared to conventional level-scheduling with barrier synchronization. This, in turn, leads to a 1.4x speedup in a preconditioned conjugate gradient solver.
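The idea of redundant dependencies can be sketched on a tiny task dependency graph; this is a generic illustration (dropping edges implied by a two-hop path), not the authors' algorithm or data structures:

```python
# Sketch of dependency sparsification for a level-scheduled sparse
# triangular solve: a dependency j -> i is redundant if it is implied
# transitively through another dependency (j -> k -> i), so the solver
# only needs to synchronize on the surviving edges.

def build_levels(deps):
    """deps[i] = set of rows that must finish before row i (lower triangular)."""
    level = {}
    for i in sorted(deps):  # rows are numbered in solve order
        level[i] = 1 + max((level[j] for j in deps[i]), default=0)
    return level

def sparsify(deps):
    """Drop edges implied by a two-hop path (approximate transitive reduction)."""
    slim = {}
    for i, parents in deps.items():
        reach = set().union(*(deps[j] for j in parents)) if parents else set()
        slim[i] = {j for j in parents if j not in reach}
    return slim

# Tiny lower-triangular example: row 3 depends on rows 0, 1, 2;
# row 2 depends on 0 and 1; row 1 depends on 0.
deps = {0: set(), 1: {0}, 2: {0, 1}, 3: {0, 1, 2}}
slim = sparsify(deps)   # row 3 keeps only {2}; row 2 keeps only {1}
```

Here 6 dependency edges shrink to 3 while preserving the same level schedule, which is the effect that lets a solver replace barriers with far fewer point-to-point synchronizations.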

Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, Pradeep Dubey
Scalability and Parallel Execution of OmpSs-OpenCL Tasks on Heterogeneous CPU-GPU Environment

With heterogeneous computing becoming mainstream, researchers and software vendors have been trying to exploit the best of underlying architectures such as GPUs and CPUs to enhance performance. Parallel programming models play a crucial role in achieving this enhancement. One such model is OpenCL, a cross-platform parallel computing API targeting heterogeneous architectures. However, OpenCL is a low-level programming model, so developing OpenCL code directly can be time-consuming. To address this shortcoming, OpenCL has been integrated with OmpSs, a task-based programming model that provides abstraction to the user, thereby reducing programmer effort. The OmpSs-OpenCL programming model previously dealt with a single OpenCL device, either a CPU or a GPU. In this paper, we extend the OmpSs-OpenCL programming model to support parallel execution of tasks across multiple CPU-GPU heterogeneous platforms. We discuss the design of the programming model along with its asynchronous runtime system. We investigated the scalability of four OmpSs-OpenCL benchmarks across 4 GPUs, achieving speedups of up to 4x. Further, in order to achieve effective utilization of the computing resources, we present static and work-stealing scheduling techniques. We show results of parallel execution of applications using the OmpSs-OpenCL model and use heterogeneous workloads to evaluate our scheduling techniques on a heterogeneous CPU-GPU platform.

Vinoth Krishnan Elangovan, Rosa M. Badia, Eduard Ayguadé
Automatic Exploration of Potential Parallelism in Sequential Applications

The multicore era has increased the need for highly parallel software. Since automatic parallelization has turned out to be ineffective for many production codes, the community hopes for tools that assist parallelization by providing hints to drive the parallelization process. In our previous work, we designed Tareador, a tool based on dynamic instrumentation that identifies potential task-based parallelism inherent in applications, and we showed how a programmer can use Tareador to explore the potential of different parallelization strategies. In this paper, we build on that work by automating the process of exploring parallelism. We have designed an environment that, given a sequential code and a configuration of the target parallel architecture, iteratively runs Tareador to find an efficient parallelization strategy. We propose an autonomous algorithm based on simple metrics and a cost function. The algorithm finds an efficient parallelization strategy and provides the programmer with sufficient information to turn it into an actual parallel program.

Vladimir Subotic, Eduard Ayguadé, Jesus Labarta, Mateo Valero
CoreTSAR: Adaptive Worksharing for Heterogeneous Systems

The popularity of heterogeneous computing continues to increase rapidly due to the high peak performance, favorable energy efficiency, and comparatively low cost of accelerators. However, heterogeneous programming models still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP models, including OpenMP 4.0 and OpenACC, ease the migration of code from CPUs to GPUs but lack much of OpenMP's flexibility: OpenMP applications can run on any number of CPUs without extra user effort, but GPU implementations do not offer similar adaptive worksharing across GPUs in a node, nor do they employ a mix of CPUs and GPUs. To address these shortcomings, we present CoreTSAR, our library for scheduling cores via a task-size adapting runtime system, which supports worksharing of loop nests across arbitrary heterogeneous resources. Beyond scheduling the computational load across devices, CoreTSAR includes a memory-management system that operates based on task association, enabling the runtime to dynamically manage memory movement and task granularity. Our evaluation shows that CoreTSAR can provide nearly linear scaling to four GPUs and all cores in a node without modifying the code within the parallel region. Furthermore, CoreTSAR provides portable performance across a variety of system configurations.

Thomas R. W. Scogland, Wu-chun Feng, Barry Rountree, Bronis R. de Supinski
History-Based Predictive Instruction Window Weighting for SMT Processors

In a Simultaneous Multi-Threaded (SMT) processor, threads share datapath resources, and the resource allocation policy directly affects throughput. As a form of explicit resource management, the resource requirements of threads are estimated from several runtime statistics, such as cache miss counts, issue queue usage, and efficiency metrics. Controlling processor resources indirectly by means of a fetch policy is also targeted in many recent studies. One successful technique, Speculative Instruction Window Weighting (SIWW), which speculates on the weights of instructions in the issue queue to indirectly manage SMT resource usage, was recently proposed and promises better performance than the well-accepted ICOUNT fetch policy. In this study, we propose an alternative fetch policy that implements SIWW-like logic using a history-based prediction mechanism, History-based Predictive Instruction Window Weighting (HPIWW), avoiding any type of speculation hardware and its inherent complexity. We show that HPIWW outperforms SIWW by 3% on average across all simulated workloads and dissipates 2.5 times less power than its rival.

Gurhan Kucuk, Gamze Uslu, Cagri Yesil
The Brand-New Vector Supercomputer, SX-ACE

Many current supercomputers pursue ever higher peak performance; however, the characteristics of scientific applications are diversifying, and their sustained performance strongly depends not only on the peak floating-point performance of the system but also on its memory bandwidth. NEC's goal is to provide superior sustained performance, especially for memory-intensive scientific applications. The brand-new SX-ACE vector supercomputer has been developed as the successor to the SX-9 to achieve this goal. The new vector processor features world-class single-core performance of 64 Gflop/s with the largest memory bandwidth of 64 GB/s per core. Four cores, memory controllers, and a network controller are integrated into the SX-ACE processor, enabling a processor performance of 256 Gflop/s with a memory bandwidth of 256 GB/s. To achieve higher sustained performance, the system is equipped with a specialized network interconnecting the processors, as well as a sophisticated vectorizing compiler and operating system.

Shintaro Momose, Takashi Hagiwara, Yoko Isobe, Hiroshi Takahara
Impact of Future Trends on Exascale Grid and Cloud Computing

This paper explores the impact of future trends on Exascale Grid/Cloud computing systems and data centers, including: (i) next-generation multi-core processors using 14 nm CMOS, (ii) next-generation photonic Integrated Circuit (IC) technologies, and (iii) a next-generation Enhanced-Internet 'lean' router. The new low-power processors offer ≈100 cores and a large embedded memory (eRAM) on one CMOS IC. Photonic ICs can potentially lower the size and energy requirements of systems significantly. The lean router supports deterministic TDM-based virtual circuit switching in a packet-switched IP network, which lowers router buffer sizes and queueing latencies by a factor of 1,000. Our analysis indicates that an entire 'lean' router (optical packet switch) can be fabricated on one photonic IC. Exascale roadmaps have called for: (i) energy reductions by factors of 100 by 2020, and (ii) predictive designs of ≈200 PetaFlop/sec systems which consume 15 MW by 2015. Using 2015 technology, we present high-level predictive designs of ≈100 PF/sec Grid and Cloud systems which use ≈13.1 MW for computation, and ≈0.5 and 1.5 MW for communications, respectively.

T. H. Szymanski
SADDLE: A Modular Design Automation Framework for Cluster Supercomputers and Data Centres

In this paper we present SADDLE, a modular framework for the automated design of cluster supercomputers and data centers. In contrast with commonly used approaches that operate at the logic gate level (Verilog, VHDL) or board level (such as EDA tools), SADDLE works at a much higher level of abstraction: its building blocks are ready-made servers, network switches, power supply systems, and so on. This modular approach provides the potential to include low-level tools as elements of SADDLE's design workflow, moving towards the goal of electronic system level (ESL) design automation. Designs produced by SADDLE include project documentation items such as bills of materials and wiring diagrams, providing a formal specification of a computer system and streamlining assembly operations.

Konstantin S. Solnushkin
The SIOX Architecture – Coupling Automatic Monitoring and Optimization of Parallel I/O

Performance analysis and optimization of high-performance I/O systems is a daunting task. Mainly, this is due to the overwhelmingly complex interplay of the involved hardware and software layers. The Scalable I/O for Extreme Performance (SIOX) project provides a versatile environment for monitoring I/O activities and learning from this information. The goal of SIOX is to automatically suggest and apply performance optimizations, and to assist in locating and diagnosing performance problems.

In this paper, we present the current status of SIOX. Our modular architecture covers instrumentation of POSIX, MPI, and other high-level I/O libraries; the monitoring data is recorded asynchronously into a global database, and recorded traces can be visualized. Furthermore, we offer a set of primitive plug-ins with additional features to demonstrate the flexibility of our architecture: a surveyor plug-in to keep track of the observed spatial access patterns; an fadvise plug-in for injecting hints to achieve read-ahead for strided access patterns; and an optimizer plug-in which monitors the performance achieved with different MPI-IO hints, automatically supplying the best known hint set when no hints were explicitly set. The presentation of the technical status is accompanied by a demonstration of some of these features on our 20-node cluster. In additional experiments, we analyze the overhead for concurrent access, for MPI-IO's four levels of access, and for an instrumented climate application.

While our prototype is not yet full-featured, it demonstrates the potential and feasibility of our approach.
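The hint mechanism behind an fadvise-style plug-in can be sketched with the standard POSIX advice call; this is a generic illustration using Python's `os.posix_fadvise`, not SIOX code, and the file layout (record size, stride) is an assumption:

```python
# Sketch of read-ahead hinting for a strided access pattern: before reading
# every other record, advise the kernel (POSIX_FADV_WILLNEED) to prefetch
# exactly the regions the stride will touch next.
import os
import tempfile

record, nrec = 4096, 8
path = tempfile.mktemp()
with open(path, "wb") as f:
    for i in range(nrec):                 # 8 fixed-size records, tagged 0..7
        f.write(bytes([i]) * record)

fd = os.open(path, os.O_RDONLY)
stride = 2 * record                       # access every other record

# Inject the hints for the whole strided pattern up front.
for off in range(0, nrec * record, stride):
    os.posix_fadvise(fd, off, record, os.POSIX_FADV_WILLNEED)

# The subsequent strided reads can now hit prefetched page cache.
tags = [os.pread(fd, 1, off)[0] for off in range(0, nrec * record, stride)]
os.close(fd)
os.remove(path)
```

An instrumentation layer that has observed the stride from a trace can issue exactly these hints on the application's behalf, which is the read-ahead effect the plug-in aims for.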

Julian M. Kunkel, Michaela Zimmer, Nathanael Hübbe, Alvaro Aguilera, Holger Mickler, Xuan Wang, Andriy Chut, Thomas Bönisch, Jakob Lüttgau, Roman Michel, Johann Weging
Framework and Modular Infrastructure for Automation of Architectural Adaptation and Performance Optimization for HPC Systems

High-performance systems have complex, diverse, and rapidly evolving architectures, and the span of applications, workloads, and resource use patterns is rapidly diversifying. Adapting applications for efficient execution on this spectrum of execution environments is effort-intensive. Many performance optimization tools implement one or several aspects of the full performance optimization task, but almost none are comprehensive across architectures, environments, applications, and workloads. This paper presents, illustrates, and applies a modular infrastructure which enables composition of multiple open-source tools and analyses into a set of workflows implementing comprehensive end-to-end optimization of a diverse spectrum of HPC applications on multiple architectures and for multiple resource types and parallel environments. It gives results from an implementation on the Stampede HPC system at the Texas Advanced Computing Center, where a user can submit an application for optimization using only a single command line and get back an at least partially optimized program, without manual program modification, for two different chips. Currently, only a subset of the possible optimizations is completely automated, but this subset is rapidly growing. Case studies of applications of the workflow are presented. The implementation, currently available for download as version 4.0 of the PerfExpert tool, supports both Sandy Bridge and Intel Xeon Phi chips.

Leonardo Fialho, James Browne
Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences

The Dynamic Connected (DC) InfiniBand transport protocol has recently been introduced by Mellanox to address several shortcomings of the older Reliable Connection (RC), eXtended Reliable Connection (XRC), and Unreliable Datagram (UD) transport protocols. DC aims to support all of the features provided by RC (such as RDMA, atomics, and hardware reliability) while allowing processes to communicate with any remote process through just one DC queue pair (QP), like UD. In this paper we present the salient features of the new DC protocol, including its connection and communication models. We design new verbs-level collective benchmarks to study the behavior of the new DC transport and understand the performance/memory trade-offs it presents. We then use this knowledge to propose multiple designs for MPI over DC. We evaluate an implementation of our design in the MVAPICH2 MPI library using standard MPI benchmarks and applications. To the best of our knowledge, this is the first such design of an MPI library over the new DC transport. Our experimental results at the microbenchmark level show that the DC-based design in MVAPICH2 is able to deliver 42% and 43% improvements in latency for large-message all-to-one exchanges over XRC and RC, respectively. DC-based designs are also able to give 20% and 8% improvements for small-message one-to-all exchanges over RC and XRC, respectively. For the all-to-all communication pattern, DC is able to deliver performance comparable to RC/XRC while outperforming them in memory consumption. At the application level, for NAMD on 620 processes, the DC-based designs in MVAPICH2 outperform designs based on RC, XRC, and UD by 22%, 10%, and 13%, respectively, in execution time. With DL-POLY, DC outperforms RC and XRC by 75% and 30%, respectively, in total completion time while delivering performance similar to UD.

Hari Subramoni, Khaled Hamidouche, Akshey Venkatesh, Sourav Chakraborty, Dhabaleswar K. Panda
RADAR: Runtime Asymmetric Data-Access Driven Scientific Data Replication

Efficient I/O on large-scale spatiotemporal scientific data requires scrutiny of both the logical layout of the data (e.g., row-major vs. column-major) and the physical layout (e.g., distribution on parallel filesystems). For increasingly complex datasets, hand optimization is difficult, error-prone, and not scalable to the increasing heterogeneity of analysis workloads. Given these factors, we present a partial data replication system called RADAR. We capture datatype- and collective-aware I/O access patterns (indicating logical access) via MPI-IO tracing and use a combination of coarse-grained and fine-grained performance modeling to evaluate and select optimized physical data distributions for the task at hand. Unlike conventional methods, we store all replica data and metadata, along with the original untouched data, under a single file container using the object abstraction in parallel filesystems. Our system results in manyfold improvements for some commonly used subvolume decomposition access patterns. Moreover, the modeling approach can determine whether such optimizations should be undertaken in the first place.

John Jenkins, Xiaocheng Zou, Houjun Tang, Dries Kimpe, Robert Ross, Nagiza F. Samatova
Fast Multiresolution Reads of Massive Simulation Datasets

Today’s massively parallel simulation codes can produce output ranging up to many terabytes of data. Utilizing this data to support scientific inquiry requires analysis and visualization, yet the sheer size of the data makes it cumbersome or impossible to read without computational resources similar to the original simulation. We identify two broad classes of problems for reading data and present effective solutions for both. The first class of data reads depends on user requirements and available resources. Tasks such as visualization and user-guided analysis may be accomplished using only a subset of variables with a restricted spatial extent at a reduced resolution. The other class of reads requires full resolution multivariate data to be loaded, for example to restart a simulation. We show that utilizing the hierarchical multiresolution IDX data format enables scalable and efficient serial and parallel read access on a variety of hardware from supercomputers down to portable devices. We demonstrate interactive view-dependent visualization and analysis of massive scientific datasets using low-power commodity hardware, and we compare read performance with other parallel file formats for both full and partial resolution data.
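The reduced-resolution read class can be illustrated with a simple power-of-two subsampling of a 2D field; this is a generic sketch only, and does not reproduce the hierarchical Z-order layout that the IDX format actually uses to make such reads efficient:

```python
# Generic multiresolution read sketch: level 0 returns full resolution;
# each further level returns every 2**level-th sample along both axes,
# so a viewer can fetch a small preview before (or instead of) full data.

def read_at_level(volume, level):
    step = 2 ** level
    return [row[::step] for row in volume[::step]]

# 8x8 synthetic field with value x + 8*y at grid point (x, y)
volume = [[x + 8 * y for x in range(8)] for y in range(8)]

coarse = read_at_level(volume, 2)  # 2x2 preview for interactive viewing
full = read_at_level(volume, 0)    # full-resolution read, e.g. for restart
```

The two calls correspond to the two read classes the paper identifies: a view-dependent, reduced-resolution subset versus a full-resolution load.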

Sidharth Kumar, Cameron Christensen, John A. Schmidt, Peer-Timo Bremer, Eric Brugger, Venkatram Vishwanath, Philip Carns, Hemanth Kolla, Ray Grout, Jacqueline Chen, Martin Berzins, Giorgio Scorzelli, Valerio Pascucci
Rebasing I/O for Scientific Computing: Leveraging Storage Class Memory in an IBM BlueGene/Q Supercomputer

Storage class memory is receiving increasing attention for use in HPC systems to accelerate I/O-intensive operations. We report on a particular instance: SLC flash memory integrated at scale with an IBM BlueGene/Q supercomputer (Blue Gene Active Storage, BGAS). We describe two principal modes of operation of the non-volatile memory: 1) block device and 2) direct storage access (DSA). The block device layer, built on the DSA layer, provides compatibility with I/O layers common to existing HPC I/O systems (POSIX, MPI-IO, HDF5) and is expected to provide high performance in bandwidth-critical use cases. The novel DSA strategy enables a low-overhead, byte-addressable, asynchronous, kernel-bypass access method for very high user-space IOPS in multithreaded application environments. Here, we expose DSA through HDF5 using a custom file driver. Benchmark results for the different modes are presented, and scale-out to full system size showcases the capabilities of this technology.

Felix Schürmann, Fabien Delalondre, Pramod S. Kumbhar, John Biddiscombe, Miguel Gila, Davide Tacchella, Alessandro Curioni, Bernard Metzler, Peter Morjan, Joachim Fenkes, Michele M. Franceschini, Robert S. Germain, Lars Schneidenbach, T. J. Christopher Ward, Blake G. Fitch
Orthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters

Optimization of access patterns using collective I/O imposes the overhead of exchanging data between processes. In a multi-core-based cluster the costs of inter-node and intra-node data communication are vastly different, and this heterogeneity in the efficiency of data exchange poses both a challenge and an opportunity for implementing efficient collective I/O. The opportunity is to effectively exploit fast intra-node communication, so we propose to improve communication locality for greater data exchange efficiency. However, such an effort is at odds with improving access locality for I/O efficiency, which can also be critical to collective-I/O performance. To address this issue we propose a framework, Orthrus, that can accommodate multiple collective-I/O implementations, each optimized for certain performance aspects, and dynamically select the best-performing one according to the current workload and system patterns. We have implemented Orthrus in the ROMIO library. Our experimental results with representative MPI-IO benchmarks on both a small dedicated cluster and a large production HPC system show that Orthrus can significantly improve collective I/O performance under various workloads and system scenarios.

Xuechen Zhang, Jianqiang Ou, Kei Davis, Song Jiang
Fast and Energy-efficient Breadth-First Search on a Single NUMA System

Breadth-first search (BFS) is an important graph analysis kernel. The Graph500 benchmark measures a computer's BFS performance using the traversed edges per second (TEPS) ratio. Our previous nonuniform memory access (NUMA)-optimized BFS reduced memory accesses to remote RAM on a NUMA architecture system; its performance was 11 GTEPS (giga TEPS) on a 4-way Intel Xeon E5-4640 system. Herein, we investigate the computational complexity of the bottom-up search, a major bottleneck in NUMA-optimized BFS, and clarify the relationship between vertex out-degree and bottom-up performance. In November 2013, our new implementation achieved a Graph500 benchmark performance of 37.66 GTEPS (fastest for a single node) on an SGI Altix UV1000 (one rack) and 31.65 GTEPS (fastest for a single server) on a 4-way Intel Xeon E5-4650 system. Furthermore, we achieved the highest Green Graph500 performance of 153.17 MTEPS/W (mega TEPS per watt) on an Xperia-A SO-04E with a Qualcomm Snapdragon S4 Pro APQ8064.
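The bottom-up step at the heart of this line of work can be sketched as follows; this is a generic serial illustration of direction-optimizing BFS, not the authors' NUMA-optimized implementation:

```python
# Bottom-up BFS step: instead of expanding the frontier outward, every
# unvisited vertex scans its own edges and joins the next frontier on the
# first neighbour found in the current frontier. For large frontiers this
# skips most edge traversals, which is why the step dominates performance.

def bottom_up_step(adj, frontier, parent):
    """adj: adjacency lists; frontier: set of vertices; parent: list, -1 = unvisited."""
    next_frontier = set()
    for v in range(len(adj)):
        if parent[v] != -1:
            continue                      # already visited
        for u in adj[v]:                  # scan candidate parents
            if u in frontier:
                parent[v] = u             # claim first parent, stop scanning
                next_frontier.add(v)
                break
    return next_frontier

# Small undirected graph: edges 0-1, 0-2, 1-3, 2-3
adj = [[1, 2], [0, 3], [0, 3], [1, 2]]
parent = [0, -1, -1, -1]                  # BFS rooted at vertex 0
f1 = bottom_up_step(adj, {0}, parent)     # level 1
f2 = bottom_up_step(adj, f1, parent)      # level 2
```

The early `break` is the source of the out-degree dependence the paper analyzes: how many edges each unvisited vertex scans before finding a parent determines the cost of the step.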

Yuichiro Yasui, Katsuki Fujisawa, Yukinori Sato
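The bottom-up step whose complexity the paper analyzes can be sketched as follows. This is a serial Python illustration of the general direction-optimizing BFS idea, not the paper's NUMA-optimized implementation; `adj`, `frontier`, and `parent` are illustrative names.

```python
# Bottom-up BFS sketch: each still-unvisited vertex scans its neighbours
# and adopts the first one found in the current frontier as its parent.
# The cost of this scan depends on vertex degree, which is the
# relationship the paper investigates.
def bottom_up_step(adj, frontier, parent):
    """adj: dict vertex -> list of neighbours; frontier: set of vertices."""
    next_frontier = set()
    for v in adj:
        if parent[v] is None:                 # v not yet visited
            for u in adj[v]:
                if u in frontier:             # found a parent in the frontier
                    parent[v] = u
                    next_frontier.add(v)
                    break                     # early exit: degree matters here
    return next_frontier

def bfs_bottom_up(adj, source):
    parent = {v: None for v in adj}
    parent[source] = source
    frontier = {source}
    while frontier:
        frontier = bottom_up_step(adj, frontier, parent)
    return parent
```

In a full direction-optimizing BFS, this bottom-up step is used only when the frontier is large; otherwise the classic top-down step is cheaper.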
Evaluation of the Impact of Direct Warm-Water Cooling of the HPC Servers on the Data Center Ecosystem

Over the last 10 years we have witnessed rapid growth in the computational performance of servers used by the scientific community. This trend was especially visible on the HPC scene, where the price per FLOPS decreased while the packing density and power consumption of the servers increased. This, in turn, significantly changed the challenges and costs of maintaining proper environmental conditions. Currently, the operational costs over the lifetime of a computing system, mainly the power bill, overshadow the acquisition costs. In addition, the overhead on the consumed power introduced by the need to cool the systems may be as high as 40%. This is a huge portion of the costs; therefore, optimizations in this area should be beneficial in terms of both economy and efficiency. There are many approaches to optimizing these costs, mostly focusing on air cooling. In contrast, we decided to scrutinize a different approach: using warm water (up to 45 °C inlet temperature) as the cooling medium for a computing cluster, to check whether this way of cooling can yield significant savings while at the same time simplifying the cooling infrastructure, making it more robust and energy efficient. Additionally, in our approach we tried to use variable coolant temperature and flow to take maximum advantage of so-called free cooling, minimizing the power consumption of the server-cooling loop pair.

To validate this hypothesis, PSNC (Poznan Supercomputing and Networking Center) built a customized prototype system consisting of a hybrid CPU and GPU computing cluster, provided by Iceotope, along with a customized, highly manageable and instrumented cooling loop.

In this paper we analyze the results of operating our warm-water liquid-cooled system to determine what the positive and negative consequences for the data center ecosystem are, if any.

Radosław Januszewski, Norbert Meyer, Joanna Nowicka
A Case Study of Energy Aware Scheduling on SuperMUC

In this paper, we analyze the energy-aware scheduling functionality of the IBM LoadLeveler resource management system on SuperMUC, one of the world’s fastest HPC systems. We explain how LoadLeveler predicts execution times and the average power consumption of the system’s workloads at varying CPU frequencies, and compare the predictions to real measurements conducted on various benchmarks. Since the prediction model proves to be accurate for our application workloads, we can analyze the LoadLeveler predictions for a large fraction of the SuperMUC application portfolio. This enables us to define a policy for energy-aware scheduling on SuperMUC, which selects CPU frequencies considering the applications’ power and performance characteristics, thereby providing an optimized trade-off between energy savings and execution time.

Axel Auweter, Arndt Bode, Matthias Brehm, Luigi Brochard, Nicolay Hammer, Herbert Huber, Raj Panda, Francois Thomas, Torsten Wilde
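The frequency-selection trade-off described above can be illustrated with a toy model. The numbers, the energy model, and the 10% slowdown bound below are assumptions for illustration, not LoadLeveler's actual policy.

```python
# Illustrative sketch: given predicted runtime and average power per CPU
# frequency, pick the frequency minimizing energy (runtime * power) while
# bounding the slowdown relative to the fastest predicted runtime.
def select_frequency(predictions, max_slowdown=1.10):
    """predictions: dict freq_GHz -> (runtime_s, avg_power_W)."""
    t_best = min(t for t, _ in predictions.values())
    feasible = {f: (t, p) for f, (t, p) in predictions.items()
                if t <= max_slowdown * t_best}
    # Energy-to-solution is runtime times average power.
    return min(feasible, key=lambda f: feasible[f][0] * feasible[f][1])

# Example (made-up values): a memory-bound job barely slows down at a
# lower frequency, so the lower frequency wins on energy.
preds = {2.7: (100.0, 300.0), 2.3: (104.0, 240.0), 1.8: (125.0, 190.0)}
```

With these made-up numbers, 1.8 GHz is rejected by the slowdown bound and 2.3 GHz gives the lowest energy among the remaining frequencies.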
Exploiting SIMD and Thread-Level Parallelism in Multiblock CFD

This paper presents the on-node performance tuning of a multi-block Euler solver for turbomachinery computations.

Our work focuses on vertical and horizontal scaling within an x86 multi-socket compute node by exploiting the fine-grained parallelism available through SIMD instructions at the core level and thread-level parallelism across the die through shared memory. We report on the challenges encountered in enabling efficient vectorization using both compiler directives and intrinsics, with an emphasis on data structure transformations and their performance impact on vector computations.

Finally, we present the solver performance on different grid sizes running on Intel Sandy Bridge and Ivy Bridge processors.

Ioan Hadade, Luca di Mare
The Performance Characterization of the RSC PetaStream Module

The RSC PetaStream architecture is a massively parallel computer design based on Intel® Xeon® Phi manycore co-processors. Each RSC PetaStream module contains eight Intel Xeon Phi co-processors with a PCI Express fabric and an InfiniBand interconnect for inter-module communication. This paper concentrates on the performance of a single RSC PetaStream module, evaluated with the help of low-level (point-to-point MPI), library (linear algebra, MAGMA) and application-level (classical molecular dynamics, GROMACS and LAMMPS codes) tests. A dual-socket system with top-bin Intel Xeon E5-2690 CPUs was used for comparison. This early evaluation demonstrates that, in general, each Xeon Phi co-processor of RSC PetaStream delivers approximately the same performance as a dual-socket Intel Xeon E5 system, with only half the energy-to-solution. The fine-grain parallelism of the Intel Xeon Phi cores benefits from higher message exchange rates at the MPI level for communication between threads placed on different Xeon Phi chips.

Andrey Semin, Egor Druzhinin, Vladimir Mironov, Alexey Shmelev, Alexander Moskovsky
Deploying Darter - A Cray XC30 System

The University of Tennessee, Knoxville acquired a Cray XC30 supercomputer, called Darter, with a peak performance of 248.9 Teraflops. Darter was deployed in late March of 2013 with a very aggressive production timeline - the system was deployed, accepted, and placed into production in only 2 weeks. The Spring Experiment for the Center for Analysis and Prediction of Storms (CAPS) largely drove the accelerated timeline, as the experiment was scheduled to start in mid-April. The Consortium for Advanced Simulation of Light Water Reactors (CASL) project also needed access and was able to meet their tight deadlines on the newly acquired XC30. Darter’s accelerated deployment and operations schedule resulted in substantial scientific impacts within the research community as well as immediate real-world impacts such as early severe tornado warnings [1].

Mark R. Fahey, Reuben Budiardja, Lonnie Crosby, Stephen McNally
Cyme: A Library Maximizing SIMD Computation on User-Defined Containers

This paper presents Cyme, a C++ library aiming at abstracting the usage of SIMD instructions while maximizing the utilization of the underlying hardware. Unlike similar efforts such as Boost.SIMD or Vc, Cyme provides generic high-level containers which hide SIMD complexity from the user. Cyme accomplishes this by 1) optimizing the abstract syntax tree using expression-template programming to prevent temporary copies and maximize the use of fused multiply-add instructions, and 2) creating a data layout in memory (AoS or AoSoA) which minimizes data addressing and manipulation across all SIMD registers. The Cyme library has been implemented on the IBM Blue Gene/Q architecture using the 256-bit SIMD extensions (QPX) of the Power A2 processor. The functionality of the library is demonstrated on a computationally intensive kernel of a neuroscientific application, where an increase in GFlop/s performance by a factor of 6.72 over the original implementation is observed using the Clang compiler.

Timothée Ewart, Fabien Delalondre, Felix Schürmann
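The expression-template technique mentioned in point 1) can be illustrated with a Python analogue (Cyme itself relies on C++ templates; the classes below are illustrative, not Cyme's API): arithmetic operators build a lightweight expression tree, and the tree is evaluated element by element on assignment, so no intermediate temporary vectors are materialized and a multiply followed by an add can be fused per element.

```python
# Lazy expression evaluation, mimicking the expression-template idea.
class Expr:
    def __add__(self, other):
        return Add(self, other)
    def __mul__(self, other):
        return Mul(self, other)

class Vec(Expr):
    def __init__(self, data):
        self.data = list(data)
    def __getitem__(self, i):
        return self.data[i]
    def assign(self, expr, n):
        # Single loop over elements: one fused evaluation per index,
        # analogous to mapping a*b+c onto a fused multiply-add.
        self.data = [expr[i] for i in range(n)]

class Add(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def __getitem__(self, i): return self.l[i] + self.r[i]

class Mul(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def __getitem__(self, i): return self.l[i] * self.r[i]

a, b, c = Vec([1, 2]), Vec([3, 4]), Vec([5, 6])
out = Vec([0, 0])
out.assign(a * b + c, 2)   # no temporary vector is created for a*b
```

In C++, the same structure is resolved at compile time, so the per-element evaluation compiles down to straight-line SIMD code rather than dynamic dispatch.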
A Compiler-Assisted OpenMP Migration Method Based on Automatic Parallelizing Information

The performance of a serial code often relies on a compiler’s capabilities for automatic parallelization. In such a case, the performance is not portable to a new system, because the compiler on the new system may be unable to effectively parallelize the code originally developed assuming a particular target compiler. As the compiler messages from the target compiler are still useful to identify key kernels that should be optimized even for the different system, this paper proposes a method to migrate a serial code to the OpenMP programming model by using such compiler messages. The aim of the proposed method is to improve performance portability across different systems and compilers. Experimental results indicate that the migrated OpenMP code can achieve comparable or even better performance than the original code with automatic parallelization.

Kazuhiko Komatsu, Ryusuke Egawa, Hiroyuki Takizawa, Hiroaki Kobayashi
A Type-Oriented Graph500 Benchmark

Data-intensive workloads have become a popular use of HPC in recent years, and the question of how data scientists, who might not be HPC experts, can effectively program these machines is important to address. Whilst using models such as Partitioned Global Address Space (PGAS) is attractive from a simplicity point of view, the abstractions that these impose upon the programmer can impact performance. We propose an approach, type-oriented programming, where all aspects of parallelism are encoded via types and the type system, which allows the programmer to write simple PGAS data-intensive HPC codes and then, if they so wish, tune the fundamental aspects by modifying type information. This paper considers the suitability of using type-oriented programming, with the PGAS memory model, in data-intensive workloads. We compare a type-oriented implementation of the Graph500 benchmark against the MPI reference implementations both in terms of programmability and performance, and evaluate how orienting parallel codes around types can assist in the data-intensive HPC field.

Nick Brown
A Dynamic Execution Model Applied to Distributed Collision Detection

The end of Dennard scaling and the looming Exascale challenges of efficiency, reliability, and scalability are driving a shift in programming methodologies away from conventional practices towards dynamic runtimes and asynchronous, data driven execution. Since Exascale machines are not yet available, however, experimental runtime systems and application co-design can expose application-specific overhead and scalability concerns at extreme scale, while also investigating the execution model defined by the runtime system itself. Such results may also contribute to the development of effective Exascale hardware.

This work presents a case study evaluating a dynamic, Exascale-inspired execution model and its associated experimental runtime system consisting of lightweight concurrent threads with dynamic management in the context of a global address space examining the problem of mesh collision detection. This type of problem constitutes an essential component of many CAD systems and industrial crash applications. The core of the algorithm depends upon determining if two triangles intersect in three dimensions for large meshes. The resulting collision detection algorithm exploits distributed memory to enable extremely large mesh simulation and is shown to be scalable thereby lending support to the execution strategy.

Matthew Anderson, Maciej Brodowicz, Luke Dalessandro, Jackson DeBuhr, Thomas Sterling
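The exact triangle-triangle test the abstract refers to is typically preceded by a cheap broad-phase filter. The following sketch shows the standard axis-aligned bounding-box prefilter (an illustrative building block, not the paper's distributed runtime or exact intersection test): two triangles can intersect only if their bounding boxes overlap on all three axes.

```python
# Broad-phase collision check: compare axis-aligned bounding boxes (AABBs)
# before running any expensive exact triangle-triangle intersection test.
def aabb(tri):
    """tri: three (x, y, z) vertices -> (min_corner, max_corner)."""
    return ([min(v[k] for v in tri) for k in range(3)],
            [max(v[k] for v in tri) for k in range(3)])

def boxes_overlap(tri_a, tri_b):
    (lo_a, hi_a), (lo_b, hi_b) = aabb(tri_a), aabb(tri_b)
    # Boxes overlap iff their intervals overlap on every axis.
    return all(lo_a[k] <= hi_b[k] and lo_b[k] <= hi_a[k] for k in range(3))

t1 = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
t2 = [(0.5, 0.5, -0.5), (0.5, 0.5, 0.5), (1, 1, 0)]   # near t1
t3 = [(5, 5, 5), (6, 5, 5), (5, 6, 5)]                 # far from t1
```

Filtering pairs this way keeps the exact three-dimensional intersection test, which dominates the cost, restricted to candidate pairs whose boxes actually overlap.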
Implementation and Optimization of Three-Dimensional UPML-FDTD Algorithm on GPU Clusters

Co-processors with powerful floating-point operation capability have been used to study electromagnetic simulations using the Finite Difference Time Domain (FDTD) method. This work focuses on the implementation and optimization of a 3D UPML-FDTD parallel algorithm on GPU clusters. A set of techniques is utilized to optimize the FDTD algorithm, such as the use of GPU texture memory and asynchronous data transfer between CPU and GPU. The performance of the parallel FDTD algorithm is tested on K20m GPU clusters. The scalability of the algorithm is tested for up to 80 NVIDIA Tesla K20m GPUs with a parallel efficiency of up to 95%, and the optimization techniques explored in this study are found to improve the performance.

Lei Xu, Ying Xu
Real-Time Olivary Neuron Simulations on Dataflow Computing Machines

The Inferior-Olivary nucleus (ION) is a well-charted brain region, heavily associated with the sensorimotor control of the body. It comprises neural cells with unique properties which facilitate sensory processing and motor-learning skills. Simulations of such neurons become rapidly intractable when biophysically plausible models and meaningful network sizes (at least in the order of some hundreds of cells) are modeled. To overcome this problem, we accelerate a highly detailed ION network model using a Maxeler Dataflow Computing Machine. The design simulates a 330-cell network at real-time speed and achieves maximum throughputs of 24.7 GFLOPS. The Maxeler machine, integrating a Virtex-6 FPGA, yields speedups of ×92-102 and ×2-8 compared to a reference C implementation, running on an Intel Xeon at 2.66 GHz, and a pure Virtex-7 FPGA implementation, respectively.

Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis, Cătălin Ciobanu, Oskar Mencer, Chris I. De Zeeuw
Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect

The Tofu Interconnect 2 (Tofu2) is a system interconnect designed for Fujitsu’s next-generation successor to the PRIMEHPC FX10 supercomputer. Tofu2 inherits the 6-dimensional mesh/torus network topology from its predecessor and increases the link throughput by two and a half times. It is integrated into a newly developed SPARC64™ processor chip and takes advantage of system-on-chip implementation by removing the off-chip I/O between a processor chip and an interconnect controller. Tofu2 also introduces new features such as atomic read-modify-write communication functions, a session-mode control queue for the offloading of collective communications, and a harmless cache-injection technique to reduce communication latency.

Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shunji Uno, Shinji Sumimoto, Kenichi Miura, Naoyuki Shida, Takahiro Kawashima, Takayuki Okamoto, Osamu Moriyama, Yoshiro Ikeda, Takekazu Tabata, Takahide Yoshikawa, Ken Seki, Toshiyuki Shimizu
Backmatter
Metadata
Title: Supercomputing
Edited by: Julian Martin Kunkel, Thomas Ludwig, Hans Werner Meuer
Copyright Year: 2014
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-07518-1
Print ISBN: 978-3-319-07517-4
DOI: https://doi.org/10.1007/978-3-319-07518-1
