2011 | Book

Recent Advances in the Message Passing Interface

18th European MPI Users’ Group Meeting, EuroMPI 2011, Santorini, Greece, September 18-21, 2011. Proceedings

Editors: Yiannis Cotronis, Anthony Danalis, Dimitrios S. Nikolopoulos, Jack Dongarra

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 18th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface, EuroMPI 2011, held in Santorini, Greece, in September 2011. The 28 revised full papers presented together with 10 posters were carefully reviewed and selected from 66 submissions. Topics covered are communication; I/O; networking, implementation issues and improvements; algorithms and tools; interaction with hardware; applications and performance evaluation; and fault tolerance.

Table of Contents

Frontmatter
Experience of a PRACE Center

High performance computing continues to reach higher levels of performance and will do so over the next 10 years. However, the costs of operating such large scale systems are also growing. This is mainly due to the fact that the power consumption has increased over time and keeps growing. While a top-10 system 15 years ago was in the range of 100 kW, the most recent list shows systems in the range of 1–2 MW and more. This factor of 10 makes costs for power and cooling a significant issue. The second financial issue is the increase in investment cost for such systems.

Michael Resch
Achieving Exascale Computing through Hardware/Software Co-design

Several recent studies discuss potential Exascale architectures, identify key technical challenges and describe research that is beginning to address several of these challenges [1,2]. Co-design is a key element of the U.S. Department of Energy’s strategy to achieve Exascale computing [3]. Architectures research is needed but will not, by itself, meet the energy, memory, parallelism, locality and resilience hurdles facing the HPC community – system software and algorithmic innovation is needed as well. Since both architectures and software are expected to evolve significantly there is a potential to use the co-design methodology that has been developed by the embedded computing community. A new co-design methodology for high performance computing is needed.

Sudip Dosanjh, Richard Barrett, Mike Heroux, Arun Rodrigues
Will MPI Remain Relevant?

While the Message Passing Interface (MPI) is the ubiquitous standard used by parallel applications to satisfy their data movement and process control needs, the overarching domination of MPI is no longer a certainty. Drastic changes in computer architecture over the last several years, together with a decrease in overall reliability due to an increase in fault probability, will have a lasting effect on how we write portable and efficient parallel applications for future environments. The significant increase in the number of cores, memory hierarchies, relative cost of data transfers, and delocalization of computations to additional hardware, requires programming paradigms that are more dynamic and more portable than what we use today. The MPI 3.0 effort addresses some of these challenges, but other contenders with latent potential exist, quickly flattening the gaps between performance and portability.

George Bosilca
Exascale Algorithms for Generalized MPI_Comm_split

In the quest to build exascale supercomputers, designers are increasing the number of hierarchical levels that exist among system components. Software developed for these systems must account for the various hierarchies to achieve maximum efficiency. The first step in this work is to identify groups of processes that share common resources. We develop, analyze, and test several algorithms that can split millions of processes into groups based on arbitrary, user-defined data. We find that bitonic sort and our new hash-based algorithm best suit the task.
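
A minimal illustration of the operation the paper generalizes, assuming the grouping key is the host name: each process hashes its key into a color and calls MPI_Comm_split. The FNV hash and the node-grouping use case are illustrative only, and a plain hash can collide and silently merge unrelated groups, which is one of the issues the paper's sort-based and hash-based algorithms must handle at scale.

    #include <mpi.h>
    #include <unistd.h>   // gethostname
    #include <cstdio>

    // FNV-1a hash of a byte string; used only to derive an integer "color".
    static unsigned hash_key(const char *s) {
        unsigned h = 2166136261u;
        for (; *s; ++s) { h ^= (unsigned char)*s; h *= 16777619u; }
        return h;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Arbitrary, user-defined grouping key: here the host name, so that
        // processes sharing a node end up in the same subcommunicator.
        char host[256];
        gethostname(host, sizeof(host));

        // Derive a non-negative color from the key and split.  A plain hash
        // can collide and merge unrelated groups, which a robust exascale
        // algorithm has to detect and resolve.
        int color = (int)(hash_key(host) & 0x7fffffff);
        MPI_Comm node_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);
        printf("world %d -> %s: local %d of %d\n",
               world_rank, host, node_rank, node_size);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }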

Adam Moody, Dong H. Ahn, Bronis R. de Supinski
Order Preserving Event Aggregation in TBONs

Runtime tools for MPI applications must gather information from all processes to a tool front-end for presentation. Scalability requires that tools aggregate and reduce this information, so tool developers often use a Tree Based Overlay Network (TBON). TBONs aggregate multiple associated events through a hierarchical communication structure. We present a novel algorithm to execute multiple aggregations while, at the same time, preserving relevant event orders. We implement this algorithm in our tool infrastructure that provides TBON functionality as one of its services. We demonstrate that our approach provides scalability with experiments for up to 2048 tasks.

Tobias Hilbrich, Matthias S. Müller, Martin Schulz, Bronis R. de Supinski
Using MPI Derived Datatypes in Numerical Libraries

By way of example, this paper examines the potential of MPI user-defined datatypes for distributed data structure manipulation in numerical libraries. The three examples, namely gather/scatter of column-wise distributed two-dimensional matrices, matrix transposition, and redistribution of doubly cyclically distributed matrices as used in the Elemental dense matrix library, show that distributed data structures can be conveniently expressed with the derived datatype mechanisms of MPI, yielding at the same time worthwhile performance advantages over straightforward, handwritten implementations. Experiments have been performed on different systems with the MPICH2 and Open MPI library implementations. We report results for a SunFire X4100 system with the mvapich2 library. We point out cases where the current MPI collective interfaces do not provide sufficient functionality.
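
As a rough sketch of the mechanism the paper exploits (not the paper's code), the following describes one column of a row-major matrix with MPI_Type_vector so it can be communicated without a manual packing loop; the helper name and layout are assumptions for illustration.

    #include <mpi.h>

    // Sketch: send one column of a row-major rows x cols matrix of doubles
    // without copying it into a contiguous buffer first.
    void send_column(const double *matrix, int rows, int cols,
                     int col, int dest, MPI_Comm comm) {
        MPI_Datatype column;
        // 'rows' blocks of 1 double, consecutive blocks 'cols' doubles apart.
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        // The MPI library walks the strided layout itself; no manual pack loop.
        MPI_Send(matrix + col, 1, column, dest, /*tag=*/0, comm);

        MPI_Type_free(&column);
    }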

Enes Bajrović, Jesper Larsson Träff
Improving MPI Applications Performance on Multicore Clusters with Rank Reordering

Modern hardware architectures featuring multicores and a complex memory hierarchy raise challenges that need to be addressed by parallel applications programmers. It is therefore tempting to adapt an application communication pattern to the characteristics of the underlying hardware. The MPI standard features several functions that allow the ranks of MPI processes to be reordered according to a graph attached to a newly created communicator. In this paper, we explain how the MPICH2 implementation of the MPI_Dist_graph_create function was modified to reorder the MPI process ranks to create a match between the application communication pattern and the hardware topology. The experimental results on a multicore cluster show that improvements can be achieved as long as the application communication pattern is expressed by a relevant metric.
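
A hedged sketch of how an application might hand its communication pattern to MPI_Dist_graph_create with reordering enabled; the wrapper name and the single-source edge description are illustrative, and the weights argument would be MPI_UNWEIGHTED if no volume estimates are available.

    #include <mpi.h>

    // Sketch: let the MPI library reorder ranks so that heavily communicating
    // neighbours are placed close together in the hardware hierarchy.
    MPI_Comm reorder_by_pattern(MPI_Comm comm, int nneighbors,
                                const int *neighbors, const int *weights) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        // This process contributes the edges rank -> neighbors[i] with the
        // given weights (e.g. measured or predicted message volumes).
        int sources[1] = { rank };
        int degrees[1] = { nneighbors };

        MPI_Comm graph_comm;
        MPI_Dist_graph_create(comm, 1, sources, degrees,
                              neighbors, weights, MPI_INFO_NULL,
                              /*reorder=*/1, &graph_comm);
        // Ranks in graph_comm may differ from ranks in comm: data must be
        // redistributed (or placed) according to the new numbering.
        return graph_comm;
    }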

Guillaume Mercier, Emmanuel Jeannot
Multi-core and Network Aware MPI Topology Functions

The MPI standard offers a set of topology-aware interfaces that can be used to construct graph and Cartesian topologies for MPI applications. These interfaces have mostly been used for topology construction and not for performance improvement. To optimize performance, in this paper we use graph embedding and node/network architecture discovery modules to match the communication topology of the applications to the physical topology of multi-core clusters with multi-level networks. Micro-benchmark results show considerable improvement in communication performance when using weighted and network-aware mapping. We also show that the implementation can improve the communication and execution time of applications.

Mohammad Javad Rashti, Jonathan Green, Pavan Balaji, Ahmad Afsahi, William Gropp
Scalable Node Allocation for Improved Performance in Regular and Anisotropic 3D Torus Supercomputers

MPI application performance can vary based on the scheduler's placement of ranks, whether between nodes or on cores of the same multi-core chip. MPI applications are, by default, at the mercy of the placement software's decision that assigns nodes to a job. We describe herein the general approach of node ordering for allocation in a 3D torus and how it improves MPI application performance, even in the face of an anisotropic interconnect. We demonstrate, quantitatively, that our topologically-based ordering results in improved performance for several MPI applications running on a Top10 supercomputer.

Carl Albing, Norm Troullier, Stephen Whalen, Ryan Olson, Joe Glenski, Howard Pritchard, Hugo Mills
Improving the Average Response Time in Collective I/O

In collective I/O, MPI processes exchange requests so that the rearranged requests can result in the shortest file system access time. Scheduling the exchange sequence determines the response time of the participating processes. Existing implementations that simply follow the increasing order of file offsets do not necessarily produce the best performance. To minimize the average response time, we propose three scheduling algorithms that consider the number of processes per file stripe and the number of accesses per process. Our experimental results demonstrate improvements of up to 50% in the average response time using two synthetic benchmarks and a high-resolution climate application.

Chen Jin, Saba Sehrish, Wei-keng Liao, Alok Choudhary, Karen Schuchardt
OMPIO: A Modular Software Architecture for MPI I/O

As of today, I/O is probably the most limiting factor for large-scale parallel applications on high-end machines. This paper introduces OMPIO, a new parallel I/O architecture for Open MPI. OMPIO provides a highly modular approach to parallel I/O by separating I/O functionality into smaller units (frameworks) and an arbitrary number of modules in each framework. Furthermore, each framework has customized selection criteria that determine which module to use, depending on the functionality of the framework as well as external parameters.

Mohamad Chaarawi, Edgar Gabriel, Rainer Keller, Richard L. Graham, George Bosilca, Jack J. Dongarra
Design and Evaluation of Nonblocking Collective I/O Operations

Nonblocking operations have successfully been used to hide network latencies in large scale parallel applications. This paper presents the challenges associated with developing nonblocking collective I/O operations, in order to help hiding the costs of I/O operations. We also present an implementation based on the libNBC library, and evaluate the benefits of nonblocking collective I/O over a PVFS2 file system for a micro-benchmark and a parallel image processing application. Our results indicate the potential benefit of our approach, but also highlight the challenges to achieve appropriate overlap between I/O and compute operations.

Vishwanath Venkatesan, Mohamad Chaarawi, Edgar Gabriel, Torsten Hoefler
Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters Using Shared Memory Backed Windows

The Message Passing Interface (MPI) has been very popular for programming parallel scientific applications. As multi-core architectures have become prevalent, a major question that has emerged is about the use of MPI within a compute node and its impact on communication costs. The one-sided communication interface in MPI provides a mechanism to reduce communication costs by removing the matching requirements of the send/receive model. The MPI standard provides the flexibility to allocate memory windows backed by shared memory. However, state-of-the-art open-source MPI libraries do not leverage this optimization opportunity for commodity clusters. In this paper, we present a design and implementation of an intra-node MPI one-sided interface using shared memory backed windows on multi-core clusters. We use the MVAPICH2 MPI library for design, implementation and evaluation. Micro-benchmark evaluation shows that the new design can bring up to 85% improvement in Put, Get and Accumulate latencies with passive synchronization mode. The bandwidth performance of Put and Get improves by 64% and 42%, respectively. The Splash LU benchmark shows an improvement of up to 55% with the new design on a 32-core Magny-Cours node. It shows similar improvement on a 12-core Westmere node. The mean BFS time in Graph500 reduces by 39% and 77% on the Magny-Cours and Westmere nodes, respectively.
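
For orientation, a minimal passive-target Put in the style of the micro-benchmarks mentioned above, with the window memory obtained from MPI_Alloc_mem so the implementation is free to back it with shared memory. This is a generic sketch, not the MVAPICH2-internal design, and it assumes at least two processes.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1024;
        double *buf;
        MPI_Alloc_mem(N * sizeof(double), MPI_INFO_NULL, &buf);

        // Every rank exposes its buffer in a window; the implementation is
        // free to place such memory in a shared-memory segment for
        // intra-node peers.
        MPI_Win win;
        MPI_Win_create(buf, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (size > 1 && rank == 0) {
            double payload[1024];
            for (int i = 0; i < N; ++i) payload[i] = (double)i;
            // Passive-target synchronization: the target (rank 1) makes no
            // call; the origin locks, puts, and unlocks.
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
            MPI_Put(payload, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
            MPI_Win_unlock(1, win);
        }

        MPI_Win_free(&win);
        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }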

Sreeram Potluri, Hao Wang, Vijay Dhanraj, Sayantan Sur, Dhabaleswar K. Panda
A uGNI-Based MPICH2 Nemesis Network Module for the Cray XE

Recent versions of MPICH2 have featured Nemesis – a scalable, high-performance, multi-network communication subsystem. Nemesis provides a framework for developing Network Modules (Netmods) for interfacing the Nemesis subsystem to various high speed network protocols. Cray has developed a user-level Generic Network Interface (uGNI) for interfacing MPI implementations to the internal high speed network of Cray XE and follow-on computer systems. This paper describes the design of a uGNI Netmod for the MPICH2 nemesis subsystem. MPICH2 performance data on the Cray XE are presented.

Howard Pritchard, Igor Gorodetsky, Darius Buntinas
Using Triggered Operations to Offload Rendezvous Messages

Historically, MPI implementations have had to choose between eager messaging protocols that require buffering and rendezvous protocols that sacrifice overlap and strong independent progress in some scenarios. The typical choice is to use an eager protocol for short messages and switch to a rendezvous protocol for long messages. If overlap and progress are desired, some implementations offer the option of using a thread. We propose an approach that leverages triggered operations to implement a long message rendezvous protocol that provides strong progress guarantees. The results indicate that a triggered operation based rendezvous can achieve better overlap than a traditional rendezvous implementation and less wasted bandwidth than an eager long protocol.

Brian W. Barrett, Ron Brightwell, K. Scott Hemmert, Kyle B. Wheeler, Keith D. Underwood
pupyMPI - MPI Implemented in Pure Python

As distributed memory systems have become common, the de facto standard for communication is still the Message Passing Interface (MPI). pupyMPI is a pure Python implementation of a broad subset of the MPI 1.3 specifications that allows Python programmers to utilize multiple CPUs with datatypes and memory handled transparently. pupyMPI also implements a few non-standard extensions such as non-blocking collectives and the option of suspending, migrating and resuming the distributed computation of a pupyMPI program. This paper introduces pupyMPI and presents benchmarks against C implementations of MPI, which show acceptable performance.

Rune Bromer, Frederik Hantho, Brian Vinter
Scalable Memory Use in MPI: A Case Study with MPICH2

One of the factors that can limit the scalability of MPI to exascale is the amount of memory consumed by the MPI implementation. In fact, some researchers believe that existing MPI implementations, if used unchanged, will themselves consume a large fraction of the available system memory at exascale. To investigate and address this issue, we undertook a study of the memory consumed by the MPICH2 implementation of MPI, with a focus on identifying parts of the code where the memory consumed per process scales linearly with the total number of processes. We report on the findings of this study and discuss ways to avoid the linear growth in memory consumption. We also describe specific optimizations that we implemented in MPICH2 to avoid this linear growth and present experimental results demonstrating the memory savings achieved and the impact on performance.

David Goodell, William Gropp, Xin Zhao, Rajeev Thakur
Performance Expectations and Guidelines for MPI Derived Datatypes

MPI’s derived datatypes provide a powerful mechanism for concisely describing arbitrary, noncontiguous layouts of user data for use in MPI communication. This paper formulates self-consistent performance guidelines for derived datatypes. Such guidelines make performance expectations for derived datatypes explicit and suggest relevant optimizations to MPI implementers. We also identify self-consistent guidelines that are too strict to enforce, because they entail NP-hard optimization problems. Enforced self-consistent guidelines assure the user that certain manual datatype optimizations cannot lead to performance improvements, which in turn contributes to performance portability between MPI implementations that behave in accordance with the guidelines. We present results of tests with several MPI implementations, which indicate that many of them violate the guidelines.
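
A minimal sketch of one guideline of this kind, with purely illustrative helper functions: communicating the layout directly through a derived datatype, as in (a), should not be slower than the manual pack-and-send in (b), since an implementation could always fall back to packing internally.

    #include <mpi.h>
    #include <vector>

    // (a) Communicate a strided layout directly through a derived datatype.
    void send_with_datatype(const double *data, int n, int stride,
                            int dest, MPI_Comm comm) {
        MPI_Datatype vec;
        MPI_Type_vector(n, 1, stride, MPI_DOUBLE, &vec);
        MPI_Type_commit(&vec);
        MPI_Send(data, 1, vec, dest, 0, comm);
        MPI_Type_free(&vec);
    }

    // (b) Pack the same elements by hand and send a contiguous buffer.
    void send_packed(const double *data, int n, int stride,
                     int dest, MPI_Comm comm) {
        std::vector<double> packed(n);
        for (int i = 0; i < n; ++i) packed[i] = data[i * stride];
        MPI_Send(packed.data(), n, MPI_DOUBLE, dest, 0, comm);
    }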

William Gropp, Torsten Hoefler, Rajeev Thakur, Jesper Larsson Träff
The Analysis of Cluster Interconnect with the Network_Tests2 Toolkit

The article discusses MPI-2-based tools for benchmarking and extracting information about interconnect features of HPC clusters. The authors have developed a toolkit named “network_tests2”. This toolkit reveals hidden cluster topology, highlights the so-called “jump points” in latency during message transfer, allows the user to search for defective cluster nodes, and so on. The toolkit consists of several programs. The first one is an MPI program that performs message transfers in several modes to provide certain communication activity or benchmarking of a chosen MPI function, and collects statistics. The output of this program is a set of communication matrices which are stored as a NetCDF file. The toolkit includes programs that perform data clustering and provide a GUI for visualisation and comparison of results obtained from different clusters. This article touches on some results obtained from Russian supercomputers such as the Lomonosov T500 system. We also present data on Mellanox InfiniBand and Blue Gene/P interconnect technologies.

Alexey Salnikov, Dmitry Andreev, Roman Lebedev
Parallel Sorting with Minimal Data

For reasons of efficiency, parallel methods are normally used to work with as many elements as possible. Contrary to this preferred situation, some applications need the opposite. This paper presents three parallel sorting algorithms suited for the extreme case where every process contributes only a single element. Scalable solutions for this case are needed for the communicator constructor MPI_Comm_split. Compared to previous approaches requiring O(p) memory, we introduce two new parallel sorting algorithms working with a minimum of O(1) memory. One method is simple to implement and achieves a running time of O(p). Our scalable algorithm solves this sorting problem in O(log² p) time.
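
For contrast, a sketch of the conventional O(p)-memory baseline the abstract refers to, in which every process gathers all keys, sorts them locally, and locates its own position; the helper name and tie-breaking rule are assumptions for illustration.

    #include <mpi.h>
    #include <algorithm>
    #include <utility>
    #include <vector>

    // Baseline with O(p) memory per process: gather every key, sort the copy
    // locally, and find the position of my own key.  The paper replaces this
    // with algorithms that need only O(1) extra memory.
    int rank_of_my_key(long my_key, MPI_Comm comm) {
        int p, r;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &r);

        std::vector<long> keys(p);
        MPI_Allgather(&my_key, 1, MPI_LONG, keys.data(), 1, MPI_LONG, comm);

        // Break ties by original rank so equal keys get distinct positions.
        std::vector<std::pair<long, int>> tagged(p);
        for (int i = 0; i < p; ++i) tagged[i] = { keys[i], i };
        std::sort(tagged.begin(), tagged.end());

        for (int i = 0; i < p; ++i)
            if (tagged[i].second == r) return i;   // my sorted position
        return -1;                                 // unreachable
    }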

Christian Siebert, Felix Wolf
Scaling Performance Tool MPI Communicator Management

The Scalasca toolset has successfully demonstrated measurement and analysis scalability on the largest computer systems; however, applications have growing complexity and increasing demands on performance tools. One such application is the PFLOTRAN code for simulating multiphase subsurface flow and reactive transport. While PFLOTRAN itself and Scalasca runtime summarization both scale well, MPI communicator management becomes critical for trace collection with tens of thousands of processes. Re-design and re-engineering of key components of the Scalasca measurement system are presented, which encompass the representation of communicators, communicator definition tracking and unification, and translation of ranks recorded in event traces.

Markus Geimer, Marc-André Hermanns, Christian Siebert, Felix Wolf, Brian J. N. Wylie
Per-call Energy Saving Strategies in All-to-All Communications

With the increase in the peak performance of modern computing platforms, their energy consumption grows as well, which may lead to overwhelming operating costs and failure rates. Techniques such as Dynamic Voltage and Frequency Scaling (DVFS) and CPU clock modulation (throttling) are often used to reduce the power consumption of the compute nodes. However, these techniques should be used judiciously during application execution to avoid significant performance losses. In this work, two implementations of the per-call collective operations are studied as to their augmentation with energy saving strategies on a per-call basis. Experiments were performed on the OSU MPI benchmarks as well as on a few real-world problems from the CPMD and NAS suites, in which energy consumption was reduced by up to 10% and 15.7%, respectively, with little performance degradation.

Vaibhav Sundriyal, Masha Sosonkina
Data Redistribution Using One-sided Transfers to In-Memory HDF5 Files

Outputs of simulation codes making use of the HDF5 file format are usually composed of several different attributes and datasets, storing either lightweight pieces of information or heavy blocks of data. These objects, when written or read through the HDF5 layer, create metadata and data I/O operations of different block sizes, which depend on the precision and dimension of the arrays being manipulated. By making use of simple block redistribution strategies, we present in this paper a case study showing HDF5 I/O performance improvements for “in-memory” files stored in a distributed shared memory buffer using one-sided communications through the HDF5 API.

Jerome Soumagne, John Biddiscombe, Aurélien Esnard
RCKMPI – Lightweight MPI Implementation for Intel’s Single-chip Cloud Computer (SCC)

The Single-chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It is a distributed memory architecture that provides shared memory possibilities and an on-die Message Passing Buffer (MPB). This paper presents an MPI implementation (RCKMPI) that uses an efficient mix of MPB and DDR3 shared memory for low-level communication. The on-die buffer found in the SCC provides higher bandwidth and lower latency than the available shared memory. In spite of this, message passing can be faster through DDR3, due to protocol overheads related to the small size of the MPB and the necessity to split and reassemble large messages, together with the possibility that the data is not available in the cache. These overheads take over beyond certain message sizes, requiring runtime decisions about which type of buffer to use in order to achieve higher performance. In the current implementation, the decision is based on the number of bytes remaining to transfer from in-transit packets. MPI benchmarks demonstrate that the use of both types of buffers results in equal or lower transmission times than communicating through the on-die buffer alone.

Isaías A. Comprés Ureña, Michael Riepen, Michael Konow
Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

A hybrid OpenMP-MPI code has been developed and optimized for Blue Gene/P in order to perform a direct numerical simulation of a zero-pressure-gradient turbulent boundary layer at high Reynolds numbers. OpenMP is becoming the standard application programming interface for shared memory platforms, offering simplicity and portability. For architectures with limited memory, such as Blue Gene/P, the use of OpenMP is especially well suited. MPI communication overhead is also reduced due to the decreasing number of processes involved. Two boundary layers are run simultaneously due to physical considerations, represented by two different MPI groups. Different node mapping layouts have been investigated, reducing communication times by a factor of two. The present hybrid code shows approximately linear weak scaling up to 32k cores.

Juan Sillero, Guillem Borrell, Javier Jiménez, Robert D. Moser
CAF versus MPI - Applicability of Coarray Fortran to a Flow Solver

We investigate how to use coarrays in Fortran (CAF) for parallelizing a flow solver and the capabilities of current compilers with coarray support. Usability and performance of CAF in mesh-based applications is examined and compared to traditional MPI strategies. We analyze the influence of the memory layout, the usage of communication buffers against direct access to the data used in the computation and different methods of the communication itself. Our objective is to provide insights on how common communication patterns have to be formulated when using coarrays.

Manuel Hasert, Harald Klimach, Sabine Roller
The Impact of Injection Bandwidth Performance on Application Scalability

Future exascale systems are expected to have significantly less network bandwidth relative to computational performance than current systems. Clearly, this will impact bandwidth-intensive applications, so it is important to gain insight into the magnitude of the negative impact on performance and scalability to help identify mitigation strategies. In this paper, we show how current systems can be configured to emulate the expected imbalance of future systems. We demonstrate this approach by reducing the network injection bandwidth of a 160-node, 1920-core Cray XT5 system and analyze the performance and scalability of a suite of MPI benchmarks and applications.

Kevin T. Pedretti, Ron Brightwell, Doug Doerfler, K. Scott Hemmert, James H. Laros III
Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW

Collective communication is one of the most powerful message passing concepts, enabling parallel applications to express complex communication patterns while allowing the underlying MPI to provide efficient implementations that minimize the cost of the data movements. However, with the increase in heterogeneity inside the nodes, more specifically the memory hierarchies, harnessing the maximum compute capabilities becomes increasingly difficult. This paper investigates the impact of kernel-assisted MPI communication on two scientific applications: 1) Car-Parrinello molecular dynamics (CPMD), a chemical molecular dynamics application, and 2) FFTW, a discrete Fourier transform (DFT) library. By focusing on the usage of the Message Passing Interface (MPI), we identified the communication characteristics and patterns of each application. Our experiments indicate that the quality of the collective communication implementation on a specific machine plays a critical role in the overall application performance.

Teng Ma, Aurelien Bouteiller, George Bosilca, Jack J. Dongarra
A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not define fault tolerance interfaces and semantics. The MPI Forum’s Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.

Joshua Hursey, Thomas Naughton, Geoffroy Vallee, Richard L. Graham
Fault Tolerance in an Industrial Seismic Processing Application for Multicore Clusters

Seismic processing applications are used to identify geological structures where reservoirs of oil and gas may be found. With oil companies seeking better precision over larger geographical regions, these applications require larger clusters to keep execution times reasonable. The combination of longer run times and clusters with greater numbers of components increases the probability of faults during the execution. To address this issue, this paper describes an application-level fault tolerance mechanism that considers node crashes and communication link failures. For this industrial application, experiments show that continued execution with the remaining resources is both feasible and efficient.

Alexandre Gonçalves, Matheus Bersot, André Bulcão, Cristina Boeres, Lúcia Drummond, Vinod Rebello
libhashckpt: Hash-Based Incremental Checkpointing Using GPU’s

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
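
A CPU-only sketch of the general hash-based incremental checkpointing idea, with an FNV hash and a 4 KiB block size chosen purely for illustration; the paper's contribution additionally uses page protection and offloads the hashing to GPUs.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // FNV-1a hash of one block of application state.
    static uint64_t hash_block(const unsigned char *p, size_t n) {
        uint64_t h = 1469598103934665603ull;
        for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    // Write only the blocks whose hash changed since the previous checkpoint.
    // 'prev' persists across checkpoints; the 4 KiB block size is illustrative.
    void incremental_checkpoint(const unsigned char *state, size_t bytes,
                                std::vector<uint64_t> &prev, FILE *out) {
        const size_t BLOCK = 4096;
        size_t nblocks = (bytes + BLOCK - 1) / BLOCK;
        prev.resize(nblocks, 0);

        for (size_t b = 0; b < nblocks; ++b) {
            size_t off = b * BLOCK;
            size_t len = (off + BLOCK <= bytes) ? BLOCK : bytes - off;
            uint64_t h = hash_block(state + off, len);
            if (h != prev[b]) {              // block is dirty: persist it
                fwrite(&b, sizeof b, 1, out);
                fwrite(state + off, 1, len, out);
                prev[b] = h;
            }
        }
    }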

Kurt B. Ferreira, Rolf Riesen, Ron Brightwell, Patrick Bridges, Dorian Arnold
Noncollective Communicator Creation in MPI

MPI communicators abstract communication operations across application modules, facilitating seamless composition of different libraries. In addition, communicators provide the ability to form groups of processes and establish multiple levels of parallelism. Traditionally, communicators have been collectively created in the context of the parent communicator. The recent thrust toward systems at petascale and beyond has brought forth new application use cases, including fault tolerance and load balancing, that highlight the ability to construct an MPI communicator in the context of its new process group as a key capability. However, it has long been believed that MPI is not capable of allowing the user to form a new communicator in this way. We present a new algorithm that allows the user to create such flexible process groups using only the functionality given in the current MPI standard. We explore performance implications of this technique and demonstrate its utility for load balancing in the context of a Markov chain Monte Carlo computation. In comparison with a traditional collective approach, noncollective communicator creation enables a 30% improvement in execution time through asynchronous load balancing.
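
To illustrate the group-local semantics in question: this capability was later standardized in MPI-3 as MPI_Comm_create_group, which only the members of the new group call. The sketch below uses that interface for clarity; the paper's algorithm constructs such communicators using only functionality available in the MPI standard of the time.

    #include <mpi.h>

    // Sketch: create a communicator over a subset of ranks, with only the
    // members of that subset participating in the creation.
    MPI_Comm create_subset_comm(MPI_Comm parent, const int *members, int n) {
        MPI_Group parent_group, sub_group;
        MPI_Comm_group(parent, &parent_group);
        MPI_Group_incl(parent_group, n, members, &sub_group);

        MPI_Comm sub_comm = MPI_COMM_NULL;
        int my_rank_in_group;
        MPI_Group_rank(sub_group, &my_rank_in_group);
        if (my_rank_in_group != MPI_UNDEFINED) {
            // Only group members make this call; ranks outside the group
            // never participate.
            MPI_Comm_create_group(parent, sub_group, /*tag=*/0, &sub_comm);
        }

        MPI_Group_free(&sub_group);
        MPI_Group_free(&parent_group);
        return sub_comm;
    }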

James Dinan, Sriram Krishnamoorthy, Pavan Balaji, Jeff R. Hammond, Manojkumar Krishnan, Vinod Tipparaju, Abhinav Vishnu
Evaluation of Interpreted Languages with Open MPI

High performance computing (HPC) seems to be one of the last monopolies of low-level languages like C and FORTRAN. The de-facto standard for HPC, the Message Passing Interface (MPI), defines APIs for C, FORTRAN and C++ only. This paper evaluates current alternatives among interpreted languages, specifically Python and C#. MPI library wrappers for both languages are examined and their performance is compared to native (C) Open MPI using two benchmarks. Both languages compare favorably in code and performance effectiveness.

Matti Bickel, Adrian Knoth, Mladen Berekovic
Leveraging C++ Meta-programming Capabilities to Simplify the Message Passing Programming Model

Message passing is the primary programming model utilized for distributed memory systems. Because it aims at performance, the level of abstraction is low, making distributed memory programming often difficult and error-prone. In this paper, we leverage the expressivity and meta-programming capabilities of the C++ language to raise the abstraction level and simplify message passing programming. We redefine the semantics of the assignment operator to work in a distributed memory fashion and leave to the compiler the burden of generating the required communication operations. By enforcing more severe checks at compile-time we are able to statically capture common programming errors without causing runtime overhead.
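
A hypothetical fragment, not the paper's actual library, showing the flavor of the approach: a distributed value type whose overloaded assignment operator generates the required point-to-point transfer, under the SPMD assumption that every rank executes the assignment statement.

    #include <mpi.h>

    // Hypothetical distributed double whose value lives on a single owner rank.
    struct DistDouble {
        int owner;      // rank that stores the value
        double value;   // valid only on the owner
        DistDouble(int o, double v = 0.0) : owner(o), value(v) {}

        // Assignment between distributed values, executed by all ranks (SPMD).
        // The operator hides the point-to-point transfer from the programmer.
        DistDouble& operator=(const DistDouble& src) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            if (src.owner == owner) {              // same rank: plain copy
                if (rank == owner) value = src.value;
            } else if (rank == src.owner) {        // source side sends
                MPI_Send(&src.value, 1, MPI_DOUBLE, owner, 0, MPI_COMM_WORLD);
            } else if (rank == owner) {            // destination side receives
                MPI_Recv(&value, 1, MPI_DOUBLE, src.owner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            return *this;
        }
    };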

Simone Pellegrini, Radu Prodan, Thomas Fahringer
Portable and Scalable MPI Shared File Pointers

While the I/O functions described in the MPI standard included shared file pointer support from the beginning, the performance and portability of these functions have been subpar at best. ROMIO [1], which provides the MPI-IO functionality for most MPI libraries, to this day uses a separate file to manage the shared file pointer. This file provides the shared location that holds the current value of the shared file pointer. Unfortunately, each access to the shared file pointer involves file lock management and updates to the file contents. Furthermore, support for shared file pointers is not universally available because few file systems support native shared file pointers [5] and a few file systems do not support file locks [3].

Application developers rarely use shared file pointers, even though many applications can benefit from this file I/O capability. These applications are typically loosely coupled and rarely exhibit application-wide synchronization. Examples include application tracing toolkits [8,4] and many-task computing applications [10]. Other approaches to the shared file pointer I/O model frequently used by these application classes include file-per-process, file-per-thread, and file-per-rank schemes. While these approaches work relatively well at smaller scales, they fail to scale to leadership-class computing systems because of the intense metadata loads they generate. Recent research identified significant improvements from using shared-file I/O instead of multifile I/O patterns on leadership-class systems [6].

In this paper, we propose integrating shared file pointer support into the I/O forwarding layer commonly found on leadership-class computing systems. I/O forwarding middleware, such as the I/O Forwarding Scalability Layer (IOFSL) [9,2], bridges the compute and I/O subsystems of leadership-class computing systems. This middleware layer captures all file I/O requests generated by applications executing on compute nodes and forwards them to dedicated I/O nodes. These I/O nodes, a common hardware feature of leadership-class computing systems, execute the I/O requests on behalf of the application. The I/O forwarding layer on these systems is best suited to provide and manage shared file pointers because it has access to all application I/O requests and can provide enhanced file I/O capabilities independent of the system and I/O software stack. By embedding this capability into the I/O forwarding layer, application developers can utilize shared file pointers for a variety of file I/O APIs (MPI-IO, POSIX, and ZOIDFS), synchronization levels (collective and independent I/O), and computing systems (IBM Blue Gene and Cray XT systems).

We are adding several features to IOFSL and ROMIO to enable portable MPI-IO shared file pointer access. In prior work, we extended the ZOIDFS API [2] to provide a distributed atomic append capability. Our current work extends and generalizes this capability to provide shared file pointers as defined by the MPI standard. First, we created a per file shared (key,value) storage space. This capability allows users of the API to instantiate an instance of a ZOIDFS file handle and associate file state with the handle (such as the current position of a file pointer). Since a ZOIDFS file handle is a persistent, globally unique identifier linked to a specific file, this does not result in extra state for the client. To limit the amount of state stored within the I/O node and to enable recovery from faults, we are integrating purge policies for the key value store. Example policies include flushing data to other IOFSL servers or persistently storing this data in extended attribute fields of the target file.

In prior work, we implemented a distributed atomic append by essentially implementing a per file, system wide shared file pointer. In our current work, we instead require a shared file pointer per MPI file handle. This is easily implemented by storing the current value of the shared file pointer in a key uniquely derived from the MPI file handle. We modified ROMIO to generate this unique key. When a file is first opened, a sufficiently large, random identifier is generated. This identifier is subsequently used to retrieve or update the current value of the shared file pointer. To avoid collisions, we rely on the fact that the key space provided by IOFSL supports an exclusive create operation. In the unlikely event that the generated identifier already exists for the file, ROMIO simply generates another one.

By providing set, get, and atomic increment operations, the IOFSL server is responsible for shared file pointer synchronization. This precludes the need for explicit file lock management for shared file pointer support. Overall, few modifications to ROMIO were required. Before executing a shared read or write, ROMIO uses the key store to atomically increment and retrieve the shared file pointer. It then subsequently accesses the file using an ordinary independent I/O operation. To simplify fault tolerance, we plan to combine the I/O access and the key update into one operation. ROMIO’s MPI_File_close method removes the shared file pointer key in order to limit the amount of state held by the I/O nodes. For systems such as the Cray XT series, where I/O nodes are shared among multiple jobs, we automatically purge any keys left by applications that failed to clean up the shared file pointer, for example because of unclean application termination. On systems employing a dedicated I/O node, no cleanup is necessary, since the I/O node (and the IOFSL server) is restarted between jobs.
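
The ROMIO-side flow just described can be summarized in a short sketch. The reserve callback below is a placeholder for the IOFSL key-value request that atomically advances the shared pointer on the I/O node; it is not an actual ZOIDFS/IOFSL interface.

    #include <mpi.h>

    // Placeholder for the server-side operation: atomically add 'nbytes' to
    // the shared pointer for this file and return its previous value.
    typedef MPI_Offset (*reserve_fn)(void *transport, MPI_Offset nbytes);

    int write_shared(MPI_File fh, const void *buf, int count, MPI_Datatype type,
                     reserve_fn reserve, void *transport, MPI_Status *status) {
        int type_size;
        MPI_Type_size(type, &type_size);
        MPI_Offset nbytes = (MPI_Offset)count * type_size;

        // 1. Atomically advance the shared file pointer at the I/O node and
        //    obtain the region reserved for this call.
        MPI_Offset start = reserve(transport, nbytes);

        // 2. Perform an ordinary independent write at the reserved offset; no
        //    file locks are needed because the pointer update was atomic.
        return MPI_File_write_at(fh, start, buf, count, type, status);
    }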

These modifications provide a low-overhead, file-system-independent, shared file pointer implementation for MPI-IO on those systems supported by IOFSL. Unlike other solutions, our implementation does not require a progress thread or hardware-supported remote memory access functionality [7].

Jason Cope, Kamil Iskra, Dries Kimpe, Robert Ross
Improvement of the Bandwidth of Cross-Site MPI Communication Using Optical Fiber

We perform a set of communication experiments spanning multiple sites on the heterogeneous Grid’5000 infrastructure in France. The backbone widely employs high-bandwidth optical fiber. Experiments with point-to-point MPI communications across sites show much lower bandwidth than expected for the optical fiber connections. This work proposes and tests an alternative implementation of cross-site point-to-point communication, exploiting the observation that a higher bandwidth can be reached when transferring TCP messages in parallel. It spawns additional MPI processes for point-to-point communication and significantly improves the bandwidth for large messages. The approach comes closer to the maximum bandwidth measured without using MPI.

Kiril Dichev, Alexey Lastovetsky, Vladimir Rychkov
Performance Tuning of SCC-MPICH by Means of the Proposed MPI-3.0 Tool Interface

The Single-Chip Cloud Computer (SCC) experimental processor is a 48-core concept vehicle created by Intel Labs as a platform for many-core software research. Intel provides a customized programming library for the SCC, called RCCE, that allows for fast message-passing between the cores. For that purpose, RCCE offers an application programming interface (API) with a semantics that is derived from the well-established MPI standard. However, while the MPI standard offers a very broad range of functions, the RCCE API is consciously kept small and far from implementing all the features of the MPI standard. For this reason, we have implemented an SCC-customized MPI library, called SCC-MPICH, which in turn is based upon an extension to the SCC-native RCCE communication library. In this contribution, we will present SCC-MPICH and we will show how performance analysis as well as performance tuning for this library can be conducted by means of a prototype of the proposed MPI-3.0 tool information interface.

Carsten Clauss, Stefan Lankes, Thomas Bemmerl
Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand

As part of the MPI-3 effort, the Remote Memory Access (RMA) working group has proposed several extensions to the one-sided communication interface. These extensions promise to address several of its existing limitations. However, their performance advantages need to be clearly highlighted for widespread acceptance of the new interface. In this paper, we present the design and implementation of some of the key one-sided semantics proposed for MPI-3 over InfiniBand, using the MVAPICH2 library. Our evaluation shows that the newly proposed Flush semantics allow for more efficient handling of completions and that the request-based operations can help achieve close to optimal overlap in a Get-Compute-Put model.
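
A sketch of a Get-Compute-Put step written against these interfaces as they eventually appeared in MPI-3 (MPI_Rget and MPI_Win_flush); the paper evaluates proposed versions of these semantics, and the computation shown here is a placeholder.

    #include <mpi.h>

    // Get-Compute-Put on a remote window using request-based RMA and flush.
    void get_compute_put(MPI_Win win, int target, double *local, int n) {
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

        // Request-based get: returns a request so other work can overlap
        // with the transfer instead of blocking until completion.
        MPI_Request req;
        MPI_Rget(local, n, MPI_DOUBLE, target, /*disp=*/0, n, MPI_DOUBLE,
                 win, &req);

        /* ... unrelated computation can proceed here while data arrives ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);        // fetched data is now usable

        for (int i = 0; i < n; ++i) local[i] *= 2.0;   // placeholder compute

        MPI_Put(local, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
        MPI_Win_flush(target, win);   // complete the put without closing the epoch

        MPI_Win_unlock(target, win);
    }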

Sreeram Potluri, Sayantan Sur, Devendar Bureddy, Dhabaleswar K. Panda
Scalable Distributed Consensus to Support MPI Fault Tolerance

As system sizes increase, the amount of time in which an application can run without experiencing a failure decreases. Exascale applications will need to address fault tolerance. In order to support algorithm-based fault tolerance, communication libraries will need to provide fault-tolerance features to the application. One important fault-tolerance operation is distributed consensus. This is used, for example, to collectively decide on a set of failed processes. This paper describes a scalable, distributed consensus algorithm that is used to support new MPI fault-tolerance features proposed by the MPI 3 Forum’s fault-tolerance working group. The algorithm was implemented and evaluated on a 4,096-core Blue Gene/P. The implementation was able to perform a full-scale distributed consensus in 305 μs and scaled logarithmically.

Darius Buntinas
Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault resilient MPI applications. The mission of the MPI Forum’s Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal. This proposal allows an application to continue execution even if MPI processes fail during execution. The discussion introduces the implications on point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.

Joshua Hursey, Richard L. Graham, Greg Bronevetsky, Darius Buntinas, Howard Pritchard, David G. Solt
Integrating MPI with Asynchronous Task Parallelism

This paper describes a programming model that integrates intra-node asynchronous task parallelism with inter-node MPI communications to address the hybrid parallelism challenges faced by future extreme scale systems. We explore the integration of MPI’s blocking and non-blocking communications with lightweight tasks. We also provide the implementation details of a non-blocking runtime execution model based on computation and communication workers.

Yonghong Yan, Sanjay Chatterjee, Zoran Budimlic, Vivek Sarkar
Performance Evaluation of Thread-Based MPI in Shared Memory

Although Unified Parallel C and OpenMP are being proposed for supporting multicore architectures more efficiently, the fact is that MPI is still used as a useful model on shared memory machines. Traditional mainstream MPI implementations such as MPICH2 and Open MPI build each MPI task as a process, an approach that presents some disadvantages in shared memory because message passing between processes usually requires two copies. One-copy communication can be achieved with operating system support, through kernel modules like KNEM or LiMIC, techniques like SMARTMAP, or special system calls like vmsplice, with disadvantages mainly in portability and limited improvements in performance. It is also possible to build each MPI task as a thread. This is not a new concept. Implementations such as TOMPI, TMPI, or the newer MPI Actor or MPC-MPI run an MPI node as a thread, each one stressing different goals. AzequiaMPI is a thread-based but still fully conformant open source implementation of the MPI-1.3 standard. AzequiaMPI shows that MPI performance can be significantly improved by fully exploiting a single shared address space.

Juan-Antonio Rico-Gallego, Juan-Carlos Díaz-Martín
MPI-DB, A Parallel Database Services Software Library for Scientific Computing

Large-scale scientific simulations generate petascale data sets subsequently analyzed by groups of researchers, often in databases. We developed a software library, MPI-DB, to provide database services to scientific computing applications. As a bridge between CPU-intensive and data-intensive computations, MPI-DB exploits massive parallelism within large databases to provide scalable, fast service. It is built as a client-server framework, using MPI, with the MPI-DB server acting as an intermediary between the user application running an MPI-DB client and the database servers. MPI-DB provides high-level objects such as multi-dimensional arrays, acting as an abstraction layer that effectively hides the database from the end user.

Edward Givelberg, Alexander Szalay, Kalin Kanov, Randal Burns
Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure

The runtime environment of MPI implementations plays a key role in launching the application, providing out-of-band communications, enabling I/O forwarding and bootstrapping of the connections of high-speed networks, and controlling the correct termination of the parallel application. In order to enable all these roles on an exascale parallel machine, which features hundreds of thousands of computing nodes (each of them featuring thousands of cores), scalability of the runtime environment must be a primary goal.

George Bosilca, Thomas Herault, Pierre Lemarinier, Ala Rezmerita, Jack J. Dongarra
Writing Parallel Libraries with MPI - Common Practice, Issues, and Extensions

Modular programming is an important software design concept. We discuss principles for programming parallel libraries, show several successful library implementations, and introduce a taxonomy for existing parallel libraries. We derive common requirements that parallel libraries pose on the programming framework. We then show how those requirements are supported in the Message Passing Interface (MPI) standard. We also note several potential pitfalls for library implementers using MPI. Finally, we conclude with a discussion of state-of-the art of parallel library programming and we provide some guidelines for library designers.
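
One of the classic requirements discussed is communicator isolation: a library should duplicate the communicator handed to it so its internal traffic cannot interfere with the application's messages. A minimal sketch, with a hypothetical library context object:

    #include <mpi.h>

    // Common practice for parallel libraries: duplicate the communicator
    // provided by the application so the library's internal point-to-point
    // and collective traffic can never match the application's own messages.
    struct solver {          // hypothetical library context object
        MPI_Comm comm;       // private communicator, owned by the library
    };

    void solver_init(struct solver *s, MPI_Comm user_comm) {
        MPI_Comm_dup(user_comm, &s->comm);    // collective over user_comm
    }

    void solver_finalize(struct solver *s) {
        MPI_Comm_free(&s->comm);
    }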

Torsten Hoefler, Marc Snir
Backmatter
Metadata
Title
Recent Advances in the Message Passing Interface
Editors
Yiannis Cotronis
Anthony Danalis
Dimitrios S. Nikolopoulos
Jack Dongarra
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-24449-0
Print ISBN
978-3-642-24448-3
DOI
https://doi.org/10.1007/978-3-642-24449-0
