2010 | Book

Recent Advances in the Message Passing Interface

17th European MPI Users’ Group Meeting, EuroMPI 2010, Stuttgart, Germany, September 12-15, 2010. Proceedings

Edited by: Rainer Keller, Edgar Gabriel, Michael Resch, Jack Dongarra

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

Table of Contents

Frontmatter

Large Scale Systems

A Scalable MPI_Comm_split Algorithm for Exascale Computing

Existing algorithms for creating communicators in MPI programs will not scale well to future exascale supercomputers containing millions of cores. In this work, we present a novel communicator-creation algorithm that does scale well into millions of processes using three techniques: replacing the sorting at the end of MPI_Comm_split with merging as the color and key table is built, sorting the color and key table in parallel, and using a distributed table to store the output communicator data rather than a replicated table. This reduces the time cost of MPI_Comm_split in the worst case we consider from 22 seconds to 0.37 second. Existing algorithms build a table with as many entries as processes, using vast amounts of memory. Our algorithm uses a small, fixed amount of memory per communicator after MPI_Comm_split has finished and uses a fraction of the memory used by the conventional algorithm for temporary storage during the execution of MPI_Comm_split.

Paul Sack, William Gropp
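
For readers less familiar with the operation being optimized, the minimal sketch below illustrates the standard color/key semantics of MPI_Comm_split only; it has nothing to do with the scalable algorithm presented in the paper.

```c
/* Minimal illustration of MPI_Comm_split semantics (not the paper's algorithm):
 * processes with the same color end up in the same new communicator,
 * ordered within it by key (here: their original rank). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int color = rank % 2;          /* split world into "even" and "odd" groups */
    int key   = rank;              /* keep the original ordering inside each group */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &subcomm);

    int subrank;
    MPI_Comm_rank(subcomm, &subrank);
    printf("world rank %d -> color %d, sub rank %d\n", rank, color, subrank);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
```
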
Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems

With the ever-increasing numbers of cores per node on HPC systems, applications are increasingly using threads to exploit the shared memory within a node, combined with MPI across nodes. Achieving high performance when a large number of concurrent threads make MPI calls is a challenging task for an MPI implementation. We describe the design and implementation of our solution in MPICH2 to achieve high-performance multithreaded communication on the IBM Blue Gene/P. We use a combination of a multichannel-enabled network interface, fine-grained locks, lock-free atomic operations, and specially designed queues to provide a high degree of concurrent access while still maintaining MPI’s message-ordering semantics. We present performance results that demonstrate that our new design improves the multithreaded message rate by a factor of 3.6 compared with the existing implementation on the BG/P. Our solutions are also applicable to other high-end systems that have parallel network access capabilities.

Gábor Dózsa, Sameer Kumar, Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Joe Ratterman, Rajeev Thakur
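
The scenario studied above presupposes that MPI is initialized with full thread support. A minimal, generic sketch of that initialization follows (standard MPI API only, independent of the Blue Gene/P-specific machinery in the paper).

```c
/* Requesting full multithreaded MPI support (MPI_THREAD_MULTIPLE), the mode
 * in which several threads may call MPI concurrently and which the paper's
 * optimizations target. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got level %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... threads created here (e.g., with pthreads) may each issue
     *     MPI_Send/MPI_Recv calls concurrently ... */
    MPI_Finalize();
    return 0;
}
```
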
Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues

Designing and tuning parallel applications with MPI, particularly at large scale, requires understanding the performance implications of different choices of algorithms and implementation options. Which algorithm is better depends in part on the performance of the different possible communication approaches, which in turn can depend on both the system hardware and the MPI implementation. In the absence of detailed performance models for different MPI implementations, application developers often must select methods and tune codes without the means to realistically estimate the achievable performance and rationally defend their choices. In this paper, we advocate the construction of more useful performance models that take into account limitations on network-injection rates and effective bisection bandwidth. Since collective communication plays a crucial role in enabling scalability, we also provide analytical models for scalability of collective communication algorithms, such as broadcast, allreduce, and all-to-all. We apply these models to an IBM Blue Gene/P system and compare the analytical performance estimates with experimentally measured values.

Torsten Hoefler, William Gropp, Rajeev Thakur, Jesper Larsson Träff
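
As a point of reference only (and not one of the refined models developed in the paper), the textbook latency-bandwidth estimate for a binomial-tree broadcast of an n-byte message among p processes is shown below, where α is the per-message latency and β the transfer time per byte; the paper's argument is precisely that such simple models omit injection-rate and effective-bisection-bandwidth limits.

```latex
% Classic alpha-beta broadcast estimate (binomial tree); a baseline only,
% not one of the refined models of the paper:
T_{\mathrm{bcast}}(p, n) \;\approx\; \lceil \log_2 p \rceil \,(\alpha + n\beta)
```
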
PMI: A Scalable Parallel Process-Management Interface for Extreme-Scale Systems

Parallel programming models on large-scale systems require a scalable system for managing the processes that make up the execution of a parallel program. The process-management system must be able to launch millions of processes quickly when starting a parallel program and must provide mechanisms for the processes to exchange the information needed to enable them to communicate with each other. MPICH2 and its derivatives achieve this functionality through a carefully defined interface, called PMI, that allows different process managers to interact with the MPI library in a standardized way. In this paper, we describe the features and capabilities of PMI. We describe both PMI-1, the current generation of PMI used in MPICH2 and all its derivatives, and PMI-2, the second generation of PMI that eliminates various shortcomings in PMI-1. Together with the interface itself, we describe a reference implementation of both PMI-1 and PMI-2 in a new process-management framework within MPICH2, called Hydra, and compare their performance in running MPI jobs with thousands of processes.

Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Jayesh Krishna, Ewing Lusk, Rajeev Thakur
Run-Time Analysis and Instrumentation for Communication Overlap Potential

Blocking communication can be optimized at run time into non-blocking communication using memory protection and replacement of MPI functions. All such optimizations come with overhead, meaning no automatic optimization can reach the performance level of hand-optimized code. In this paper, we present a method for using previously published runtime optimizers to instrument a program, including measured speedup gains and overhead. The results are connected with the program symbol table and presented to the user as a series of source code transformations. Each series indicates which optimizations were performed and what the expected saving in wallclock time is if the optimization is done by hand.

Thorvald Natvig, Anne C. Elster
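
The class of transformation whose payoff such a tool estimates can be pictured with the generic hand-overlap sketch below; it is illustrative only and not the tool's automatic rewrite (the names compute() and exchange() are placeholders).

```c
/* Overlap sketch: a blocking MPI_Send is replaced by MPI_Isend + MPI_Wait so
 * that compute() runs while the message is in flight. This is the generic
 * hand transformation, not the instrumented rewrite described in the paper. */
#include <mpi.h>

static void compute(void) { /* application work overlapped with the send */ }

void exchange(double *buf, int n, int peer, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, peer, 0, comm, &req); /* start the send   */
    compute();                                          /* must not touch buf */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                  /* complete the send */
}
```
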
Efficient MPI Support for Advanced Hybrid Programming Models

The number of multithreaded Message Passing Interface (MPI) implementations and applications is increasing rapidly. We discuss how multithreaded applications can receive messages of unknown size. As is well known, combining MPI_Probe/MPI_Recv is not thread-safe, but many assume that trivial workarounds exist. We discuss those workarounds and show how they fail in practice by either limiting the available parallelism unnecessarily, consuming resources in a non-scalable way, or promoting global deadlocks. In this light, we propose two fundamentally different efficient approaches to enable thread-safe messaging in MPI-2.2: fine-grained locking and matching outside of MPI. Our approaches provide thread-safe probe and receive functionality, but both have deficiencies, including performance limitations and programming complexity, that could be avoided if MPI offered a thread-safe (stateless) interface to MPI_Probe. We propose such an extension for the upcoming MPI-3 standard, provide a reference implementation, and demonstrate significant performance benefits.

Torsten Hoefler, Greg Bronevetsky, Brian Barrett, Bronis R. de Supinski, Andrew Lumsdaine
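
The stateless, thread-safe probe argued for here subsequently entered MPI-3 as the matched probe. The sketch below uses that MPI-3 interface (MPI_Mprobe/MPI_Mrecv) to receive a message of unknown size safely, even when several threads probe concurrently.

```c
/* Thread-safe receive of a message of unknown size using MPI-3 matched probe:
 * MPI_Mprobe removes the message from the matching queues, so no other
 * thread can "steal" it between the probe and the receive. */
#include <mpi.h>
#include <stdlib.h>

void recv_unknown_size(int source, int tag, MPI_Comm comm)
{
    MPI_Message msg;
    MPI_Status  status;
    MPI_Mprobe(source, tag, comm, &msg, &status);    /* match and dequeue    */

    int count;
    MPI_Get_count(&status, MPI_BYTE, &count);        /* size in bytes        */
    char *buf = malloc(count);
    MPI_Mrecv(buf, count, MPI_BYTE, &msg, &status);  /* receive that message */
    /* ... use buf ... */
    free(buf);
}
```
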

Parallel Filesystems and I/O

An HDF5 MPI Virtual File Driver for Parallel In-situ Post-processing

With simulation codes becoming more powerful, using more and more resources, and producing larger and larger data, monitoring or post-processing simulation data in-situ has obvious advantages over the conventional approach of saving to – and reloading data from – the file system. The time it takes to write and then read the data from disk is a significant bottleneck for both the simulation and subsequent post-processing. In order to be able to post-process data as efficiently as possible with minimal disruption to the simulation itself, we have developed a parallel virtual file driver for the HDF5 library which acts as an MPI-IO virtual file layer, allowing the simulation to write in parallel to remotely located distributed shared memory instead of writing to disk.

Jerome Soumagne, John Biddiscombe, Jerry Clarke
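
For orientation, the sketch below shows how an application ordinarily selects HDF5's standard MPI-IO virtual file driver; the driver described in the paper plugs in at this same property-list level but redirects parallel writes to distributed shared memory. The DSM driver API itself is not shown, and the helper name is illustrative.

```c
/* Standard parallel HDF5 file creation through the MPI-IO virtual file
 * driver. An in-situ driver of the kind described in the paper is selected
 * at the same place (the file-access property list) instead of MPI-IO. */
#include <hdf5.h>
#include <mpi.h>

hid_t create_parallel_file(const char *name, MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* choose the MPI-IO VFD   */
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;                                   /* caller calls H5Fclose() */
}
```
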
Automated Tracing of I/O Stack

Efficient execution of parallel scientific applications requires high-performance storage systems designed to meet their I/O requirements. Most high-performance I/O intensive applications access multiple layers of the storage stack during their disk operations. A typical I/O request from these applications may include accesses to high-level libraries such as MPI I/O, executing on clustered parallel file systems like PVFS2, which are in turn supported by native file systems like Linux. In order to design and implement parallel applications that exercise this I/O stack, it is important to understand the flow of I/O calls through the entire storage system. Such understanding helps in identifying the potential performance and power bottlenecks in different layers of the storage hierarchy. To trace the execution of the I/O calls and to understand the complex interactions of multiple user-libraries and file systems, we propose an automatic code instrumentation technique, which enables us to collect detailed statistics of the I/O stack. Our proposed I/O tracing tool traces the flow of I/O calls across different layers of an I/O stack, and can be configured to work with different file systems and user-libraries. It also analyzes the collected information to generate output in terms of different user-specified metrics of interest.

Seong Jo Kim, Yuanrui Zhang, Seung Woo Son, Ramya Prabhakar, Mahmut Kandemir, Christina Patrick, Wei-keng Liao, Alok Choudhary
MPI Datatype Marshalling: A Case Study in Datatype Equivalence

MPI datatypes are a convenient abstraction for manipulating complex data structures and are useful in a number of contexts. In some cases, these descriptions need to be preserved on disk or communicated between processes, such as when defining RMA windows. We propose an extension to MPI that enables marshalling and unmarshalling MPI datatypes in the spirit of MPI_Pack/MPI_Unpack. Issues in MPI datatype equivalence are discussed in detail and an implementation of the new interface outside of MPI is presented. The new marshalling interface provides a mechanism for serializing all aspects of an MPI datatype: the typemap, upper/lower bounds, name, contents/envelope information, and attributes.

Dries Kimpe, David Goodell, Robert Ross
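
As background, the MPI_Pack/MPI_Unpack pair referenced above serializes data values, not datatype descriptions; the proposed marshalling interface does the analogous thing for the datatypes themselves. A minimal data-packing sketch using the standard calls:

```c
/* MPI_Pack/MPI_Unpack serialize *data* into a portable buffer. The interface
 * proposed in the paper applies the same idea to the MPI datatype description
 * itself (typemap, bounds, name, contents/envelope information, attributes). */
#include <mpi.h>

void pack_example(int ival, double dval, MPI_Comm comm)
{
    char buf[64];
    int  pos = 0;
    MPI_Pack(&ival, 1, MPI_INT,    buf, sizeof buf, &pos, comm);
    MPI_Pack(&dval, 1, MPI_DOUBLE, buf, sizeof buf, &pos, comm);
    /* the first 'pos' bytes of buf can now be sent as MPI_PACKED or stored */

    int iout; double dout; pos = 0;
    MPI_Unpack(buf, sizeof buf, &pos, &iout, 1, MPI_INT,    comm);
    MPI_Unpack(buf, sizeof buf, &pos, &dout, 1, MPI_DOUBLE, comm);
}
```
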

Collective Operations

Design of Kernel-Level Asynchronous Collective Communication

Overlapping computation and communication, not only point-to-point but also collective communications, is an important technique to improve the performance of parallel programs. Since current non-blocking collective communications have mostly been implemented using an extra thread to progress communication, they incur extra overhead due to thread scheduling and context switching. In this paper, a new non-blocking communication facility, called KACC, is proposed to provide fast asynchronous collective communications. KACC is implemented in the OS kernel interrupt context to perform non-blocking asynchronous collective operations without an extra thread. The experimental results show that the CPU time cost of this method is sufficiently small.

Akihiro Nomura, Yutaka Ishikawa
Network Offloaded Hierarchical Collectives Using ConnectX-2’s CORE-Direct Capabilities

As the scale of High Performance Computing (HPC) systems continues to increase, demanding that we extract even more parallelism from applications, the need to move communication management away from the Central Processing Unit (CPU) becomes even greater. Moving this management to the network frees up CPU cycles for computation, making it possible to overlap computation and communication. In this paper we continue to investigate how to best use the new CORE-Direct support added in the ConnectX-2 Host Channel Adapter (HCA) for creating high-performance, asynchronous collective operations that are managed by the HCA. Specifically, we consider the network topology, creating a two-level communication hierarchy that reduces the MPI_Barrier completion time by 45%, from 26.59 microseconds when network topology is not considered to 14.72 microseconds; the CPU-based collective barrier operation completes in 19.04 microseconds. The nonblocking barrier algorithm has similar performance, with about 50% of that time available for computation.

Ishai Rabinovitz, Pavel Shamis, Richard L. Graham, Noam Bloch, Gilad Shainer
An In-Place Algorithm for Irregular All-to-All Communication with Limited Memory

In this article, we propose an in-place algorithm for irregular all-to-all communication corresponding to the MPI_Alltoallv operation. This in-place algorithm uses a single message buffer and replaces the outgoing messages with the incoming messages. In comparison to existing support for in-place communication in MPI, the proposed algorithm for MPI_Alltoallv has no restriction on the message sizes and displacements. The algorithm requires memory whose size does not depend on the message sizes. Additional memory of arbitrary size can be used to improve its performance. Performance results for a Blue Gene/P system are shown to demonstrate the performance of the approach.

Michael Hofmann, Gudula Rünger
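
For context, the existing in-place support referred to above is the MPI_IN_PLACE variant of MPI_Alltoallv, which exchanges data directly inside the receive buffer but requires a symmetric exchange (each pair of processes must send and receive equal amounts). The sketch below shows that restricted standard form; the algorithm in the paper removes the restriction.

```c
/* Standard in-place MPI_Alltoallv: the send arguments are omitted and data
 * is exchanged directly inside recvbuf. This form requires process i to send
 * to j exactly as much as it receives from j, the restriction lifted by the
 * algorithm in the paper. */
#include <mpi.h>

void inplace_alltoallv(double *recvbuf, int *recvcounts, int *rdispls,
                       MPI_Comm comm)
{
    MPI_Alltoallv(MPI_IN_PLACE, NULL, NULL, MPI_DATATYPE_NULL,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE, comm);
}
```
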

Applications

Massively Parallel Finite Element Programming

Today’s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes.

Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundred cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability.

Timo Heister, Martin Kronbichler, Wolfgang Bangerth
Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient Using MPI Datatypes

Many parallel applications need to communicate non-contiguous data. Most applications manually copy (pack/unpack) data before communications even though MPI allows a zero-copy specification. In this work, we study two complex use-cases: (1) Fast Fourier Transformation, where we express a local memory transpose as part of the datatype, and (2) a conjugate gradient solver with a checkerboard layout that requires multiple nested datatypes. We demonstrate significant speedups, up to a factor of 3.8 and 18%, respectively, in the two cases. Our work can be used by application developers as a template for utilizing datatypes. For MPI implementers, we show two practically relevant access patterns that deserve special optimization.

Torsten Hoefler, Steven Gottlieb
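
To give a flavor of the zero-copy specification mentioned above, the sketch below shows the standard derived-datatype route for sending a strided column of a row-major matrix without packing; the datatypes used in the paper (a transpose for the FFT, a checkerboard layout for CG) are more elaborate nested constructions.

```c
/* Zero-copy send of one column of a row-major nrows x ncols matrix: the
 * strided layout is described by a derived datatype instead of being packed
 * into a contiguous staging buffer. */
#include <mpi.h>

void send_column(double *matrix, int nrows, int ncols,
                 int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column); /* 1 element every ncols */
    MPI_Type_commit(&column);
    MPI_Send(&matrix[col], 1, column, dest, 0, comm);      /* no manual packing */
    MPI_Type_free(&column);
}
```
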
Parallel Chaining Algorithms

Given a set of weighted hyper-rectangles in a k-dimensional space, the chaining problem is to identify a set of collinear and non-overlapping hyper-rectangles of maximal total weight. This problem is used in a number of applications in bioinformatics, string processing, and VLSI design. In this paper, we present parallel versions of the chaining algorithm for bioinformatics applications, running on multi-core and computer cluster architectures. Furthermore, we present experimental results of our implementations on both architectures.

Mohamed Abouelhoda, Hisham Mohamed

MPI Internals (I)

Precise Dynamic Analysis for Slack Elasticity: Adding Buffering without Adding Bugs

Increasing the amount of buffering for MPI sends is an effective way to improve the performance of MPI programs. However, for programs containing non-deterministic operations, this can result in new deadlocks or other safety assertion violations. Previous work did not provide any characterization of the space of slack elastic programs: those for which buffering can be safely added. In this paper, we offer a precise characterization of slack elasticity based on our formulation of MPI's happens-before relation. We show how to efficiently locate potential culprit sends in such programs: MPI sends for which adding buffering can increase overall program non-determinism and cause new bugs. We present a procedure to minimally enumerate potential culprit sends and efficiently check for slack elasticity. Our results demonstrate that our new algorithm, called POE_MSE and incorporated into our dynamic verifier ISP, can efficiently run this new analysis on large MPI programs.

Sarvani Vakkalanka, Anh Vo, Ganesh Gopalakrishnan, Robert M. Kirby
Implementing MPI on Windows: Comparison with Common Approaches on Unix

Commercial HPC applications are often run on clusters that use the Microsoft Windows operating system and need an MPI implementation that runs efficiently in the Windows environment. The MPI developer community, however, is more familiar with the issues involved in implementing MPI in a Unix environment. In this paper, we discuss some of the differences in implementing MPI on Windows and Unix, particularly with respect to issues such as asynchronous progress, process management, shared-memory access, and threads. We describe how we implement MPICH2 on Windows and exploit these Windows-specific features while still maintaining large parts of the code common with the Unix version. We also present performance results comparing the performance of MPICH2 on Unix and Windows on the same hardware. For zero-byte MPI messages, we measured excellent shared-memory latencies of 240 and 275 nanoseconds on Unix and Windows, respectively.

Jayesh Krishna, Pavan Balaji, Ewing Lusk, Rajeev Thakur, Fabian Tillier
Compact and Efficient Implementation of the MPI Group Operations

We describe a more compact representation of MPI process groups based on strided, partial sequences that can support all group and communicator creation operations in time proportional to the size of the argument groups. The worst-case lookup time (to determine the global processor id corresponding to a local process rank) is logarithmic, but often better (constant), and can be traded against maximum possible compaction. Many commonly used MPI process groups can be represented in constant space with constant lookup time, for instance the process group of MPI_COMM_WORLD and all consecutive subgroups of this group, but also many, many others. The representation never uses more than one word per process, but often much less, and is in this sense strictly better than the trivial, often-used representation by means of a simple mapping array. The data structure and operations have all been implemented, and experiments show very worthwhile space savings for classes of process groups that are believed to be typical of MPI applications.

Jesper Larsson Träff
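
The strided sequences underlying this representation mirror how MPI already lets applications specify groups. As an illustration of a strided group built through the standard API (this is not the paper's internal data structure):

```c
/* A strided process group (every 4th rank of MPI_COMM_WORLD) described by a
 * single (first, last, stride) range triple. The paper's representation keeps
 * groups internally as such strided partial sequences rather than as a full
 * rank-mapping array. */
#include <mpi.h>

void make_strided_comm(MPI_Comm *newcomm)
{
    MPI_Group world, quarter;
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &world);

    int range[1][3] = { { 0, size - 1, 4 } };        /* ranks 0, 4, 8, ...    */
    MPI_Group_range_incl(world, 1, range, &quarter);

    MPI_Comm_create(MPI_COMM_WORLD, quarter, newcomm); /* others get MPI_COMM_NULL */
    MPI_Group_free(&quarter);
    MPI_Group_free(&world);
}
```
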
Characteristics of the Unexpected Message Queue of MPI Applications

High Performance Computing systems are used on a regular basis to run a myriad of application codes, yet a surprising dearth of information exists with respect to their communication characteristics. Even less information is available on the low-level communication libraries, such as the length of MPI Unexpected Message Queues (UMQs) and the length of time such messages spend in these queues. Such information is vital to developing appropriate strategies for handling such data at the library and system level. In this paper we present data on the communication characteristics of three applications: GTC, LSMS, and S3D. We present data on the size of their UMQs, the time spent searching the UMQ, and the length of time messages spend in these queues. We find that, for the particular inputs used, these applications have widely varying characteristics with regard to UMQ length, and we show patterns for specific applications which persist over various scales.

Rainer Keller, Richard L. Graham

Fault Tolerance

Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols

With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Nevertheless, few applications are fault tolerant; most are in need of a seamless recovery framework. Among the automatic fault-tolerance techniques proposed for MPI, message logging is preferable for its scalable recovery. The major challenge for message logging protocols is the performance penalty on communications during failure-free periods, mostly coming from the payload copy introduced for each message. In this paper, we investigate different approaches for logging payload and compare their impact on network performance.

George Bosilca, Aurelien Bouteiller, Thomas Herault, Pierre Lemarinier, Jack J. Dongarra
Communication Target Selection for Replicated MPI Processes

VolpexMPI is an MPI library designed for volunteer computing environments. In order to cope with the fundamental unreliability of these environments, VolpexMPI deploys two or more replicas of each MPI process. A receiver-driven communication scheme is employed to eliminate redundant message exchanges, and sender-based logging is employed to ensure seamless application progress with varying processor execution speeds and routine failures. In this model, to execute a receive operation, a decision has to be made as to which of the sending process replicas should be contacted first. Contacting the fastest replica appears to be the optimal local decision, but it can be globally non-optimal as it may slow down the fastest replica. Further, identifying the fastest replica during execution is a challenge in itself. This paper evaluates various target selection algorithms to manage these trade-offs with the objective of minimizing the overall execution time. The algorithms are evaluated for the NAS Parallel Benchmarks utilizing heterogeneous network configurations, heterogeneous processor configurations, and a combination of both.

Rakhi Anand, Edgar Gabriel, Jaspal Subhlok
Transparent Redundant Computing with MPI

Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, if the benefits of increased resiliency outweigh the cost of consuming additional resources. We describe a transparent redundancy approach for MPI applications and detail two different implementations that provide the ability to tolerate a range of failure scenarios, including loss of application processes and connectivity. We compare these two approaches and show performance results from micro-benchmarks that bound worst-case message passing performance degradation. We propose several enhancements that could lower the overhead of providing resiliency through redundancy.

Ron Brightwell, Kurt Ferreira, Rolf Riesen
Checkpoint/Restart-Enabled Parallel Debugging

Debugging is often the most time consuming part of software development. HPC applications prolong the debugging process by adding more processes interacting in dynamic ways for longer periods of time. Checkpoint/restart-enabled parallel debugging returns the developer to an intermediate state closer to the bug. This focuses the debugging process, saving developers considerable amounts of time, but requires parallel debuggers cooperating with MPI implementations and checkpointers. This paper presents a design specification for such a cooperative relationship. Additionally, this paper discusses the application of this design to the GDB and DDT debuggers, Open MPI, and BLCR projects.

Joshua Hursey, Chris January, Mark O’Connor, Paul H. Hargrove, David Lecomber, Jeffrey M. Squyres, Andrew Lumsdaine

Best Paper Awards

Load Balancing for Regular Meshes on SMPs with MPI

Domain decomposition for regular meshes on parallel computers has traditionally been performed by attempting to partition the work exactly among the available processors (now cores). However, these strategies often do not consider the inherent system noise, which can hinder MPI application scalability on emerging peta-scale machines with 10000+ nodes. In this work, we suggest a solution that uses a tunable hybrid static/dynamic scheduling strategy that can be incorporated into current MPI implementations of mesh codes. By applying this strategy to a 3D Jacobi algorithm, we achieve performance gains of at least 16% for 64 SMP nodes.

Vivek Kale, William Gropp
Adaptive MPI Multirail Tuning for Non-uniform Input/Output Access

Multicore processors have not only reintroduced Non-Uniform Memory Access (NUMA) architectures in today's parallel computers, but they are also responsible for non-uniform access times with respect to Input/Output devices (NUIOA). In clusters of multicore machines equipped with several network interfaces, the performance of communication between processes thus depends on which cores these processes are scheduled on and on their distance to the Network Interface Cards involved. We propose a technique allowing multirail communication between processes to carefully distribute data among the network interfaces so as to counterbalance NUIOA effects. We demonstrate the relevance of our approach by evaluating its implementation within Open MPI on a Myri-10G + InfiniBand cluster.

Stéphanie Moreaud, Brice Goglin, Raymond Namyst
Using Triggered Operations to Offload Collective Communication Operations

Efficient collective operations are a major component of application scalability. Offload of collective operations onto the network interface reduces many of the latencies that are inherent in network communications and, consequently, reduces the time to perform the collective operation. To support offload, it is desirable to expose semantic building blocks that are simple to offload and yet powerful enough to implement a variety of collective algorithms. This paper presents the implementation of barrier and broadcast leveraging triggered operations — a semantic building block for collective offload. Triggered operations are shown to be both semantically powerful and capable of improving performance.

K. Scott Hemmert, Brian Barrett, Keith D. Underwood

MPI Internals (II)

Second-Order Algorithmic Differentiation by Source Transformation of MPI Code

A source transformation tool for algorithmic differentiation is introduced, capable of transforming MPI-enabled code into second-order adjoint code. Our derivative code compiler (dcc) is used for the source transformation while a runtime library handles the adjoining of the MPI routines. This paper describes in detail the link between these two components in order to compute second derivatives. This process is illustrated by a simplified parallel implementation of Burgers’ equation in a second-order optimization setting, for example, Newton’s method.

Michel Schanen, Michael Förster, Uwe Naumann
Locality and Topology Aware Intra-node Communication among Multicore CPUs

A major trend in HPC is the escalation toward manycore, where systems are composed of shared memory nodes featuring numerous processing units. Unfortunately, with scale comes complexity, here in the form of non-uniform memory accesses and cache hierarchies. For most HPC applications, harnessing the power of multicores is hindered by the topology-oblivious tuning of the MPI library. In this paper, we propose a framework to tune every type of shared memory communication according to locality and topology. An implementation inside Open MPI is evaluated experimentally and demonstrates significant speedups compared to vanilla Open MPI and MPICH2.

Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra
Transparent Neutral Element Elimination in MPI Reduction Operations

We describe simple and easy-to-implement MPI library internal functionality that enables MPI reduction operations to be performed more efficiently with increasing sparsity (fraction of neutral elements for the given operator) of the input (and intermediate result) vectors. Using this functionality, we give an implementation of the MPI_Reduce collective operation that, completely transparently to the application programmer, exploits sparsity of both input and intermediate result vectors. Experiments carried out on a 64-core Intel Nehalem multi-core cluster with InfiniBand interconnect show considerable and worthwhile improvements as the sparsity of the input grows, about a factor of three with 1% non-zero elements, which is close to the best possible for the approach. The overhead incurred for dense vectors is negligible when compared to the same implementation not exploiting sparsity of input and intermediate results. The implemented SPS_Reduce function is faster than the native MPI_Reduce of the MPI library used, for both very small and large vectors, indicating that the improvements reported are not artifacts of suboptimal reduction algorithms.

Jesper Larsson Träff

Poster Abstracts

Use Case Evaluation of the Proposed MPIT Configuration and Performance Interface

In this contribution, we present our experiences gained while prototyping the MPIT Configuration and Performance Interface that is currently under discussion for integration into the next MPI standard. The work is based on an API draft recently released by the MPI Tools Working Group [1]. As a use case, we have developed a simple tuning tool on top of the proposed MPIT interface that can help to optimize protocol thresholds of MPI implementations according to the communication characteristics of the respective applications.

Carsten Clauss, Stefan Lankes, Thomas Bemmerl
Two Algorithms of Irregular Scatter/Gather Operations for Heterogeneous Platforms

In this work we present two algorithms of irregular scatter/gather operations based on the binomial tree and Träff algorithms. We use the prediction provided by heterogeneous communication performance models when constructing communication trees for these operations. The experiments show that the model-based algorithms outperform the traditional ones on heterogeneous platforms.

Kiril Dichev, Vladimir Rychkov, Alexey Lastovetsky
Measuring Execution Times of Collective Communications in an Empirical Optimization Framework

An essential part of an empirical optimization library is the set of timing procedures with which the performance of different codelets is determined. In this paper, we present four different timing methods to optimize collective MPI communications and compare their accuracy for the FFT NAS Parallel Benchmarks on a variety of systems with different MPI implementations. We find that timing larger code portions with infrequent synchronizations performs well on all systems.

Katharina Benkert, Edgar Gabriel
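
One of the simplest schemes in such a comparison, per-call timing fenced by barriers, can be sketched as follows (a generic sketch, not the framework's actual codelets); the paper's conclusion favors timing larger code portions with fewer synchronizations.

```c
/* Barrier-synchronized timing of a collective: every iteration is fenced by
 * MPI_Barrier and the per-process maximum is reduced to rank 0. Schemes that
 * synchronize less frequently are compared against this in the paper. */
#include <mpi.h>
#include <stdlib.h>

double time_allreduce(int n, int iters, MPI_Comm comm)
{
    double *in  = calloc(n, sizeof *in);
    double *out = calloc(n, sizeof *out);
    double local = 0.0, global;

    for (int i = 0; i < iters; i++) {
        MPI_Barrier(comm);                  /* synchronize before each call */
        double t0 = MPI_Wtime();
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, comm);
        local += MPI_Wtime() - t0;
    }
    /* the slowest process determines the reported time */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    free(in); free(out);
    return global / iters;                  /* meaningful on rank 0 only */
}
```
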
Dynamic Verification of Hybrid Programs

Hybrid (mixed MPI/thread) programs are extremely important for efficiently programming future HPC systems. In this paper, we report our experience adapting ISP [3,4,5], our dynamic verifier for MPI programs, to verify a large hybrid MPI/Pthread program called Eddy Murphi [1]. ISP is a stateless model checker that works by replaying schedules leading up to previously recorded non-deterministic selection points and pursuing new behaviors out of these points. The main difficulty we faced was the inability to deterministically replay up to these selection points, because ISP instruments only the MPI calls issued by an application, whereas thread-level scheduling non-determinism may change the course of execution. Instrumenting both MPI and Pthreads API calls would require an invasive modification of ISP, which was not favored. The novelty of our solution is to determinize thread schedules using a record/replay daemon and to demonstrate that this approach works on a realistic hybrid application: the Eddy Murphi model checker.

Wei-Fan Chiang, Grzegorz Szubzda, Ganesh Gopalakrishnan, Rajeev Thakur
Challenges and Issues of Supporting Task Parallelism in MPI

Task parallelism deals with the extraction of the potential parallelism of irregular structures, which vary according to the input data, through a definition of abstract tasks and their dependencies. Shared-memory APIs, such as OpenMP and TBB, support this model and ensure performance thanks to an efficient scheduling of tasks. In this work, we provide arguments favoring the support of task parallelism in MPI. We explain how native MPI can be used to define tasks, their dependencies, and their runtime scheduling. We also discuss performance issues. Our preliminary experiments show that it is possible to implement efficient task-parallel MPI programs and to increase the range of applications covered by the MPI standard.

Márcia C. Cera, João V. F. Lima, Nicolas Maillard, Philippe O. A. Navaux
Backmatter
Metadata
Title
Recent Advances in the Message Passing Interface
Edited by
Rainer Keller
Edgar Gabriel
Michael Resch
Jack Dongarra
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-15646-5
Print ISBN
978-3-642-15645-8
DOI
https://doi.org/10.1007/978-3-642-15646-5
