2025 | Book

Recent Advances in the Message Passing Interface

31st European MPI Users' Group Meeting, EuroMPI 2024, Perth, WA, Australia, September 25–27, 2024, Proceedings


About this book

This LNCS volume constitutes the refereed proceedings of the 31st European MPI Users' Group Meeting, EuroMPI 2024, held in Perth, WA, Australia, during September 25–27, 2024.

The 8 full papers presented here were carefully reviewed and selected from 19 submissions. They are organized in the following topical sections: Compile-Time Correctness Checks and Optimization; Limitations and Extensions for GPGPUs in MPI; Improvements for MPI; and MPI Ecosystem.

Table of Contents

Frontmatter

Compile-Time Correctness Checks and Optimization

Frontmatter
SPMD IR: Unifying SPMD and Multi-value IR Showcased for Static Verification of Collectives
Abstract
To effectively utilize modern HPC clusters, inter-node communication and the associated single program, multiple data (SPMD) parallel programming models such as MPI are unavoidable. Current tools and compilers that analyze SPMD models are often limited to a single model or implement the necessary abstraction internally. As a result, neither the analysis nor the abstraction effort is reusable, and the tool cannot be extended to other models without extensive changes to the tool itself.
This work proposes an SPMD IR as part of a multi-layer program representation, together with accompanying compiler passes that explicitly express the results of abstraction and multi-value analysis. The SPMD IR makes the executing processes of operations explicit and differentiates between static and dynamic cases. It is implemented as a prototype in the MLIR/LLVM infrastructure and comprises the SPMD dialect and two compiler passes, supporting MPI, SHMEM, and NCCL, including hybrid cases.
To evaluate the proposed IR, verification of collective communication was chosen as a use case. For that, this work reimplements and extends PARCOACH's static approach on the SPMD IR and assesses it with an expanded micro-benchmark suite in MPI, SHMEM, and NCCL. Achieving similar detection accuracy, the evaluation shows that the SPMD IR's level of abstraction is strong enough for PARCOACH's analyses and generic enough for increased extensibility. The prototype also constitutes the first collectives verification for SHMEM, NCCL, and their combinations (with MPI).
Semih Burak, Ivan R. Ivanov, Jens Domke, Matthias Müller
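For context, the following is a minimal, hypothetical MPI example of the kind of defect that static verification of collectives targets; it is written for illustration here and is not taken from the paper's benchmark suite.

/* Illustrative only: a rank-dependent collective call of the kind that
 * static collective-verification tools flag.  Even-numbered ranks reach
 * MPI_Reduce while the others skip it, so the collective cannot match
 * across the communicator and the program can deadlock. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 1, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank % 2 == 0) {
        /* Only a subset of the ranks participates in the collective. */
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}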
Annotation of Compiler Attributes for MPI Functions
Abstract
This paper explores the use of LLVM IR function and parameter attributes to enhance compiler optimizations for code that uses MPI. As MPI is usually used as a dynamically linked library, the compiler is not able to automatically infer certain function attributes like nofree, which signals that no memory is deallocated in this function. Therefore, we implemented an LLVM compiler pass that annotates the used MPI functions with suitable attributes when compiling the user application. We manually derived applicable attributes based on the semantics described in the MPI standard, so that this approach is applicable to all MPI implementations.
We showcase different cases where these additional annotations impact the code generated by the compiler for the MiniApps from the Exascale Proxy Applications Project. The addition of MPI function annotations allows for a variety of compiler optimizations, such as reducing unnecessary memory accesses, optimizing register usage, and streamlining control flow.
The code of our annotation pass is available on GitHub: https://github.com/AdrSchm/mpi-attributes-pass.
Tim Jammer, Adrian Schmidt, Christian Bischof
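A minimal sketch of the effect such annotations target follows; the call site and the conditional reasoning are illustrative assumptions, and the exact attributes derived by the pass are described in the paper, not reproduced here.

/* Illustrative only (not from the paper): why attribute knowledge about MPI
 * calls matters.  `flag` is externally visible, so without annotations the
 * compiler must assume MPI_Send could modify it and has to reload it after
 * the call.  If the call site is annotated as not writing caller-visible
 * memory beyond its arguments, the second read can be folded away and `flag`
 * kept in a register. */
#include <mpi.h>

extern int flag;   /* externally visible application state */

int exchange(const double *buf, int count, int peer) {
    int before = flag;
    MPI_Send(buf, count, MPI_DOUBLE, peer, /* tag */ 0, MPI_COMM_WORLD);
    return before + flag;   /* redundant reload if MPI_Send cannot touch flag */
}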

Limitations and Extensions for GPGPUs in MPI

Frontmatter
Understanding GPU Triggering APIs for MPI+X Communication
Abstract
GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast stream/graph-, kernel-triggered, and GPU-initiated MPI communication abstractions, whose principal purpose is to enhance the performance of communication when GPU kernels create or consume data for transfer through MPI operations. Researchers and practitioners have proposed multiple potential APIs for GPU-involved communication that span various GPU architectures and approaches, including MPI-4 partitioned point-to-point communication, stream communicators, and explicit MPI stream/queue objects. Designs breaking backward compatibility with MPI are duly noted. Some of these strengthen or weaken the semantics of MPI operations. A key contribution of this paper is to promote community convergence toward common abstractions for GPU-involved communication by highlighting the common and differing goals and contributions of existing abstractions. We describe the design space in which these abstractions reside, their implicit or explicit use of stream and other non-MPI abstractions, their relationship to partitioned and persistent operations, and discuss their potential for added performance, how usable these abstractions are, and where functional and/or semantic gaps exist. Finally, we provide a taxonomy for these abstractions, including disambiguation of similar semantic terms, and consider directions for future standardization in MPI-5.
Patrick G. Bridges, Anthony Skjellum, Evan D. Suggs, Derek Schafer, Purushotham V. Bangalore
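For context, here is a minimal host-side sketch of MPI-4 partitioned point-to-point communication, one of the building blocks the survey discusses. The GPU-triggered variants examined in the paper would move the per-partition readiness notification into kernels or streams; those proposed APIs are not reproduced here.

/* Host-only sketch of an MPI-4 partitioned send (illustrative, not a
 * GPU-triggered API).  Each partition is marked ready individually; kernel-
 * or stream-triggered designs move the MPI_Pready step closer to the GPU
 * work that produces the data. */
#include <mpi.h>

#define PARTITIONS 8
#define COUNT_PER_PARTITION 1024

void partitioned_send(double *buf, int peer) {
    MPI_Request req;

    MPI_Psend_init(buf, PARTITIONS, COUNT_PER_PARTITION, MPI_DOUBLE,
                   peer, 0, MPI_COMM_WORLD, MPI_INFO_NULL, &req);
    MPI_Start(&req);
    for (int p = 0; p < PARTITIONS; ++p) {
        /* ... fill partition p (on the GPU in an MPI+X setting) ... */
        MPI_Pready(p, req);   /* partition p may now be transferred */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Request_free(&req);
}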
Stream Support in MPI Without the Churn
Abstract
Accelerators have become a cornerstone of parallel computing, ranging from scientific computing to artificial intelligence. At the application level, accelerators are controlled by submitting work to a stream, from which the hardware executes it. Vendor-specific communication libraries such as NCCL and RCCL have integrated support for submitting communication operations onto a stream to enable ordering of communication and work on streams. It is safe to assume that stream-based computing will remain relevant for the foreseeable future. MPI has yet to catch up to this reality, and prior proposals involved extensions of MPI that would incur significant additions to the API.
In this work, we explore alternatives that involve only minor additions to the standard to enable the integration of MPI operations with compute streams. Our additions include i) associating streams with communication objects, ii) blocking streams until completion, and iii) synchronizing streams while progressing MPI operations. Our API is agnostic to the type of stream, reuses existing communication procedures and semantics, and enables integration with graph capturing. We provide a proof-of-concept implementation and show that stream integration of MPI operations can be beneficial.
Joseph Schuchart, Edgar Gabriel
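A hypothetical sketch of how such minimal additions might look from application code is given below. The identifiers MPIX_Comm_set_stream and MPIX_Stream_sync are placeholders invented for illustration only; they are not the API defined in the paper or in any MPI standard.

/* Hypothetical illustration only: MPIX_Comm_set_stream and MPIX_Stream_sync
 * are placeholder names, not the paper's API.  The flow sketched here follows
 * the abstract: (i) associate a stream with a communication object,
 * (ii) reuse existing nonblocking procedures unchanged, and (iii) synchronize
 * the stream while MPI progresses the operation. */
#include <mpi.h>
#include <cuda_runtime.h>

void stream_ordered_send(MPI_Comm comm, cudaStream_t stream,
                         const double *devbuf, int count, int peer) {
    MPI_Request req;

    MPIX_Comm_set_stream(comm, stream);                          /* (i) hypothetical */
    MPI_Isend(devbuf, count, MPI_DOUBLE, peer, 0, comm, &req);   /* (ii) existing MPI */
    MPIX_Stream_sync(stream, &req, 1);                           /* (iii) hypothetical */
}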

Improvements for MPI

Frontmatter
Improved MPI Collectives for 3D-FFT
Abstract
3-dimensional Fast Fourier Transform (3D FFT) parallel computations are an important part of many scientific calculations. For example, 3D FFT is a critical component of molecular dynamics codes when they compute long-range electrostatics. Parallel distributed 3D FFT computations involve redistributing intermediate data, which constitutes a substantial portion of the overall execution time of these operations. There are two primary methods for handling this communication phase: explicitly packing and unpacking data, or using Message Passing Interface (MPI) derived datatypes. Derived datatypes have the advantage that they are easy to work with and do not require explicit memory pack and unpack operations. As such, we propose enhancements to MPI derived datatypes specifically for 3D FFT calculations that improve upon state-of-the-art methods [8] by using MPI_Type_create_subarray to support arbitrary storage orders. Our method reduces the performance issues associated with MPI derived-datatype solutions and benefits from avoiding strided-memory operations in FFT execution. Results show that our method speeds up even strongly scaled 3D FFT by 1.17x to 1.44x over previous state-of-the-art methods.
Yuang Yan, Natasha Kuk, Ryan E. Grant
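For context, the sketch below shows MPI_Type_create_subarray describing a sub-block of a 3D array, the mechanism the abstract builds on. The dimensions and layout are made up for illustration and do not reproduce the paper's redistribution scheme.

/* Illustrative only: describe one sub-block of a 3D array with a derived
 * datatype, so the transpose phase of a distributed 3D FFT can hand the
 * strided region to MPI without explicit pack/unpack. */
#include <mpi.h>

MPI_Datatype make_subblock_type(void) {
    int sizes[3]    = {64, 64, 64};   /* full local 3D array (example)  */
    int subsizes[3] = {64, 64, 16};   /* sub-block to exchange          */
    int starts[3]   = { 0,  0,  0};   /* offset of the sub-block        */
    MPI_Datatype subblock;

    MPI_Type_create_subarray(3, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &subblock);
    MPI_Type_commit(&subblock);
    return subblock;   /* usable e.g. in MPI_Alltoallw for redistribution */
}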
To Share or Not to Share: A Case for MPI in Shared-Memory
Abstract
The evolution of parallel computing architectures presents new challenges for developing efficient parallelized codes. The emergence of heterogeneous systems has given rise to multiple programming models, each requiring careful adaptation to maximize performance. In this context, we propose reevaluating memory layout designs for computational tasks within larger nodes by comparing various architectures. To gain insight into the performance discrepancies between shared memory and shared-address space settings, we systematically measure the bandwidth between cores and sockets using different methodologies. Our findings reveal significant differences in performance, suggesting that MPI running inside UNIX processes may not fully utilize its intranode bandwidth potential. In light of our work in the MPC thread-based MPI runtime, which can leverage shared memory to achieve higher performance due to its optimized layout, we advocate for enabling the use of shared memory within the MPI standard.
Julien Adam, Jean-Baptiste Besnard, Adrien Roussel, Julien Jaeger, Patrick Carribault, Marc Pérache
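For reference, the closest mechanism in today's standard is the MPI-3 shared-memory window; a minimal sketch follows. It is illustrative only and is not the MPC runtime's approach advocated in the paper.

/* Illustrative only: allocate a window that ranks on the same node can
 * load/store directly, the standard's current route to intranode shared
 * memory (the paper argues for going further than this). */
#include <mpi.h>

double *node_shared_array(MPI_Comm comm, MPI_Aint nelems,
                          MPI_Comm *node_comm, MPI_Win *win) {
    double *base = NULL;

    /* Group ranks that can share memory (i.e., ranks on the same node). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        node_comm);

    /* Every rank contributes a segment; neighbors can query and access it. */
    MPI_Win_allocate_shared(nelems * (MPI_Aint)sizeof(double), sizeof(double),
                            MPI_INFO_NULL, *node_comm, &base, win);
    return base;   /* direct loads/stores after proper synchronization */
}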

MPI Ecosystem

Frontmatter
Dynamic Resource Management for In-Situ Techniques Using MPI-Sessions
Abstract
The computational power of High-Performance Computing (HPC) systems increases continuously and rapidly. Data-intensive applications are designed to leverage the high computational capacity of HPC resources and typically generate large amounts of data for traditional post-processing data analytics. However, the HPC systems' input/output (I/O) subsystem evolves comparatively slowly, and storage capacity is limited. This can limit achievable performance and, ultimately, scientific discovery.
In-situ techniques are a partial remedy to these problems, reducing or avoiding the data flow through the I/O subsystem to and from storage. However, in current practice, asynchronous in-situ techniques with static resource management often allocate separate computing resources for executing in-situ task(s), which remain idle if no in-situ work is at hand.
In the present work, we aim to improve the efficiency of computing resource usage by launching and releasing the additional computing resources needed for in-situ task(s). Our approach is based on extensions for MPI Sessions that enable the required dynamic resource management. In this paper, we propose a basic and an advanced in-situ technique with dynamic resource management enabled by MPI Sessions, their implementation in two real-world use cases, and a critical analysis of the experimental results.
Yi Ju, Dominik Huber, Adalberto Perez, Philipp Ulbl, Stefano Markidis, Philipp Schlatter, Martin Schulz, Martin Schreiber, Erwin Laure
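For reference, here is a minimal sketch of the standard MPI-4 Sessions entry points that dynamic-resource extensions of this kind build upon; the extensions themselves are specific to the paper and are not shown.

/* Illustrative only: plain MPI-4 Sessions usage (no dynamic resource
 * management).  Extensions for growing/shrinking the resource set for
 * in-situ tasks are not part of the standard and are not shown here. */
#include <mpi.h>

MPI_Comm comm_from_world_pset(MPI_Session *session) {
    MPI_Group group;
    MPI_Comm comm;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, session);
    MPI_Group_from_session_pset(*session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "insitu.example.tag",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
    MPI_Group_free(&group);
    return comm;   /* later: MPI_Comm_free, then MPI_Session_finalize */
}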
MPI-BugBench: A Framework for Assessing MPI Correctness Tools
Abstract
MPI’s low-level interface is prone to errors, leading to bugs that can remain dormant for years. MPI correctness tools can aid in writing correct code but lack a standardized benchmark for comparison. This makes it difficult for users to choose the best tool and for developers to gauge their tools’ effectiveness. MPI correctness benchmarks such as MPI-CorrBench, the MPI Bugs Initiative, and RMARaceBench have emerged to address this problem. However, comparability is hindered by having separate benchmarks, and none fully reflects real-world MPI usage patterns. Hence, we present MPI-BugBench, a unified MPI correctness benchmark replacing previous efforts. It addresses the shortcomings of its predecessors by providing a single, standardized test harness for assessing tools, and it incorporates a broader range of real-world MPI usage scenarios. MPI-BugBench is available at https://git-ce.rwth-aachen.de/hpc-public/mpi-bugbench.
Tim Jammer, Emmanuelle Saillard, Simon Schwitanski, Joachim Jenke, Radjasouria Vinayagame, Alexander Hück, Christian Bischof
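As an illustration of the defect classes such benchmarks encode (not a case taken from MPI-BugBench itself), consider the classic send-send deadlock below.

/* Illustrative only: a classic potential deadlock.  Both ranks call a
 * blocking MPI_Send before posting their receive; for messages too large to
 * be buffered eagerly, neither send can complete.  Correctness tools are
 * expected to report this pattern. */
#include <mpi.h>

#define N (1 << 20)

int main(int argc, char **argv) {
    static double buf[N], tmp[N];
    int rank, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;   /* assumes exactly two ranks */

    MPI_Send(buf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(tmp, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}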
Backmatter
Metadata
Title
Recent Advances in the Message Passing Interface
Editors
Claudia Blaas-Schenner
Christoph Niethammer
Tobias Haas
Copyright Year
2025
Electronic ISBN
978-3-031-73370-3
Print ISBN
978-3-031-73369-7
DOI
https://doi.org/10.1007/978-3-031-73370-3
