1 Introduction
2 Asynchronicity and Fault Tolerance
2.1 Abstract Layer for Asynchronicity
Communication in Dune is abstracted by the interface class Dune::Communication and a specific implementation Dune::MPICommunication. The std::future class cannot be used for this purpose. As several future implementations, such as std::future, the task-based TBB::future, and our new MPIFuture, are available, usability greatly benefits from a dynamically typed interface. This is a reasonable approach, as std::future already uses a dynamic interface, and the MPI operations are coarse-grained, so that the additional overhead of virtual function calls is negligible. At the same time the user expects a future to offer value semantics, which contradicts the usual pointer semantics used for dynamic polymorphism. In Exa-Dune we decided to implement type erasure to offer a clean and still flexible user interface. An MPIFuture
is responsible for handling all state associated with an MPI operation. It holds the MPI_Request and MPI_Status to access information on the current operation, and it holds buffer objects which manage the actual data. These buffers offer great additional value: we do not access the raw data directly, but can include data transformations and varying ownership. For example, it is now possible to directly send a std::vector<double>, where the receiver automatically resizes the std::vector according to the incoming data stream.

2.2 Parallel C++ Exception Handling
We extended the Dune::MPIGuard, which previously only implemented the scope-guard concept to detect and react to local exceptions. Our extension revokes the MPI communicator using the ULFM functionality if an exception is detected, so that it is now possible to use communication inside a block with a scope guard. This makes it superfluous to call the finalize and reactivate methods of the MPIGuard before and after each communication. The MPIGuard can thus be used to recover the communicator in a node-loss scenario: an exception that is thrown only on a few ranks in do_something() will not lead to a deadlock, since the MPIGuard revokes the communicator. Details of the implementation and further descriptions are available in a previous publication [18].

We provide the "black-channel" fallback implementation as a standalone version.⁴ This library uses the P-interface of the MPI standard, which makes it possible to redefine MPI functions. At the initialization of the MPI setting, the library creates an opaque communicator, called the blackchannel, on which a pending MPI_Irecv
request is waiting. Once a communicator is revoked, the revoking rank sends messages to the pending blackchannel request. To avoid deadlocks, we use MPI_Waitany to wait for requests, which also listens for the blackchannel request. All blocking communication is redirected to non-blocking calls using the P-interface. The library is linked via LD_PRELOAD, which makes it usable without recompilation, and it can be removed easily once a proper ULFM implementation is available in MPI.

2.3 Compressed In-Memory Checkpointing for Linear Solvers
An exception is detected by the MPIGuard and propagated to all other ranks, so that all ranks will jump to the catch-block.

2.4 Communication Aware Krylov Solvers
We extended the ScalarProduct interface by a function which can be passed multiple pairs of vectors for which the scalar product should be computed. The function returns a Future which contains a std::vector<field_type> once it has finished.

| Method | Required memory | Additional computational effort | Global reductions |
|---|---|---|---|
| PCG | 4N | – | 2 |
| Chronopoulos and Gear | 6N | 1N | 1 |
| Gropp | 6N | 2N | 2 overlapped |
| Ghysels and Vanroose | 10N | 5N | 1 overlapped |
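The interface extension described above can be sketched as follows. This is a minimal illustration, not the actual Dune API: the function name `scalarProducts` is hypothetical, and `std::async` stands in for the non-blocking MPI reduction (e.g. an `MPI_Iallreduce`) that a parallel implementation would issue, so that all local dot products of a batch share a single global reduction.

```cpp
#include <future>
#include <numeric>
#include <utility>
#include <vector>

using field_type = double;
using Vector = std::vector<field_type>;

// Sketch of a batched scalar product: all local dot products are computed
// first, so a single (asynchronous) global reduction suffices for the
// whole batch. std::async stands in for the non-blocking MPI reduction.
std::future<Vector>
scalarProducts(const std::vector<std::pair<Vector, Vector>>& pairs)
{
  Vector local(pairs.size());
  for (std::size_t i = 0; i < pairs.size(); ++i)
    local[i] = std::inner_product(pairs[i].first.begin(), pairs[i].first.end(),
                                  pairs[i].second.begin(), field_type(0));
  // A parallel implementation would start the global reduction here and
  // return its future; sequentially the reduction is the identity.
  return std::async(std::launch::deferred, [local] { return local; });
}
```

A caller can start several scalar products at once and retrieve the results from the returned future only when they are actually needed, e.g. `auto fut = scalarProducts({{x, y}, {x, x}}); /* ... other work ... */ auto values = fut.get();`.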
We compare our implementation with Dune::CGSolver, which is the current CG implementation in Dune. We use an SSOR preconditioner in an additive overlapping Schwarz setup. The problem matrix is generated from a 5-point finite difference model problem. With fewer cores the current implementation is faster than our optimized one, but at higher core counts our optimized version outperforms it. The test was executed on the helics3 cluster of the University of Heidelberg, with 5600 cores on 350 nodes. We expect the speedup to increase further on larger systems, since communication becomes more expensive there. The overlap of communication and computation does not yet come fully into play, since the MPI version currently used does not support it completely.
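The overlap exploited by the communication-hiding variants in the table can be illustrated with a small sketch. This is not the Dune implementation: the function name `overlappedStep` is hypothetical, and `std::async` again stands in for an `MPI_Iallreduce`; with a blocking reduction, the reduction and the local work would serialize instead of overlapping.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Sketch of one overlapped step: the "global" reduction of <r,r> is
// started asynchronously, and communication-free local work (here a
// diagonal preconditioner application) runs while it is in flight.
double overlappedStep(const std::vector<double>& r, std::vector<double>& z)
{
  // Start the reduction (locally just a dot product in this sketch).
  std::future<double> rr = std::async(std::launch::async, [&r] {
    return std::inner_product(r.begin(), r.end(), r.begin(), 0.0);
  });

  // Overlapped local work: apply a (mock) diagonal preconditioner to r.
  for (std::size_t i = 0; i < r.size(); ++i)
    z[i] = 0.5 * r[i];

  return rr.get(); // the reduction result is needed only now
}
```

The benefit grows with the latency of the global reduction, which is why the variants with overlapped reductions pay off mainly at high core counts.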
3 Hardware-Aware, Robust and Scalable Linear Solvers
3.1 Strong Smoothers on the GPU: Fast Approximate Inverses with Conventional and Machine Learning Approaches
3.2 Autotuning with Artificial Neural Networks
3.3 Further Development of Sum-Factorized Matrix-Free DG Methods
3.4 Hybrid Solvers for Discontinuous Galerkin Schemes
3.5 Horizontal Vectorization of Block Krylov Methods
4 Adaptive Multiscale Methods
4.1 Continuous Problem and Discretization
4.2 Model Reduction
4.3 Implementation
The communication can be performed asynchronously using the MPIFuture described in Sect. 2.1. This will allow the rank to continue its own enrichment process until the updated basis is actually needed in a subsequent step.
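This pattern can be sketched as follows. The sketch is illustrative: the function name `enrichWhileReceiving` is hypothetical, the enrichment step is mocked, and `std::async` stands in for the MPIFuture of Sect. 2.1 to keep the example self-contained.

```cpp
#include <chrono>
#include <future>
#include <vector>

// Sketch of hiding the basis exchange behind local work: the pending
// receive is represented by a future, and the rank keeps enriching its
// local basis until the remote data has actually arrived.
std::vector<double>
enrichWhileReceiving(std::future<std::vector<double>> remoteBasis)
{
  std::vector<double> basis;
  // Continue local enrichment while the communication is in flight.
  while (remoteBasis.wait_for(std::chrono::seconds(0)) !=
         std::future_status::ready)
    basis.push_back(1.0); // one (mock) local enrichment step
  // From here on the updated remote basis is needed.
  std::vector<double> remote = remoteBasis.get();
  basis.insert(basis.end(), remote.begin(), remote.end());
  return basis;
}
```

With a blocking receive, the enrichment loop would stall for the whole transfer; polling the future lets useful local work fill that time.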