Parallel Computing

Volume 31, Issue 7, July 2005, Pages 777-796

Latency tolerance through parallelization of time in scientific applications

https://doi.org/10.1016/j.parco.2005.04.008

Abstract

Emerging computing environments, such as the Grid, promise enormous raw computational power. However, effective use of such platforms is often difficult, because conventional spatial decomposition leads to fine granularity, resulting in high communication overhead. We introduce the concept of guided simulations to parallelize along the time domain. Here, we exploit the fact that results of other simulations of closely related problems are typically available. In this approach, we automatically and dynamically determine a relationship between old simulations and the one being performed, and use this relationship to parallelize along the time domain. We demonstrate the validity of this approach by applying the technique to an important application involving molecular dynamics simulation of nanomaterials. In this application, spatial decomposition is not effective due to the small size of the physical system. However, time parallelization is effective, since its granularity is much coarser. We also mention how this approach can be extended to make it inherently fault tolerant.

Introduction

Computation is becoming an increasingly important component of science. Due to increases in computational power, sufficiently realistic solutions can be obtained for an increasing class of problems of practical and commercial importance. However, for realistic solution of a wider class of problems, much greater computational power and better modeling techniques are required. An example of such a class of applications, which serves as a motivation for our work, is the development of nanomaterials and products, based on a fundamental understanding of the physical phenomena that take place at the scale of a nanometer. While the computational power required for such applications is enormous, emerging computing environments, such as the Grid, promise sufficient raw computational power. The problem here is that it is difficult to make effective use of such systems. Thus new computing paradigms are required to make effective use of such environments, and that is the focus of this paper.

The motivation for our work arises from our interest in computing nanomaterial properties using molecular dynamics (MD) simulations. Here, the state of a system is defined by the positions and velocities of a specified set of atoms. We wish to compute the state of the system at different points in time, where the state at a particular time is computed from the state at the previous time step. MD based simulation, possibly in combination with continuum based techniques (e.g. finite element methods), is expected to play a key role in the nanotechnology revolution. However, the use of MD is severely restricted by the extremely small time increment (on the order of a femtosecond, 10^-15 s) in each step of the computation, since this leads to a large number of time steps being required. The physical systems can often be small (that is, having a small number of atoms) or of moderate size, and the computational effort then arises from the large number of time steps required. Conventional spatial decomposition is not effective in this regime, due to its high communication cost, as explained below.
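To make this iterative structure concrete, here is a minimal sketch of a single MD time step using the velocity Verlet integrator, a common choice in MD codes; the paper does not specify its integrator, and the force routine, masses, and array shapes below are placeholders.

```python
import numpy as np

def velocity_verlet_step(pos, vel, forces, masses, dt, compute_forces):
    """Advance atomic positions and velocities by one MD time step.

    pos, vel, forces: (N, 3) arrays; masses: (N,) array; dt is the
    time increment (on the order of 1e-15 s, as noted above).
    compute_forces(pos) returns the (N, 3) forces for a configuration.
    """
    acc = forces / masses[:, None]
    pos = pos + vel * dt + 0.5 * acc * dt ** 2   # positions at t + dt
    new_forces = compute_forces(pos)             # forces at the new positions
    new_acc = new_forces / masses[:, None]
    vel = vel + 0.5 * (acc + new_acc) * dt       # average old and new accelerations
    return pos, vel, new_forces
```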

We next provide a simplified overview of parallelization and the limitations of conventional techniques. Conventional parallelization is by domain decomposition (which we will also refer to as spatial parallelization), where the physical system is distributed across the processors, and each processor is responsible for computing the next state for a portion of the system. However, this computation normally requires access to portions of the system residing on other processors, and so communication is required. Communication costs typically increase in proportion to the surface area of the portion of the system on each processor, while the computation cost increases in proportion to its volume. Since the surface to volume ratio decreases as the volume increases, the communication overhead becomes relatively small for large systems, and such parallelization is then effective. However, it is not effective when the communication latencies are high compared to the computation cost, as in the following situations.

1. When we want to simulate a small system for a very long time period.

2. When massive parallelism causes the volume per processor to become very small.

3. In computing platforms with high communication cost, as frequently encountered in distributed computing environments.

This is an important challenge, and this paper proposes a solution strategy for it, using an important application in computational nanotechnology as an example.
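The surface to volume argument above can be made concrete with a back-of-the-envelope sketch; the cubic subdomain and one-cell-deep halo exchange below are illustrative assumptions, not the paper's model.

```python
def comm_to_comp_ratio(n):
    """Communication-to-computation ratio for an n x n x n cubic
    subdomain with a one-cell-deep halo exchange: communication
    scales with the surface (6 * n^2), computation with the volume (n^3)."""
    return (6 * n ** 2) / n ** 3   # = 6 / n

for n in (100, 10, 2):             # large, moderate, very small subdomains
    print(n, comm_to_comp_ratio(n))
# 100 -> 0.06, 10 -> 0.6, 2 -> 3.0: the smaller the per-processor
# subdomain, the more the computation is dominated by communication.
```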

Functional decomposition, where the data is replicated, but each processor performs a different, independent, action on the data, makes the computation more coarse grained, and consequently reduces the communication overhead. It is thus attractive for the three cases mentioned above (small physical systems, massive parallelism, and high communication cost environments). It is, however, difficult to find many such independent operations that can be simultaneously performed on the system. Usually only a small number of such operations can be obtained, and this does not scale with the number of processors.

Our solution strategy is based on the notion of scalable functional decomposition. Here, we develop techniques that allow the number of independent operations to increase with the size of the parallel system. We use time parallelization to accomplish this.

Here, we decompose the time domain. Different processors compute the state of the physical system at different points in time. Since each of these computations is an independent set of operations on the same physical system, we have a functional decomposition that scales with the number of processors. The difficulty is that the computation of the states is defined iteratively, with the result for time t_{i-1} being needed to compute the state at time t_i. So it is not clear how each processor can independently and simultaneously compute the state for a different point in time. The parareal scheme introduced by Baffico et al. [1] proposed an approach to this problem. However, the speedup and efficiency obtained were limited, for reasons that we explain later. In this paper, we present our approach, which overcomes the bottlenecks faced by Baffico's approach. It is based on the idea of guided simulations, where the results of old simulations are used to speed up future simulations.

There are two aspects to this work: (i) the general approach, based on the concept of guided simulations, in which a relationship is automatically and dynamically determined between a previously completed simulation and the simulation being performed, and this relationship is used to parallelize the current simulation; and (ii) the specific techniques used to determine a relationship between different simulations, which are application dependent to a large extent.

Though the mathematical details of the technique vary from problem to problem, the conceptual framework remains unaltered, and is not restricted to the specific application considered here. So we first describe the general features of the class of relevant scientific problems in Section 2. We then present the parareal approach and its limitations in Section 3. We describe our approach to time parallelization, using guided simulations, in Section 4. This is followed by a discussion of our application in Section 5, the description of our method for predicting the relationship between simulations in Section 6, and experimental results in Section 7. We summarize our conclusions in Section 8, and also mention how our technique can be extended to make it inherently fault tolerant.

Section snippets

The class of scientific applications considered

A large class of scientific applications follows the general paradigm of tracking the evolution of the state of a system with time, where the change in state is modeled through a differential equation, along with boundary/initial conditions. For example, we can describe the model generally as dS/dt = f(S, t), where S is the state, t is time, f is some function of S and t, and dS/dt gives the rate of change of the state. The physical domain where the equation is applicable is prescribed, along
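As a concrete instance of this paradigm, the sketch below integrates dS/dt = f(S, t) with the forward Euler method (the function f, step size, and scalar state are illustrative choices, not the paper's model); it makes explicit the sequential dependence of each state on the previous one that time parallelization must break.

```python
def evolve(f, s0, t0, dt, num_steps):
    """Track the state S of dS/dt = f(S, t) from the initial state s0.

    Forward Euler: S(t + dt) ~ S(t) + dt * f(S(t), t).  Each state
    depends on the previous one, so this loop is inherently sequential.
    """
    s, t = s0, t0
    states = [s]
    for _ in range(num_steps):
        s = s + dt * f(s, t)
        t += dt
        states.append(s)
    return states

# Example: exponential decay dS/dt = -S, whose exact solution is exp(-t).
states = evolve(lambda s, t: -s, s0=1.0, t0=0.0, dt=0.01, num_steps=100)
```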

The parareal approach

Time parallelization in this strategy is accomplished through an iterative process as illustrated in Fig. 1.

Let us suppose that we want to compute the state for P time steps. The computation will involve P processors with ranks in [1, P]. Each processor i wishes to start simulating with the state of the system at some time t_{i-1} and evolve it to the correct state at some time t_i, by performing the computation specified by F of Eq. (1). If they can do this, then each processor would have performed
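For reference, the standard parareal iteration can be sketched as follows; coarse_step (the cheap, inaccurate propagator G) and fine_step (the accurate propagator F, applied over one interval) are placeholders, and states are assumed to support addition and subtraction (e.g. numpy arrays).

```python
def parareal(u0, num_intervals, num_iters, coarse_step, fine_step):
    """Standard parareal iteration over num_intervals time intervals.

    coarse_step(u): cheap, inaccurate propagator G over one interval.
    fine_step(u): accurate propagator F over one interval (run in
    parallel across processors in an actual implementation).
    u[i] approximates the state at time t_i.
    """
    # Sequential coarse sweep gives the initial guess.
    u = [u0]
    for i in range(num_intervals):
        u.append(coarse_step(u[i]))

    for _ in range(num_iters):
        # The expensive fine solves are independent of each other.
        fine = [fine_step(u[i]) for i in range(num_intervals)]
        g_old = [coarse_step(u[i]) for i in range(num_intervals)]
        # Sequential correction: u[i+1] = G(u[i]) + F(u_old[i]) - G(u_old[i]).
        u_new = [u0]
        for i in range(num_intervals):
            u_new.append(coarse_step(u_new[i]) + fine[i] - g_old[i])
        u = u_new
    return u
```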

Guided simulations for time parallelization

As in the parareal approach, different processors compute the state of the physical system at different points in time. (However, we divide the time period to be simulated into intervals, such that the number of time intervals simulated is much greater than the number of processors available. The choice of the size of each interval involves a trade-off between prediction efficiency and communication costs, and is explained toward the end of this section.) Our approach is to dynamically
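A rough sketch of a prediction-and-verification loop consistent with this description is given below; predict, fine_simulate, close, and the batching policy are hypothetical stand-ins for the application-specific components described in later sections, not the paper's actual interfaces.

```python
def guided_time_parallel(s0, num_intervals, procs, predict, fine_simulate, close):
    """Rough sketch of time parallelization via guided simulations.

    predict(i): proposed state at the start of interval i, obtained from
    a dynamically determined relationship to an old (base) simulation.
    fine_simulate(s): exact evolution of a state across one interval.
    close(a, b): whether two states agree to the desired accuracy.
    """
    state = s0   # last verified state
    i = 0        # next interval to compute
    while i < num_intervals:
        batch = list(range(i, min(i + procs, num_intervals)))
        # The first interval of the batch starts from the verified state;
        # later intervals start from predictions.  The fine runs are
        # mutually independent, so each can execute on its own processor.
        starts = [state] + [predict(j) for j in batch[1:]]
        ends = [fine_simulate(s) for s in starts]
        # Accept results as long as each fine result agrees with the
        # prediction that the next interval started from.
        for k, j in enumerate(batch):
            state = ends[k]
            i = j + 1
            if k + 1 < len(starts) and not close(ends[k], starts[k + 1]):
                break   # misprediction: discard the later intervals and retry
    return state
```

In the full method, the relationship used by the predictor is itself refined as verified results accumulate; that refinement is omitted from this sketch.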

Specific problem

We consider a carbon nanotube (CNT) that has one end fixed, and the other end moved at a constant velocity u as shown in Fig. 3. Different simulations use different velocities. This is a common type of simulation to test the strength and elastic modulus of a material [2]. We wish to use as small a nanotube as possible, since our aim is to achieve as large a time scale as possible. Our application (a multiscale model in which the CNT properties, describing the nanoscale behavior of the CNT, are

A predictor for the CNT application

In this predictor, we predict the change in each coordinate independently. In the description here, we normalize all the coordinates so that they are in [0, 1], by letting the origin be 0 and then dividing by the length of the CNT along that coordinate direction. It is easy to change between the actual and normalized coordinates. Using the normalized coordinates is advantageous, because it enables us to use base and current simulations that use CNTs of different sizes. Let x_t represent a
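A minimal sketch of this normalization follows, assuming the origin is taken to be the minimum coordinate along each direction (the snippet does not specify how the origin is chosen):

```python
import numpy as np

def normalize(coords):
    """Map raw (N, 3) atomic coordinates to [0, 1] along each direction:
    shift so the origin becomes 0, then divide by the CNT's extent
    along that direction."""
    origin = coords.min(axis=0)            # assumed choice of origin
    length = coords.max(axis=0) - origin   # CNT length per direction
    return (coords - origin) / length, origin, length

def denormalize(norm_coords, origin, length):
    """Recover the actual coordinates from the normalized ones."""
    return norm_coords * length + origin
```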

Experimental parameters

Both the base and the current simulations pull one end of a carbon nanotube containing 1000 atoms at a constant velocity, with different velocities used in the base and the current simulations. The predictor, of course, did not make use of the fact that the same number of atoms was simulated. The Tersoff–Brenner potential is used in the MD simulations. Use of this potential is more computationally intensive than many other potentials, but it is known for its accuracy in carbon nanotube

Conclusions and future work

We have proposed a strategy that enables effective use of high-latency environments for problems where a typical spatial decomposition would lead to too fine a granularity. We have demonstrated its effectiveness on a real application of great technological significance.

This method can be combined with spatial parallelization for even greater effectiveness. Of course, further work is needed to extend the class of applications for which this technique can be used.

We note that

Acknowledgements

This work was funded by NSF grant # CMS-0403746. We also wish to thank Sri S.S. Baba for help with debugging the code, among other things.


Cited by (14)

  • Communication-aware adaptive Parareal with application to a nonlinear hyperbolic system of partial differential equations

    2018, Journal of Computational Physics
Citation Excerpt:

    In that case the communication pattern becomes dominated by a few large time-subdomain interfaces that must be communicated between node-groups. The potential of this latency tolerant communication pattern was investigated in early papers by Srinivasan and Chandra [32], Xiao and Aubanel [37]. Unfortunately convergence is slow for most problems when using parallel-in-time integration of long intervals with the standard Parareal algorithm as has been demonstrated in many early papers, see [21] for an overview, and later established rigorously in [14].

  • Hybrid dynamic iterations for the solution of initial value problems

    2011, Journal of Parallel and Distributed Computing
Citation Excerpt:

    The Parareal method can be considered an example of this [1,11]. Dynamic data-driven time parallelization [15,16,22] uses results from prior, related, runs to construct a coarse-grained model. Conventional parallelization of ODEs is through distribution of the state space, as mentioned earlier.

  • Scheduling of tasks in the parareal algorithm

    2011, Parallel Computing
Citation Excerpt:

    This is due to a fundamental tradeoff between the reduction of the time required for an inherently sequential part of the algorithm and an increase in the number of iterations required to converge. This limitation has motivated the search for alternatives, such as an approach based on the relationship between old and new simulations presented by Srinivasan and Chandra in 2005 [16]. However, previous analysis of the parareal algorithm in the literature [12] did not consider the efficient scheduling of tasks.

  • Fault tolerant algorithms for heat transfer problems

    2008, Journal of Parallel and Distributed Computing