Parallel Computing

Volume 31, Issue 7, July 2005, Pages 777-796

Latency tolerance through parallelization of time in scientific applications

https://doi.org/10.1016/j.parco.2005.04.008

Abstract

Emerging computing environments, such as the Grid, promise enormous raw computational power. However, effective use of such platforms is often difficult, because conventional spatial decomposition leads to fine granularity, resulting in high communication overhead. We introduce the concept of guided simulations to parallelize along the time domain. Here, we exploit the fact that results of other simulations of closely related problems are typically available. In this approach, we automatically and dynamically determine a relationship between old simulations and the one being performed, and use this relationship to parallelize along the time domain. We demonstrate the validity of this approach by applying the technique to an important application involving molecular dynamics simulation of nanomaterials. In this application, spatial decomposition is not effective due to the small size of the physical system. However, time parallelization is effective, since its granularity is much coarser. We also mention how this approach can be extended to make it inherently fault tolerant.

Introduction

Computation is becoming an increasingly important component of science. Due to increases in computational power, sufficiently realistic solutions can be obtained for an increasing class of problems of practical and commercial importance. However, for realistic solution of a wider class of problems, much greater computational power and better modeling techniques are required. An example of such a class of applications, which serves as a motivation for our work, is the development of nanomaterials and products, based on a fundamental understanding of the physical phenomena that take place at the scale of a nanometer. While the computational power required for such applications is enormous, emerging computing environments, such as the Grid, promise sufficient raw computational power. The problem here is that it is difficult to make effective use of such systems. Thus new computing paradigms are required to make effective use of such environments, and that is the focus of this paper.

The motivation for our work arises from our interest in computing nanomaterial properties using molecular dynamics (MD) simulations. Here, the state of a system is defined by the positions and velocities of a specified set of atoms. We wish to compute the state of the system at different points in time, where the state at a particular time is computed from the state at the previous time step. MD based simulation, possibly in combination with continuum based techniques (e.g. finite element methods), is expected to play a key role in the nanotechnology revolution. However, the use of MD is severely restricted by the extremely small time increment (on the order of a femtosecond, 10^-15 s) in each step of the computation, since this leads to a large number of time steps being required. The physical systems can often be small (that is, having a small number of atoms) or of moderate size, and the computational effort then arises from the large number of time steps required. Conventional spatial decomposition is not effective in this regime, due to its high communication cost, as explained below.
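To make this iterative structure concrete, here is a minimal sketch of a single MD time step using the velocity Verlet integrator, a common choice in MD codes; the paper does not specify its integrator, and the force routine, masses, and array shapes below are placeholders.

```python
import numpy as np

def velocity_verlet_step(pos, vel, forces, masses, dt, compute_forces):
    """Advance atomic positions and velocities by one MD time step.

    pos, vel, forces: (N, 3) arrays; masses: (N,) array; dt is the
    time increment (on the order of 1e-15 s, as noted above).
    compute_forces(pos) returns the (N, 3) forces for a configuration.
    """
    acc = forces / masses[:, None]
    pos = pos + vel * dt + 0.5 * acc * dt ** 2   # positions at t + dt
    new_forces = compute_forces(pos)             # forces at the new positions
    new_acc = new_forces / masses[:, None]
    vel = vel + 0.5 * (acc + new_acc) * dt       # average old and new accelerations
    return pos, vel, new_forces
```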

We next provide a simplified overview of parallelization and the limitations of conventional techniques. Conventional parallelization is by domain decomposition (which we will also refer to as spatial parallelization), where the physical system is distributed across the processors, and each processor is responsible for computing the next state for a portion of the system. However, this computation normally requires access to portions of the system residing on other processors, and so communication is required. Communication costs typically increase in proportion to the surface area of the portion of the system on each processor, while the computation cost increases in proportion to its volume. Since the surface to volume ratio decreases as the volume increases, the communication overhead becomes relatively small for large systems, and such parallelization is then effective. However, it is not effective when the communication latencies are high compared to the computation cost, as in the following situations.

1. When we want to simulate a small system for a very long time period.

2. When massive parallelism causes the volume per processor to become very small.

3. In computing platforms with high communication cost, as frequently encountered in distributed computing environments.

This is an important challenge, and this paper proposes a solution strategy for it, using an important application in computational nanotechnology as an example.
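The surface to volume argument above can be made concrete with a back-of-the-envelope sketch; the cubic subdomain and one-cell-deep halo exchange below are illustrative assumptions, not the paper's model.

```python
def comm_to_comp_ratio(n):
    """Communication-to-computation ratio for an n x n x n cubic
    subdomain with a one-cell-deep halo exchange: communication
    scales with the surface (6 * n^2), computation with the volume (n^3)."""
    return (6 * n ** 2) / n ** 3   # = 6 / n

for n in (100, 10, 2):             # large, moderate, very small subdomains
    print(n, comm_to_comp_ratio(n))
# 100 -> 0.06, 10 -> 0.6, 2 -> 3.0: the smaller the per-processor
# subdomain, the more the computation is dominated by communication.
```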

Functional decomposition, where the data is replicated, but each processor performs a different, independent, action on the data, makes the computation more coarse grained, and consequently reduces the communication overhead. It is thus attractive for the three cases mentioned above (small physical systems, massive parallelism, and high communication cost environments). It is, however, difficult to find many such independent operations that can be simultaneously performed on the system. Usually only a small number of such operations can be obtained, and this does not scale with the number of processors.

Our solution strategy is based on the notion of scalable functional decomposition. Here, we develop techniques that allow the number of independent operations to increase with the size of the parallel system. We use time parallelization to accomplish this.

Here, we decompose the time domain. Different processors compute the state of the physical system at different points in time. Since each of these computations is an independent set of operations on the same physical system, we have a functional decomposition that scales with the number of processors. The difficulty is that the computation of the states is defined iteratively, with the result for time t_{i-1} being needed to compute the state at time t_i. So it is not clear how each processor can independently and simultaneously compute the state for a different point in time. The parareal scheme introduced by Baffico et al. [1] proposed an approach to this problem. However, the speedup and efficiency obtained were limited, for reasons that we explain later. In this paper, we present our approach, which overcomes the bottlenecks faced by Baffico's approach. It is based on the idea of guided simulations, where the results of old simulations are used to speed up future simulations.

There are two aspects to this work: (i) the general approach, based on the concept of guided simulations, in which a relationship is automatically and dynamically determined between a previously completed simulation and the simulation being performed, and this relationship is used to parallelize the current simulation; and (ii) the specific techniques used to determine a relationship between different simulations, which are application dependent to a large extent.

Though the mathematical details of the technique vary from problem to problem, the conceptual framework remains unaltered, and is not restricted to the specific application considered here. So we first describe the general features of the class of relevant scientific problems in Section 2. We then present the parareal approach and its limitations in Section 3. We describe our approach to time parallelization, using guided simulations, in Section 4. This is followed by a discussion of our application in Section 5, the description of our method for predicting the relationship between simulations in Section 6, and experimental results in Section 7. We summarize our conclusions in Section 8, and also mention how our technique can be extended to make it inherently fault tolerant.

Section snippets

The class of scientific applications considered

A large class of scientific applications follows the general paradigm of tracking the evolution of the state of a system with time, where the change in state is modeled through a differential equation, along with boundary/initial conditions. For example, we can describe the model generally as dS/dt = f(S, t), where S is the state, t is time, f is some function of S and t, and dS/dt gives the rate of change of the state. The physical domain where the equation is applicable is prescribed, along
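As a concrete instance of this paradigm, the sketch below integrates dS/dt = f(S, t) with the forward Euler method (the function f, step size, and scalar state are illustrative choices, not the paper's model); it makes explicit the sequential dependence of each state on the previous one that time parallelization must break.

```python
def evolve(f, s0, t0, dt, num_steps):
    """Track the state S of dS/dt = f(S, t) from the initial state s0.

    Forward Euler: S(t + dt) ~ S(t) + dt * f(S(t), t).  Each state
    depends on the previous one, so this loop is inherently sequential.
    """
    s, t = s0, t0
    states = [s]
    for _ in range(num_steps):
        s = s + dt * f(s, t)
        t += dt
        states.append(s)
    return states

# Example: exponential decay dS/dt = -S, whose exact solution is exp(-t).
states = evolve(lambda s, t: -s, s0=1.0, t0=0.0, dt=0.01, num_steps=100)
```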

The parareal approach

Time parallelization in this strategy is accomplished through an iterative process as illustrated in Fig. 1.

Let us suppose that we want to compute the state for P time steps. The computation will involve P processors with ranks in [1, P]. Each processor i wishes to start simulating with the state of the system at some time t_{i-1} and evolve it to the correct state at some time t_i, by performing the computation specified by F of Eq. (1). If they can do this, then each processor would have performed
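For reference, the standard parareal iteration can be sketched as follows; coarse_step (the cheap, inaccurate propagator G) and fine_step (the accurate propagator F, applied over one interval) are placeholders, and states are assumed to support addition and subtraction (e.g. numpy arrays).

```python
def parareal(u0, num_intervals, num_iters, coarse_step, fine_step):
    """Standard parareal iteration over num_intervals time intervals.

    coarse_step(u): cheap, inaccurate propagator G over one interval.
    fine_step(u): accurate propagator F over one interval (run in
    parallel across processors in an actual implementation).
    u[i] approximates the state at time t_i.
    """
    # Sequential coarse sweep gives the initial guess.
    u = [u0]
    for i in range(num_intervals):
        u.append(coarse_step(u[i]))

    for _ in range(num_iters):
        # The expensive fine solves are independent of each other.
        fine = [fine_step(u[i]) for i in range(num_intervals)]
        g_old = [coarse_step(u[i]) for i in range(num_intervals)]
        # Sequential correction: u[i+1] = G(u[i]) + F(u_old[i]) - G(u_old[i]).
        u_new = [u0]
        for i in range(num_intervals):
            u_new.append(coarse_step(u_new[i]) + fine[i] - g_old[i])
        u = u_new
    return u
```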

Guided simulations for time parallelization

As in the parareal approach, different processors compute the state of the physical system at different points in time. (However, we divide the time period to be simulated into intervals, such that the number of time intervals simulated is much greater than the number of processors available. The choice of the size of each interval involves a trade-off between prediction efficiency and communication costs, and is explained toward the end of this section.) Our approach is to dynamically
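A rough sketch of a prediction-and-verification loop consistent with this description is given below; predict, fine_simulate, close, and the batching policy are hypothetical stand-ins for the application-specific components described in later sections, not the paper's actual interfaces.

```python
def guided_time_parallel(s0, num_intervals, procs, predict, fine_simulate, close):
    """Rough sketch of time parallelization via guided simulations.

    predict(i): proposed state at the start of interval i, obtained from
    a dynamically determined relationship to an old (base) simulation.
    fine_simulate(s): exact evolution of a state across one interval.
    close(a, b): whether two states agree to the desired accuracy.
    """
    state = s0   # last verified state
    i = 0        # next interval to compute
    while i < num_intervals:
        batch = list(range(i, min(i + procs, num_intervals)))
        # The first interval of the batch starts from the verified state;
        # later intervals start from predictions.  The fine runs are
        # mutually independent, so each can execute on its own processor.
        starts = [state] + [predict(j) for j in batch[1:]]
        ends = [fine_simulate(s) for s in starts]
        # Accept results as long as each fine result agrees with the
        # prediction that the next interval started from.
        for k, j in enumerate(batch):
            state = ends[k]
            i = j + 1
            if k + 1 < len(starts) and not close(ends[k], starts[k + 1]):
                break   # misprediction: discard the later intervals and retry
    return state
```

In the full method, the relationship used by the predictor is itself refined as verified results accumulate; that refinement is omitted from this sketch.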

Specific problem

We consider a carbon nanotube (CNT) that has one end fixed, and the other end moved at a constant velocity u as shown in Fig. 3. Different simulations use different velocities. This is a common type of simulation to test the strength and elastic modulus of a material [2]. We wish to use as small a nanotube as possible, since our aim is to achieve as large a time scale as possible. Our application (a multiscale model in which the CNT properties, describing the nanoscale behavior of the CNT, are

A predictor for the CNT application

In this predictor, we predict the change in each coordinate independently. In the description here, we normalize all the coordinates so that they are in [0, 1], by letting the origin be 0 and then dividing by the length of the CNT along that coordinate direction. It is easy to change between the actual and normalized coordinates. Using the normalized coordinates is advantageous, because it enables us to use base and current simulations that use CNTs of different sizes. Let x_t represent a
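A minimal sketch of this normalization follows, assuming the origin is taken to be the minimum coordinate along each direction (the snippet does not specify how the origin is chosen):

```python
import numpy as np

def normalize(coords):
    """Map raw (N, 3) atomic coordinates to [0, 1] along each direction:
    shift so the origin becomes 0, then divide by the CNT's extent
    along that direction."""
    origin = coords.min(axis=0)            # assumed choice of origin
    length = coords.max(axis=0) - origin   # CNT length per direction
    return (coords - origin) / length, origin, length

def denormalize(norm_coords, origin, length):
    """Recover the actual coordinates from the normalized ones."""
    return norm_coords * length + origin
```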

Experimental parameters

Both the base and the current simulations pull one end of a carbon nanotube containing 1000 atoms at a constant velocity, with different velocities used in the base and the current simulations. The predictor, of course, did not make use of the fact that the same number of atoms was simulated. The Tersoff–Brenner potential is used in the MD simulations. Use of this potential is more computationally intensive than many other potentials, but it is known for its accuracy in carbon nanotube

Conclusions and future work

We have proposed a strategy that enables effective use of high-latency environments for problems where a typical spatial decomposition would lead to too fine a granularity. We have demonstrated its effectiveness on a real application of great technological significance.

This method can be combined with spatial parallelization for even greater effectiveness. Of course, further work is needed to extend the class of applications for which this technique can be used.

We note that

Acknowledgements

This work was funded by NSF grant # CMS-0403746. We also wish to thank Sri S.S. Baba for help with debugging the code, among other things.


Cited by (14)

  • Communication-aware adaptive Parareal with application to a nonlinear hyperbolic system of partial differential equations

    2018, Journal of Computational Physics
Citation Excerpt:

    In that case the communication pattern becomes dominated by a few large time-subdomain interfaces that must be communicated between node-groups. The potential of this latency tolerant communication pattern was investigated in early papers by Srinivasan and Chandra [32], Xiao and Aubanel [37]. Unfortunately convergence is slow for most problems when using parallel-in-time integration of long intervals with the standard Parareal algorithm as has been demonstrated in many early papers, see [21] for an overview, and later established rigorously in [14].

  • Hybrid dynamic iterations for the solution of initial value problems

    2011, Journal of Parallel and Distributed Computing
Citation Excerpt:

    The Parareal method can be considered an example of this [1,11]. Dynamic data-driven time parallelization [15,16,22] uses results from prior, related, runs to construct a coarse-grained model. Conventional parallelization of ODEs is through distribution of the state space, as mentioned earlier.

  • Scheduling of tasks in the parareal algorithm

    2011, Parallel Computing
Citation Excerpt:

    This is due to a fundamental tradeoff between the reduction of the time required for an inherently sequential part of the algorithm and an increase in the number of iterations required to converge. This limitation has motivated the search for alternatives, such as an approach based on the relationship between old and new simulations presented by Srinivasan and Chandra in 2005 [16]. However, previous analysis of the parareal algorithm in the literature [12] did not consider the efficient scheduling of tasks.

  • Fault tolerant algorithms for heat transfer problems

    2008, Journal of Parallel and Distributed Computing