Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures

Ciznicki, Milosz; Kurowski, Krzysztof; Weglarz, Jan

doi:10.1007/s10586-016-0686-2

Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures

Open access
Published: 29 November 2016

Volume 20, pages 2535–2549, (2017)
Cite this article

Download PDF

You have full access to this open access article

Cluster Computing Aims and scope Submit manuscript

Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures

Download PDF

Milosz Ciznicki^1,2,
Krzysztof Kurowski² &
Jan Weglarz¹

1836 Accesses
9 Citations
Explore all metrics

Abstract

Performance of high-end supercomputers will reach the exascale through the advent of core counts in billions. However, in the upcoming exascale computing era it is important not only to focus on the performance, but also on scalability of fine-grained parallel applications, data locality and energy aware scheduling within the parallel code. In fact, parallel applications need to change even now by redesigning algorithms and data structures respectively to take advantage of the recent improvements in energy efficiency of heterogeneous computing hardware, including multicore processors and GPU accelerators. Over the next few years one of the biggest challenges for exascale will be the ability of parallel applications to fully exploit locality which will, in turn, be required to achieve expected performance and energy efficiency. Future highly parallel applications will have to deal with deep memory hierarchies taking into account energy cost in moving data off-chip. Therefore, they will have to apply new coordinated scheduling approaches to balance energy aware resource utilization and minimize work starvation during runtime. As new constraints and limits on memory bandwidth and energy will play a key role in high performance computing (HPC) in the future, more sophisticated and dynamic scheduling techniques will be needed and applied within the parallel code. In this paper we focus on an energy-aware distribution of the stencil workload on heterogeneous processors. Our analysis of energy and performance models focused on relevant class of stencil computations to explore the relationship between task scheduling algorithms and energy constraints. More precisely, we search for a schedule which minimizes the energy usage within a specified computation’s deadline of the stencil workload on heterogeneous architectures. Since the problem is computationally intractable, we present an integer linear programming formulation for finding optimal schedules. As finding optimal schedules is time consuming we have developed four heuristics and tested them experimentally with respect to optimal solutions. In our work we focus on a single node configurations with heterogeneous processors. These configurations represent the state of the art multi- and many-core architectures.

Runtime and energy constrained work scheduling for heterogeneous systems

Article 16 May 2022

Scalable Parallelization of Stencils Using MODA

Performance Engineering and Energy Efficiency of Building Blocks for Large, Sparse Eigenvalue Computations on Heterogeneous Supercomputers

1 Introduction

Stencil computations as relevant class of applications occur in many HPC codes on block-structured grids for modelling various physical phenomena, e.g. for computational fluid dynamics, geometric modelling, solving partial differential equations or image and video processing [1,2,3,4,5]. As computing time and memory usage grow linearly with the number of array elements in stencil computations our research targets highly parallel implementations of stencil codes together with task scheduling and optimization techniques taking into consideration energy cost and data locality [6,7,8,9,10]. We have proved during our experimental studies that recent changes introduced in heterogeneous computing hardware resulted in different performance and energy characteristics that are critical for highly efficient and scalable stencil computations [11]. As shown in [12, 13], the overall performance of stencil computations is memory bound. One should note that many existing HPC architectures mainly focus on floating point performance [14]. However, only a partial and limited usage of the floating point units in a given computing architecture is possible today and may reduce energy cost without the performance degradation. Moreover, many latest improvements introduced in dynamic power management policies at the hardware level, e.g. dynamic voltage and frequency scaling (DVFS) or even switching off an entire unit block of a chip (clock gating), can lead to significant reduction in the energy required for memory-bound workloads. Advanced dynamic power management policies give new opportunities for scheduling tasks within the fine-grained parallel code as users are able to control the utilization of various functional units in heterogeneous computing hardware, e.g. turn on and off dynamically individual cores, change on-demand the frequency of a small processing and communication units or even put portions of cache memory at specific sleep states during runtime.

In our previous work [15] we performed an exhaustive evaluation of the key characteristics that have a relevant impact on the performance and energy usage of a stencil computation running on a certain processing unit. Based on these characteristics in this article, we present an energy-aware ILP model that distributes stencil computations to heterogeneous processors and minimizes the schedule energy cost while meeting the computation’s deadline. The distribution of stencil computations is done on the blocks obtained from the decomposition of the computational domain. The computational domain is a Cartesian grid on which the stencil computations are defined. The optimization space of the model shows that the best strategy depends not only on load balancing the problem size between the processing units, the processing units specification, and the stencils employed, but also on detailed mapping of the communication dependencies of the blocks to the communication topology of respective processing units. No previous work has attempted to account for the time and energy simultaneously in the context of the distribution of the stencil computations between processing units. We also developed new heuristics that schedule example workloads in real time. The developed heuristics attempt to include the communication overhead in the distribution process. The described algorithms have been tested experimental using the state of the art multi- and many-core architectures. In our work we focus the experiments on a single node configurations with heterogeneous processors.

The paper is organized as follows. In Sect. 2 the related work is discussed. The key properties that have an influence on energy usage are defined in Sect. 3. The scheduling problem is introduced in Sect. 4. Performance and energy models are introduced in Sect. 5. Section 6 describes the integer linear programming (ILP) model. The dynamic scheduling policies are described in Sect. 7. Section 8 presents experiments using a 3D Laplacian stencil defined on different grid topologies using several CPU–GPU configurations. Section 9 concludes our experiments and presents a future work.

2 Related work

In general, considered stencil calculations perform global sweeps through data structures that are typically much larger than the capacity of the available data caches available within processing units. Additionally, accessing data in main memory within the hardware is not fast enough and we often have to deal with the traffic between local cache and main memory. Therefore, many researchers have already tried to exploit data locality in stencil computations by performing operations on cache-sized blocks of data after domain decomposition [16], after time decomposition [17] or proposed cache-aware optimisation algorithms on many-core modern processors [18].

In consequence, there exist frameworks that try to ease the implementation of the stencil calculations. The user writes single stencil code in a framework’s specific language which during a compilation is translated to a target architecture. The frameworks distribute the computations to employ multiple processors. The distribution involves the decomposition of the Cartesian domain to overlapping blocks. The overlap, called halo region, is needed to correctly update the decomposed block on borders. Each block is updated by a single processor. The minimal size of the overlap depends on the stencil pattern. The stencil pattern defines which neighbouring points are used during stencil computations. For example, Physis [19] uniformly decomposes the global domain over all the accelerators as instructed by a user-controllable parameter. The user has to experimentally determine which decomposition provides the highest performance. The framework focuses only on the GPU architecture. Similarly, work in [1] utilises a simple decomposition method with uniform partition where each processor and accelerator receives blocks of the same size. On the other hand, authors in [20] provide a method that allows programmers to partition the data contiguously between CPU and GPU within a single node. Unlike our work, their approach does not allow to find an optimal distribution of the domain between heterogeneous architectures in terms of time and energy costs. What is more, there is a lack of careful analyses of stencil optimizations and performance modelling connecting specific properties such as communication and locality with architectural time and energy costs.

Moreover, performance and energy models for modern heterogeneous computing architectures incorporating specialized processing capabilities should be flexible and extendable to explore recent properties of heterogeneous hardware units. A good example is the roofline model which allows a programmer to model, predict, and analyse an individual kernel performance given an architecture communication and computation capabilities [21]. In this approach an application is modelled simply by the ratio of useful operations to memory operations. The roofline model can predict the performance of a simple von Neumann architecture with two levels of memory as well as the more complex design with a multi-level memory hierarchy. It has been successfully used to model the performance of many applications on the multi-core and many-core processors [22]. Recently, it has been extended to model the energy consumption in GPUs [23]. In the new model the authors have assumed that each operation has a fixed energy cost and a fixed data movement cost while the constant energy cost is linear in time. The constant power depends on both a hardware and an algorithm and includes both static and leakage power management. However, the proposed model does not include dynamic power management by charging and discharging gate capacitance. The authors assumed that time per work (arithmetic) operation and time per memory operation are estimated with the hardware peak throughput values, whereas the energy cost is estimated using a linear regression based on real experiments. Another set of extensions to the roofline model have been proposed in [24] to model energy on dual multi-core CPU with three-level cache hierarchy. In this approach the dynamic power management was modelled as a second degree polynomial, based on real benchmark data, that scales linearly with the number of active cores up to the saturation point. The authors assumed that the dynamic power depends quadratically on the frequency. In the saturation point the energy to solution grows with the number of used cores, that is proportional to dynamic power, while the time to solution stays constant. In our article we are providing two examples of architectures CPUs and GPUs. However, the presented model can be utilized with other architectures as well, for instance Intel Xeon Phi or ARM. Antoher example is an energy model presented in [25] to evaluate the cost of parallel algorithms for GPU. Based on the energy model they propose the method for the energy scalability to easy the selection of the optimal number of blocks.

3 Stencil properties

In our previous work [15] we experimentally discovered the key characteristics that have a relevant impact on the performance and energy usage of a stencil computation running on a certain processing unit (PU). We tested the performance and energy usage of an example 3D Laplacian stencil on eight core Intel Xeon E5-2670@2.6GHz CPU and Kepler K20m GPU using multiple of frequency and voltage pairs, called P-states. Firstly, the maximum performance can be reached with a lower number of cores than available. Secondly, to minimize the energy usage it is more important to reduce the frequency than the number of cores used. What is more, in case of CPU, DRAM may use up to $60\%$ of the energy. Thus, the data movement consumes the most of the power. Finally, the lowest energy usage may be reached with not the maximum performance. To summarise the analysis, the stencil computation $u\in \mathcal {T}$, called task, is described by the following parameters:

1.
The number of arithmetic operations per grid point $W_{u,p}$ on a processor p,
2.
The number of required bytes to update a grid point $Q_{u,p}$ on a processor p,
3.
The block dimensions $d_{u}=[d_{u}^x,d_{u}^y,d_{u}^z]^{T}$.

The processor $p\in \mathcal {P}$ has following properties:

1.
The set of available frequencies $\mathcal {F}{=}\{f_{p1},f_{p2},\ldots ,f_{pn}\}$,
2.
The set of available cores $\mathcal {C}=\{c_{p1},c_{p2},\ldots ,c_{pm}\}$,
3.
The set of states $\mathcal {L}=\left\{ (f,c):f\in \mathcal {F}\wedge c\in \mathcal {C}\right\} $, where $l\in \mathcal {L}$ is a selected state,
4.
The sustained bandwidth to the main memory $b_{p,l}$ in bytes per second based on the state l.
5.
The performance $h_{p,l}$ in the floating-point operations per second based on the state l.

4 Problem formulation

As showed in the previous section that the data locality has the highest influence on the energy usage, it has encouraged us to focus our research on a stencil workload scheduling using heterogeneous computing architectures to minimize the energy usage while meeting the computation’s deadline. The scheduling problem is defined by a set $\mathcal {P}$ of m processors and a workload $\mathcal {T}={T_{1},T_{2},...,T_{n}}$ of n dependent tasks.

A considered workload represents a stencil defined on a structural grid. Each point on a grid is updated with a strict pattern, see Fig. 1. The pattern defines which neighbouring points are used during a stencil computation. A single update of the whole grid is called a timestep. In our approach we focus on an explicit method where a current timestep is updated by using values of the grid points from a previous timestep. The considered heterogeneous hardware includes unrelated processing units (PUs) and the same stencil computation takes different execution times on them. Based on our experimental studies we distinguished two different unrelated processing units: central processing units (CPUs) and graphic processing units (GPUs).

The workload contains the set of dependent tasks. The block decomposition of the structural grid updated by the stencil forms the workload of tasks with the communication dependencies. A task represents a single block of the decomposed grid. We assume that the grid is decomposed on equally sized blocks. We assume that a given task may be processed by a single processing unit at a time and each processing unit may execute several tasks.

The tasks are represented by a directed graph defined by a tuple $G=(V,E)$ where V denotes the set of tasks and E represents the set of edges. For simplicity we assume that the task $T_u = u$ and the processor is depicted by p. Each edge $(u,v)\in E$ defines a communication between the tasks $u,v \in V$. The communication load $d_{u,v}$ on the edge (u, v) depicts the number of grid cells exchanged between tasks. The model assumes a fully connected network of heterogeneous processors with heterogeneous communication links. If tasks u and v are executed on different processors $p,k \in \mathcal {P}$, they cause the time $t_{p,k}^{e}$ and the energy $e_{p,k}^{e}$ penalty required to exchange a single grid cell between the processors p and k. If both tasks are scheduled on the same processor, then the communication time and the communication energy are equal to zero. The computation load $w_{u}$ describes the number of grid cells provided by the task u. The computation time and the energy cost to update the single cell on the processor p are represented by $t_{u,p,l}^{c}$ and $e_{u,p,l}^{c}$ respectively; see (3), (4). The idle power $P_p^{idle}$ depicts the power used when no computations are executed on processor p. The memory size $m_p$ represents the maximum number of grid cells that can be computed on processor p. The total communication time and the total communication energy to exchange all data are represented by $t^{e}$ and $e^{e}$ respectively. Total execution time $t^{t}$ indicates how much time it takes to finish the whole workload. The execution deadline $t^d$ denotes the time by which all tasks have to be finished. The objective is to determine a schedule such that the total energy cost is minimized and deadline $t^d$ is not exceeded.

5 Performance and energy models

Detailed analysis of the performance and the energy usage of the stencil computations on two unrelated processing units resulted in the following formulation of the performance model. Computation time $t_{u,p,l}^c$ of task u on processor p with state l is estimated as follows:

$$\begin{aligned} O_{u,p}=W_{u,p}*d_{u}^x*d_{u}^y*d_{u}^z \end{aligned}$$

(1)

$$\begin{aligned} B_{u,p}=Q_{u,p}*d_{u}^x*d_{u}^y*d_{u}^z \end{aligned}$$

(2)

$$\begin{aligned} t_{u,p,l}^c=max(O_{u,p}/h_{p,l},B_{u,p}/b_{p,l}) \end{aligned}$$

(3)

where $O_{u,p}$ is the number of arithmetic operations executed and $B_{u,p}$ is the number of bytes transferred.

The energy model assumes that each arithmetic operation as well as the memory operation consumes some energy:

$$\begin{aligned} e_{u,p,l}^c=e_{u,p}^{op}*O_{u,p}+e_{u,p}^{byte}*B_{u,p}+P0_{u,p,l}*t_{u,p,l}^c \end{aligned}$$

(4)

Variables $e_{u,p}^{op}$, $e_{u,p}^{byte}$ approximates the energy usage of stencil operations. For simplicity, it is assumed that arithmetic operations, i.e. additions, multiplications, subtractions and divisions, consume the same amount of energy. Additionally, the energy usage also depends on an instruction set used, thus for the highest performance the CPU implementation of the stencil uses the vector extensions. $P0_{u,p,l}$ is a constant power consumed by the processor $P_{p}$ based on the state l. The coefficients $e_{u,p}^{op},e_{u,p}^{byte}$ and $P0_{u,p,l}$ are approximated with a linear regression. Table 1 shows estimated values of the energy cost for the double precision floating point operation and the transfer of a single byte of data. For CPU and GPU the cost to transfer a single byte of data is 5.2x and 6x more expensive than the floating point operation, respectively. What is more, both floating point and memory operations are 5x more expensive on CPU than on GPU. Figure 2 shows that the constant power grows linearly with the increasing number of cores using different P-states.

Table 1 Energy coefficients for the CPU and GPU architectures

Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures

Abstract

Similar content being viewed by others

Runtime and energy constrained work scheduling for heterogeneous systems

Scalable Parallelization of Stencils Using MODA

Performance Engineering and Energy Efficiency of Building Blocks for Large, Sparse Eigenvalue Computations on Heterogeneous Supercomputers

1 Introduction

2 Related work

3 Stencil properties

4 Problem formulation

5 Performance and energy models

6 Optimal model

6.1 Multiplicity

6.2 Incidence

6.3 Maximum degree and maximum multiplicity

6.4 Chromatic index

6.5 ILP solution

7 Heuristics

7.1 Simple

7.1.1 Balancing load

7.2 Advanced

7.2.1 Minimize degree

7.2.2 Minimize multicut

7.2.3 Accumulate neighbours

8 Experimental studies

8.1 Simulation setup

8.2 Simulation results

8.3 Verification of energy model

9 Conclusions and future work

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation