1 Introduction
- We formulate power models that estimate power consumption as a polynomial function of the frequency or the number of cores. Using these models, we then introduce a methodology for power approximation based on a small set of real power measurements.
- In addition, we propose a systematic procedure to select the VFS–CT configurations that should be tested to improve the precision of the power models. We emphasize that both the methodology for power approximation and the sample-selection procedure carry over, in principle, to any other application characterized by:
  - a fair workload balance across all the cores (the model is not designed for algorithms with limited parallelism or with dependencies that lead to idle cores); and
  - an iterative nature, where all iterations consume similar amounts of time and energy. This makes it possible to predict the power dissipation of the entire simulation from only a few iterations.
- We perform a detailed power evaluation of MPDATA on a variety of ARM and Intel architectures, representative of both the low-power and high-performance extremes of the spectrum of current multicore processors.
- Finally, we validate the precision of our power models using the same collection of multicore CPUs, for both the MPDATA and the CG algorithms.
2 Related work
`pmlib` [2] are two frameworks to profile power and track energy consumption of serial and concurrent applications, using wattmeters combined with software microbenchmarks and utilities.
3 MPDATA overview and parallelization
3.1 Overview
3.2 Parallelization of MPDATA
`GOMP_CPU_AFFINITY=0-5` for the gcc compiler, or `OMP_PROC_BIND=true` for OpenMP v.3.1. We can also set the thread affinity directly from the application code (see Listing 1). In our approach, we use the third method, as it does not require setting any environment variables outside of the code. In this method, the `CPU_ZERO` macro clears the `set` variable, the `CPU_SET` macro adds the CPU to the set, the `syscall(SYS_gettid)` routine returns the thread id of the current thread, and finally the `sched_setaffinity` routine sets/migrates the thread `tid` to a core specified in the `set`.
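The calls described above can be sketched as follows. This is a minimal illustration rather than the paper's Listing 1: it assumes Linux with glibc, the helper names `pin_current_thread` and `pin_all_threads` are hypothetical, and the mapping of OpenMP thread i to core i is an assumption made only for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes cpu_set_t, the CPU_* macros and sched_setaffinity */
#endif
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: pin the calling thread to a single core.
 * Returns 0 on success and -1 on failure (errno is set by the kernel). */
int pin_current_thread(int core)
{
    assert(core >= 0 && core < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);                          /* clear the CPU set          */
    CPU_SET(core, &set);                     /* add the requested core     */
    pid_t tid = (pid_t)syscall(SYS_gettid);  /* kernel id of this thread   */
    return sched_setaffinity(tid, sizeof(set), &set);
}

/* Hypothetical helper: inside one parallel region, pin OpenMP thread i
 * to core i (an assumed mapping). Returns the number of threads whose
 * pinning failed. Without OpenMP the pragma is ignored and only the
 * calling thread is pinned, to core 0. */
int pin_all_threads(void)
{
    int failures = 0;
#pragma omp parallel reduction(+ : failures)
    {
#ifdef _OPENMP
        int core = omp_get_thread_num();
#else
        int core = 0;
#endif
        if (pin_current_thread(core) != 0)
            failures++;
    }
    return failures;
}
```

When built with `gcc -fopenmp`, calling `pin_all_threads()` once at program start pins every thread of the team; a nonzero return value indicates that some core index was not available to the process.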
`#pragma omp parallel` is used only once, at the beginning of the program, while the directive `#pragma omp for` is applied multiple times, once for each stencil.
4 Formulation of the power models
4.1 Power models
4.2 Calibration of the power models
5 Verification of the power estimation methodology
5.1 Experimental setup
- ARM.little: ARMv7 cluster with 4 low-power ARM Cortex-A7 cores (packaged in the Exynos E5422 system-on-chip);
- ARM.big: ARMv7 cluster with 4 ARM Cortex-A15 cores (also packaged in the Exynos E5422 system-on-chip);
- Intel.Sandy1: Intel Xeon E5-2620 (SandyBridge) with 6 cores;
- Intel.Sandy2: Two Intel Xeon E5-2620 CPUs (SandyBridge) with 6 cores each;
- Intel.Ivy: Intel Core i7-3930K CPU with 6 cores.

Prop. | ARM.little | ARM.big | Intel.Sandy1 | Intel.Sandy2 | Intel.Ivy |
---|---|---|---|---|---|
\(S_c\) | 4 | 4 | 6 | 12 | 6 |
\(S_f\) | 8 | 9 | 9 | 9 | 15 |
Min. freq. (GHz) | 0.25 | 0.8 | 1.2 | 1.2 | 1.2 |
Max. freq. (GHz) | 0.6 | 1.6 | 2.0 | 2.0 | 3.2 |
\(S_f \times S_c\) | 32 | 36 | 54 | 108 | 90 |

`pmlib` library [2]. For the ARM architectures, this tool obtains power data from internal power sensors in the big.LITTLE CPU clusters, the Mali GPU, and the memory DIMMs. For the Intel servers, `pmlib` collects power samples from a PDU, providing real power measurements for the full server.
5.2 Testing the power modeling approach

#Cores \ CPU frequency (GHz) | 1.2 | 1.3 | 1.4 | 1.5 | 1.6 | 1.7 | 1.8 | 1.9 | 2.0 |
---|---|---|---|---|---|---|---|---|---|
1 | 122.0 | 123.0 | 125.0 | 126.4 | 128.1 | 129.7 | 131.8 | 133.6 | 135.1 |
2 | 125.7 | 127.7 | 130.1 | 131.4 | 133.4 | 137.9 | 140.8 | 142.2 | 144.1 |
3 | 129.8 | 131.8 | 136.9 | 138.5 | 140.7 | 143.0 | 145.2 | 148.7 | 150.2 |
4 | 136.4 | 138.5 | 140.0 | 142.9 | 145.4 | 148.6 | 151.3 | 154.1 | 155.9 |
5 | 138.8 | 141.5 | 144.0 | 146.4 | 148.9 | 152.5 | 155.4 | 159.4 | 162.8 |
6 | 141.1 | 144.5 | 146.8 | 149.4 | 153.4 | 157.4 | 160.2 | 165.2 | 169.4 |
7 | 144.3 | 146.0 | 149.3 | 152.9 | 155.7 | 160.1 | 163.9 | 167.0 | 171.4 |
8 | 146.3 | 151.4 | 153.2 | 164.7 | 160.0 | 166.9 | 172.3 | 175.8 | 180.0 |
9 | 148.5 | 152.7 | 160.7 | 162.8 | 165.2 | 170.0 | 175.7 | 179.9 | 186.0 |
10 | 155.6 | 158.8 | 163.2 | 166.3 | 171.7 | 173.7 | 180.5 | 184.2 | 188.4 |
11 | 157.1 | 161.1 | 164.6 | 168.3 | 173.3 | 177.6 | 180.6 | 185.6 | 190.6 |
12 | 158.9 | 163.6 | 167.0 | 170.9 | 173.9 | 179.2 | 184.9 | 190.3 | 195.1 |

#Cores \ CPU frequency (GHz) | 1.2 | 1.3 | 1.4 | 1.5 | 1.6 | 1.7 | 1.8 | 1.9 | 2.0 |
---|---|---|---|---|---|---|---|---|---|
1 | 127.7 | 128.3 | 129.3 | 130.6 | 132.3 | 134.2 | 136.4 | 139.0 | 141.9 |
2 | 130.6 | 131.7 | 133.1 | 134.7 | 136.6 | 138.8 | 141.3 | 144.0 | 147.1 |
3 | 133.6 | 135.1 | 136.8 | 138.8 | 141.0 | 143.5 | 146.1 | 149.1 | 152.2 |
4 | 136.6 | 138.5 | 140.6 | 142.9 | 145.4 | 148.1 | 151.0 | 154.1 | 157.4 |
5 | 139.5 | 141.9 | 144.4 | 147.0 | 149.8 | 152.7 | 155.8 | 159.1 | 162.5 |
6 | 142.5 | 145.2 | 148.1 | 151.1 | 154.2 | 157.4 | 160.7 | 164.1 | 167.7 |
7 | 145.5 | 148.6 | 151.9 | 155.2 | 158.6 | 162.0 | 165.5 | 169.2 | 172.8 |
8 | 148.4 | 152.0 | 155.6 | 159.3 | 162.9 | 166.7 | 170.4 | 174.2 | 178.0 |
9 | 151.4 | 155.4 | 159.4 | 163.3 | 167.3 | 171.3 | 175.3 | 179.2 | 183.2 |
10 | 154.4 | 158.8 | 163.1 | 167.4 | 171.7 | 175.9 | 180.1 | 184.2 | 188.3 |
11 | 157.3 | 162.1 | 166.9 | 171.5 | 176.1 | 180.6 | 185.0 | 189.3 | 193.5 |
12 | 160.3 | 165.5 | 170.6 | 175.6 | 180.5 | 185.2 | 189.8 | 194.3 | 198.6 |
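
The 10-core rows of the two tables above coincide at 1.3, 1.6 and 1.9 GHz (158.8, 171.7 and 184.2). As an illustration of a polynomial power model in the frequency, the sketch below passes a second-degree polynomial through these three values and evaluates it at other frequencies. It is a hedged example, not the calibration procedure of Section 4.2: the Lagrange form and the names `lagrange_eval` and `estimate_power_10_cores` are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate at x the unique polynomial of degree n-1 that passes through
 * the n points (xs[i], ys[i]), using the Lagrange form. The xs[i] must
 * be pairwise distinct. */
double lagrange_eval(const double *xs, const double *ys, size_t n, double x)
{
    assert(n > 0);
    double result = 0.0;
    for (size_t i = 0; i < n; i++) {
        double term = ys[i];
        for (size_t j = 0; j < n; j++) {
            if (j != i)
                term *= (x - xs[j]) / (xs[i] - xs[j]);
        }
        result += term;
    }
    return result;
}

/* Three samples copied from the 10-core row of the tables:
 * power values at 1.3, 1.6 and 1.9 GHz. */
static const double sample_freq_ghz[3] = {1.3, 1.6, 1.9};
static const double sample_power[3] = {158.8, 171.7, 184.2};

/* Hypothetical helper: quadratic estimate of the 10-core power value
 * at an arbitrary frequency given in GHz. */
double estimate_power_10_cores(double freq_ghz)
{
    return lagrange_eval(sample_freq_ghz, sample_power, 3, freq_ghz);
}
```

Evaluating `estimate_power_10_cores` at 1.2 and 2.0 GHz gives approximately 154.4 and 188.3, which agree with the 10-core entries of the second table; the corresponding entries of the first table are 155.6 and 188.4.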
6 Conclusions and future work

\(S_m\) | ARM.little | ARM.big | Intel.Sandy1 | Intel.Sandy2 | Intel.Ivy |
---|---|---|---|---|---|
3 | 29.83 | 34.74 | 5.95 | 8.43 | 8.68 |
4 | 6.43 | 4.24 | 0.71 | 1.42 | 1.13 |
5 | 3.95 | 3.45 | 0.70 | 1.42 | 0.97 |
6 | 1.11 | 2.84 | 0.66 | 1.32 | 0.36 |

\(S_m\) | ARM.little | ARM.big | Intel.Sandy1 | Intel.Sandy2 | Intel.Ivy |
---|---|---|---|---|---|
3 | 29.24 | 37.73 | 6.18 | 9.30 | 22.88 |
4 | 6.59 | 4.29 | 1.51 | 2.12 | 9.89 |
5 | 2.79 | 3.63 | 1.30 | 2.04 | 9.59 |
6 | 1.54 | 2.17 | 1.24 | 1.35 | 4.90 |