1 Introduction
D01GBF, available in the NAG Numerical Library, computes an approximation of the multidimensional integral of a function using the Monte Carlo algorithm described in [12]. It does not utilize multiple cores of target architectures, and its performance is poor [29].
2 Multidimensional Monte Carlo integration
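For orientation, the basic (non-adaptive) Monte Carlo estimate of an integral over the unit hypercube \([0,1]^d\) averages the integrand at n uniformly random points. The following is a minimal sequential sketch in C. It is not the algorithm of [12] or the D01GBF routine; the integrand f, the function mc_integrate, the bound MAX_DIM, and the small linear congruential generator are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DIM 64  /* hypothetical upper bound on the dimension d */

/* Illustrative integrand (an assumption): f(x) = prod_i 2*x_i,
   whose exact integral over [0,1]^d equals 1. */
double f(const double *x, int d) {
    double p = 1.0;
    for (int i = 0; i < d; i++)
        p *= 2.0 * x[i];
    return p;
}

/* Simple 64-bit linear congruential generator returning a double
   in [0,1). Chosen for brevity, not for statistical quality. */
double next_uniform(uint64_t *state) {
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) * (1.0 / 9007199254740992.0);
}

/* Plain Monte Carlo: average f over n random points in [0,1]^d. */
double mc_integrate(int d, long n, uint64_t seed) {
    assert(d > 0 && d <= MAX_DIM && n > 0);
    double x[MAX_DIM];
    double sum = 0.0;
    uint64_t state = seed;
    for (long k = 0; k < n; k++) {
        for (int i = 0; i < d; i++)
            x[i] = next_uniform(&state);   /* one random point */
        sum += f(x, d);
    }
    return sum / (double)n;
}
```

For this integrand, a call such as mc_integrate(4, 1000000, 12345) should return a value close to 1, with a statistical error that decreases like \(1/\sqrt{n}\).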
3 Performance analysis
4 Implementation issues
simd before each loop [27]. To optimize memory access, the array \(\mathbf {v}\) should be allocated using the _mm_malloc() intrinsic. It works just like the malloc function and additionally allows data alignment [27]. This loop has limited length (i.e., d); thus, using multiple threads would not be profitable.
parallel construct. Lines 1–4, 5–8, and 9–12 can be treated as three independent sections that can be run in parallel. Lines 13, 20, and 21 can be rewritten as loops annotated with pragma omp for to be executed in parallel. Such loops can also be vectorized by the compiler using SIMD extensions. The loops in lines 15–17 and 22–24 can also be executed in parallel; however, the variable result should then be updated, as shown in Fig. 1.
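The ideas above (aligned allocation, vectorizing a short loop, and running a long loop in parallel while updating a shared accumulator) can be sketched as follows. This is a hedged illustration, not the code from Fig. 1: the function names alloc_aligned, scale, and sum_squares are hypothetical, the C11 function aligned_alloc is used as a portable stand-in for _mm_malloc(), and the reduction clause is one possible way to update result safely, assumed here rather than taken from the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n doubles on a 64-byte boundary. With Intel compilers,
   _mm_malloc(bytes, 64) and _mm_free() play the same role. */
double *alloc_aligned(size_t n) {
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 63) / 64 * 64;   /* C11: size is a multiple of the alignment */
    double *p = aligned_alloc(64, bytes);
    assert(p == NULL || (uintptr_t)p % 64 == 0);
    return p;                          /* release with free() */
}

/* Short loop of length d: vectorize only, no threads. */
void scale(double *v, int d, double a) {
    #pragma omp simd
    for (int i = 0; i < d; i++)
        v[i] *= a;
}

/* Long loop: threads plus SIMD. The shared accumulator is combined
   with a reduction so that concurrent updates do not race. */
double sum_squares(const double *x, long n) {
    double result = 0.0;
    #pragma omp parallel for simd reduction(+:result)
    for (long i = 0; i < n; i++)
        result += x[i] * x[i];
    return result;
}
```

Without OpenMP support enabled, the pragmas are ignored and the code runs sequentially with the same results.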
parallel construct that launches gangs of workers that will execute in parallel. Gangs may support multiple workers that execute in vector or SIMD mode [18], which is available on GPUs. The standard also provides several constructs that can be used to specify the scope of data in accelerated parallel regions.
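A minimal sketch of how these OpenACC constructs fit together is given below. It is only an illustration under stated assumptions: the function mean_square and its arguments are hypothetical and do not come from the source, and the reduction clause is assumed as the way to combine partial sums across gangs.

```c
#include <assert.h>

/* Mean of the squares of x[0..n-1]; hypothetical example. */
double mean_square(const double *x, int n) {
    assert(n > 0);
    double sum = 0.0;
    /* data: x is copied to the accelerator once for the whole region. */
    #pragma acc data copyin(x[0:n])
    {
        /* parallel loop: iterations are spread over gangs and vector
           lanes; partial sums are combined by the reduction. */
        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i] * x[i];
    }
    return sum / n;
}
```

A compiler without OpenACC support ignores the directives, so the same source also runs sequentially on the host.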
data construct to specify the scope of data in the accelerated region. The construct parallel loop is used to vectorize the internal loops 20, 21, and 22–24. Note that the variable result resides in host memory and is updated using the value of temp. We also use the OpenACC construct update host to guarantee that the actual value of the last entry of x resides in host memory.
5 Results of experiments
Optimal values of \(\beta \) and s

| n | \( {d}\) | K40m, \(\beta =7744\) | Xeon Phi, \(\beta =1024\) | E5-2670, \(\beta =2822\) |
|---|---|---|---|---|
| \(1e+06\) | 4 | 44000 | 16000 | 26561 |
| \(1e+06\) | 16 | 22000 | 8000 | 13281 |
| \(1e+06\) | 64 | 11000 | 4000 | 6640 |
| \(1e+07\) | 4 | 139140 | 50596 | 83994 |
| \(1e+07\) | 16 | 69570 | 25298 | 41997 |
| \(1e+07\) | 64 | 34785 | 12649 | 20999 |

| n | Continuous: Alg.1 (E5) | Continuous: E5 | Continuous: Phi | Continuous: K40m | NAG test: Alg.1 (E5) | NAG test: E5 | NAG test: Phi | NAG test: K40m |
|---|---|---|---|---|---|---|---|---|
| \(1e+05\) | 0.002 | 0.002 | 0.023 | 0.004 | 0.001 | 0.002 | 0.026 | 0.004 |
| \(1e+06\) | 0.017 | 0.005 | 0.030 | 0.008 | 0.012 | 0.008 | 0.035 | 0.012 |
| \(1e+07\) | 0.162 | 0.025 | 0.078 | 0.032 | 0.118 | 0.035 | 0.097 | 0.044 |
| \(1e+08\) | 1.623 | 0.190 | 0.268 | 0.138 | 1.157 | 0.297 | 0.355 | 0.209 |
| \(1e+09\) | 16.921 | 2.483 | 2.031 | 0.765 | 11.060 | 6.136 | 3.798 | 1.601 |

| d | n | Corner peak: Alg.1 (E5) | Corner peak: E5 | Corner peak: Phi | Corner peak: K40m | Product peak: Alg.1 (E5) | Product peak: E5 | Product peak: Phi | Product peak: K40m |
|---|---|---|---|---|---|---|---|---|---|
| 4 | \(1e+05\) | 0.008 | 0.013 | 0.034 | 0.006 | 0.005 | 0.003 | 0.024 | 0.006 |
| 4 | \(1e+06\) | 0.075 | 0.024 | 0.040 | 0.023 | 0.047 | 0.007 | 0.033 | 0.023 |
| 4 | \(1e+07\) | 0.746 | 0.076 | 0.111 | 0.102 | 0.475 | 0.036 | 0.071 | 0.125 |
| 4 | \(1e+08\) | 7.459 | 0.634 | 0.479 | 0.580 | 5.086 | 0.418 | 0.244 | 0.889 |
| 4 | \(1e+09\) | 74.586 | 6.417 | 4.752 | 4.966 | 46.596 | 6.454 | 2.929 | 8.203 |
| 16 | \(1e+05\) | 0.015 | 0.025 | 0.032 | 0.008 | 0.013 | 0.014 | 0.027 | 0.009 |
| 16 | \(1e+06\) | 0.145 | 0.060 | 0.050 | 0.030 | 0.125 | 0.030 | 0.034 | 0.039 |
| 16 | \(1e+07\) | 1.448 | 0.163 | 0.164 | 0.163 | 1.245 | 0.141 | 0.105 | 0.246 |
| 16 | \(1e+08\) | 14.483 | 1.840 | 1.163 | 1.135 | 12.472 | 1.380 | 0.759 | 1.895 |
| 16 | \(1e+09\) | 144.811 | 13.626 | 10.260 | 9.823 | 124.566 | 12.860 | 8.037 | 17.239 |
| 64 | \(1e+05\) | 0.043 | 0.034 | 0.042 | 0.016 | 0.041 | 0.034 | 0.036 | 0.019 |
| 64 | \(1e+06\) | 0.428 | 0.084 | 0.089 | 0.075 | 0.404 | 0.089 | 0.056 | 0.092 |
| 64 | \(1e+07\) | 4.274 | 0.479 | 0.345 | 0.597 | 4.036 | 0.439 | 0.230 | 0.729 |
| 64 | \(1e+08\) | 42.729 | 5.340 | 3.529 | 4.452 | 40.356 | 5.395 | 3.019 | 6.620 |
| 64 | \(1e+09\) | 427.951 | 50.657 | 30.761 | 38.499 | 404.294 | 70.072 | 27.361 | 65.353 |