1 Introduction
2 Stochastic bounds for Markov chains
3 Parallel implementations
3.1 Representation of matrices
In the CSR format, the nonzero values of the matrix are stored in the array val, sorted left-to-right-then-top-to-bottom by their position in the original matrix; their corresponding column indices in col_ind; and the indices of the start of every row in row_ptr. The VCSR format is analogous; however, we do not store each nonzero element, but only the first one of each constant sequence. Moreover, both investigated operations (S and D) preserve two of the three input arrays (changing only val).

3.2 Loop-level parallelism programming model
Our loop-level implementations use #pragma omp parallel for directives. This lets the OpenMP scheduler choose the default mode for cutting loop iterations into chunks and distributing them across the available resources. The user can set a strategy for the scheduler, specifying the size of the chunks, which can be executed statically or dynamically. For our experiments, we use the static scheduler because it gave us the best results. Parallelizing loops with Intel Cilk Plus is also very simple with the use of the keyword cilk_for instead of the standard C++ for
keyword, thus giving a hint that parallel execution is possible in this loop. Intel Cilk Plus does not provide any way to choose a scheduler.

3.3 Task-based programming model
Our task-based implementations use #pragma omp task directives and the Intel Cilk Plus cilk_spawn keyword. The cilk_spawn keyword is used to launch a task represented by a function (or any callable object, such as a lambda, as in our implementations) in parallel with the rest of the current program.

3.4 Details of implementations
All implementations use N as the number of rows (and columns) of the matrix, NZ as the number of nonzero elements of the matrix, and val, col_ind, row_ptr as the pointers to the respective arrays of the CSR/VCSR storage schemes; some of them also use PER_TASK, which is the number of rows processed in one task (and which therefore also determines the number of tasks).

4 Details of experiments
We compare five implementations:

- the omp-for implementation, using #pragma omp parallel for from the OpenMP standard, based on the fork-join paradigm with the static scheduler (Listing 1);
- the cilk-for implementation, using the cilk_for keyword from Intel Cilk Plus (Listing 2);
- the omp-task implementation, using #pragma omp task from the OpenMP standard, based on the task paradigm (Listing 3);
- the cilk-spawn implementation, using the cilk_spawn keyword from Intel Cilk Plus (Listing 4);
- the cilk-req implementation, using the cilk_spawn keyword in a recursive manner (Listings 5 and 6).
| | CPU | MIC |
|---|---|---|
| Processor | 2 × Intel Xeon E5-2670 v3 (Haswell) | Intel Xeon Phi 7120 (Knights Corner) |
| # Cores | 24 (12 per socket) | 61 |
| # Threads | 48 (2 per core) | 244 (4 per core) |
| Clock | 2.30 GHz | 1.24 GHz |
| Level 1 instruction cache | 32 kB per core | 32 kB per core |
| Level 1 data cache | 32 kB per core | 32 kB per core |
| Level 2 cache | 256 kB per core | 512 kB per core |
| Level 3 cache | 30 MB | – |
| RAM memory | 128 GB | 16 GB |
| Compiler | Intel ICC 16.0.0 | Intel ICC 16.0.0 |
| BLAS/LAPACK libraries | MKL 2016.0.109 | MKL 2016.0.109 |
| Max. mem. bandwidth | 68 GB/s | 352 GB/s |
| SIMD register size | 256 b | 512 b |
All codes were compiled with the Intel C++ compiler (icc) with the optimization flag -O3. All the results are presented for the double data type and the native mode for MIC. Carrying out the numerical experiments, we varied the number of available threads. For CPU, we studied thread counts from 1 to 24 (the number of physical cores). For MIC, we used 60 cores in native mode; in the native execution model, when the application is started directly on the Xeon Phi card, we can use 60 cores (with all 4 threads on each). For MIC, we studied thread counts up to 120, since larger numbers of threads degraded the performance.

5 Experimental results
Execution times were measured with the OpenMP function omp_get_wtime().

5.1 Time
We can observe that:

- the omp-for version has the shortest execution time, both on CPU and on MIC; however, its behavior on CPU is unstable, with visible jumps;
- the omp-task version has the longest execution time on CPU;
- the cilk-spawn version has the longest execution time on MIC;
- the omp-for version is better than cilk-for, that is, loop-level parallelism is better in OpenMP;
- on the other hand, task-based parallelism is not implemented as efficiently in OpenMP as in Intel Cilk Plus;
- CPU runs faster than MIC.
5.2 Speedup
The cilk-for version is more stable on CPU than omp-for, which jumps; on MIC, the situation is the opposite (omp-for is more stable). The worst scalability is achieved by cilk-spawn and cilk-req, both on CPU and on MIC. It seems that their overhead grows with the number of threads, because the scalability is better for smaller numbers of threads.

6 Conclusion and future works
Our experiments show that:

- better results were obtained for the CPU than for the Intel Xeon Phi;
- the use of #pragma omp parallel for from the OpenMP standard performed better than the cilk_for construct from Cilk Plus;
- the use of #pragma omp task from the OpenMP standard performed worse than the cilk_spawn construct from Cilk Plus;
- the best results were obtained for #pragma omp parallel for from the OpenMP standard;
- the poorest results were achieved for #pragma omp task from the OpenMP standard.
Overall, the best performance in our experiments was delivered by #pragma omp parallel for. While the Intel Cilk Plus standard offers similar functionality, in our applications we were not able to achieve satisfactory performance results with Cilk Plus, despite investing a greater development effort than we did with OpenMP. Moreover, for Intel Cilk Plus the results depend on the number of nonzero elements in the sparse matrices.