1 Introduction
2 Related Work and Motivation
The testbed application computes similarity measures between pairs of vector_count vectors on a hybrid Xeon/Xeon Phi architecture. Each vector has dim_size elements. For test purposes, the similarity measure used in the experiments was the square root of the sum of squared differences of values on corresponding dimensions, i.e. \(\sqrt{\sum_i (a_i - b_i)^2}\) for vectors a and b. Additionally, for the sake of testing load balancing in the case when computations for various pairs of vectors take various amounts of time, finding pairs for which the similarity measure exceeds a threshold is also considered. Successive vector elements are assigned values in the [1, 2] range.

3 Testbed Implementation, Optimization and Testing
3.1 Initial Implementation, Optimization and Tests on Intel Xeon Phi
The initial implementation uses a #pragma omp parallel block in which each thread:

- fetches its id,
- goes through a loop iterating through first_vector, the index of the first vector to which the other vectors from a batch are compared,
- goes through a loop iterating through batches of vectors; each batch has a predefined size (which, in the case of one version, decreases in time); each thread selects its own batch based on its thread id,
- compares the current first_vector to each of the vectors in the batch and stores similarity measures for each pair.
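A minimal C/OpenMP sketch of this scheme is given below. The identifiers vector_count, dim_size and batch_size follow the text; the helper similarity(), the row-major layout of the vectors and results arrays, and the exact way a thread picks its batches (here: starting right after first_vector, strided by thread id) are illustrative assumptions rather than the code used in the experiments.

```c
#include <math.h>
#include <omp.h>

/* Similarity of two vectors: square root of the sum of squared differences
   of values on corresponding dimensions (the measure described in Sect. 2). */
static double similarity(const double *a, const double *b, long dim_size) {
    double sum = 0.0;
    for (long d = 0; d < dim_size; d++) {
        double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sqrt(sum);
}

/* vectors: vector_count x dim_size (row-major); results: vector_count x vector_count.
   Each thread walks all first_vector values and picks its own batches by thread id. */
void compute_all_pairs(const double *vectors, double *results,
                       long vector_count, long dim_size, long batch_size) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();            /* each thread fetches its id */
        int num_threads = omp_get_num_threads();

        for (long first_vector = 0; first_vector < vector_count; first_vector++) {
            /* batches of vectors compared to first_vector; batch selection by thread id */
            for (long start = first_vector + 1 + (long)id * batch_size;
                 start < vector_count;
                 start += (long)num_threads * batch_size) {
                long end = start + batch_size;
                if (end > vector_count) end = vector_count;
                for (long j = start; j < end; j++)  /* compare and store similarity */
                    results[first_vector * vector_count + j] =
                        similarity(&vectors[first_vector * dim_size],
                                   &vectors[j * dim_size], dim_size);
            }
        }
    }
}
```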
The following versions of the code were implemented and tested:

- m1: the presented basic version following the concept presented in [11]. Separate, dynamically allocated arrays with alignment are used for reading input vectors and for storage of results.
- m2: in this version of the code an additional array is used so that the iterations of the loop through vectors in a second batch are independent, and results are stored in temporary space for each vector being compared.
- m21: the m2 version modified in such a way that initialization of the temporary result space is performed first for each vector of the second batch, then similarities are computed for each vector of the second batch, and finally the whole results for the batch are copied.
- m21-vbs: the m21 version with a decreasing batch size; the batch size is reduced over time down to a lower limit, batch_size_threshold. The algorithm is outlined below.
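Only the decreasing batch size and its lower limit batch_size_threshold are stated for m21-vbs, so the sketch below is a hypothetical reconstruction: it keeps the three m21 phases (initialize temporary space, compute similarities, copy the whole batch of results) and simply halves the batch size after every batch, never going below batch_size_threshold. The halving schedule, the single-threaded form, and the reuse of similarity() from the previous sketch are assumptions for illustration.

```c
double similarity(const double *a, const double *b, long dim_size); /* as in the sketch above */

/* Processing of all batches compared to one first_vector in the m21-vbs style. */
void process_batches_vbs(const double *vectors, double *results, double *tmp,
                         long first_vector, long vector_count, long dim_size,
                         long batch_size, long batch_size_threshold) {
    long current = batch_size;                       /* batch size decreases in time  */
    long start = first_vector + 1;
    while (start < vector_count) {
        long end = start + current;
        if (end > vector_count) end = vector_count;

        for (long j = start; j < end; j++)           /* phase 1: init temporary space */
            tmp[j - start] = 0.0;

        for (long j = start; j < end; j++)           /* phase 2: compute similarities */
            tmp[j - start] = similarity(&vectors[first_vector * dim_size],
                                        &vectors[j * dim_size], dim_size);

        for (long j = start; j < end; j++)           /* phase 3: copy whole batch out */
            results[first_vector * vector_count + j] = tmp[j - start];

        start = end;
        current /= 2;                                /* assumed decrease policy        */
        if (current < batch_size_threshold) current = batch_size_threshold;
    }
}
```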
The versions were first compared for dim_size = 10,000, balanced affinity and batch_size = 2500, as shown in Table 1. For m21-vbs, 78 was used as batch_size_threshold.

Table 1. Execution times for dim_size = 10,000, balanced affinity, batch_size = 2500
| Version of code | Execution time (s) |
|---|---|
| m1 | 32.59 |
| m2 | 32.08 |
| m21 | 31.31 |
| m21-vbs | 31.12 |
Execution times of m21-vbs were measured for various sizes of a batch for the number of vectors equal to 10,000 and the dimension size equal to 10,000. It can be seen that the best results are obtained for the batch size equal to 2500. Tests for various values of batch_size_threshold were also performed. Figure 2 shows execution times for the best batch size of 2500 and various lowest batch size limits. Execution times are lowest for middle values of the batch size limit; the value of 78 was used in subsequent tests as it gave the smallest execution times.
3.2 Implementation, Optimization and Tests on a Hybrid Intel Xeon/Xeon Phi System
For the hybrid system, computations of the m21-vbs version were improved with the following modifications:

- In order to increase the computation/communication ratio of processing a batch, batches for the "first vector" are considered, to which the vectors of the previously mentioned batch (now called the second vector batch) are compared. The pairing sketch after this list illustrates which vectors are compared to which ones in this case. It should be noted that, while vectors of the "first vector" batch are not compared among themselves directly, the code generates second vector batches (equivalent to the batches in the first version considered in Sect. 3.1) starting right after the first vector of the first vector batch, so that all pairs of vectors are finally compared.
- The code was designed to make the most of all cores of the Xeon CPUs and Xeon Phis, with some cores used by managing threads and other cores by computing threads. At the highest level, a number of threads equal to (1 \(+\) the number of Xeon Phis used) is launched; these threads fetch successive batches (dynamic assignment of batches to idle threads managing the CPUs or the Xeon Phis) and manage computations in lower level parallel regions (nested parallelism enabled): using a #pragma omp parallel block for the host CPUs, and #pragma offload followed by #pragma omp parallel for the Xeon Phi cards. It should be noted that the out clauses in the offloads for the two Xeon Phis copy results to distinct arrays. Apart from the managing host threads, the #pragma omp parallel constructs use num_threads(x) clauses to specify the number of computing threads used for the CPUs or a Xeon Phi. On the host, the best number of computing threads turned out to occupy (<the number of physical cores> − <the number of Xeon Phi cards used>) physical cores, with 2 computing threads launched per physical core. On the Xeon Phi in the offload mode, the best number of threads turned out to be 236. In the hybrid CPU \(+\) Xeon Phi tests, the following numbers of computing threads were used:
  - CPUs \(+\) 1 Xeon Phi: 38 threads for the CPUs, 236 threads for the Xeon Phi,
  - CPUs \(+\) 2 Xeon Phis: 36 threads for the CPUs, 236 threads for each Xeon Phi.
  For CPU-only scalability tests, up to 40 computing threads were used. Within the #pragma omp parallel blocks the same parallelization strategy was used for the CPUs and the Xeon Phis, i.e. #pragma omp for schedule(dynamic) for loop parallelization with dynamic assignment of iterations to threads. A structural sketch of this management scheme is given after this list.
- Various values of the first vector batch size and the second vector batch size influence load balance but also synchronization overheads; thus proper configurations need to be determined for the best execution times.
- Loop tiling when comparing a given first vector to a second vector. In this case, the loop going through the dimensions of the second vector was partitioned into batches of 1024 iterations (also visible in the pairing sketch after this list).
- Optimization using overlapping communication and computations in a hybrid environment [12].
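The pairing of first and second vector batches, together with the tiled dimension loop, can be sketched as follows. Only the pairing rule (second vector batches generated starting right after the first vector of the first vector batch) and the tile of 1024 iterations come from the text; the function name process_first_vector_batch, the row-major layout and the result storage are illustrative assumptions.

```c
#include <math.h>

#define TILE 1024   /* dimension tile from the loop tiling modification */

/* For one "first vector" batch [fb_start, fb_start + first_batch_size), second vector
   batches are generated starting right after the first vector of the first vector batch,
   so that, over all first vector batches, every pair of vectors is compared once. */
void process_first_vector_batch(const double *vectors, double *results,
                                long fb_start, long first_batch_size,
                                long second_batch_size,
                                long vector_count, long dim_size) {
    long fb_end = fb_start + first_batch_size;
    if (fb_end > vector_count) fb_end = vector_count;

    for (long sb_start = fb_start + 1; sb_start < vector_count;
         sb_start += second_batch_size) {                    /* second vector batches */
        long sb_end = sb_start + second_batch_size;
        if (sb_end > vector_count) sb_end = vector_count;

        for (long i = fb_start; i < fb_end; i++) {           /* first vectors          */
            long j0 = (i + 1 > sb_start) ? i + 1 : sb_start; /* only pairs with j > i  */
            for (long j = j0; j < sb_end; j++) {
                double sum = 0.0;
                for (long t = 0; t < dim_size; t += TILE) {  /* tiled dimension loop   */
                    long t_end = (t + TILE > dim_size) ? dim_size : t + TILE;
                    for (long d = t; d < t_end; d++) {
                        double diff = vectors[i * dim_size + d] - vectors[j * dim_size + d];
                        sum += diff * diff;
                    }
                }
                results[i * vector_count + j] = sqrt(sum);   /* store similarity       */
            }
        }
    }
}
```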
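The management scheme from the second modification (one host managing thread plus one managing thread per Xeon Phi, dynamic assignment of batches, nested parallel regions, #pragma offload for the cards) might be structured as in the following sketch. The placeholder per-batch work and the batch bookkeeping are assumptions; #pragma offload requires the Intel compiler's offload support, and the real code passes vectors and results through in/out clauses, with distinct result arrays for the two cards.

```c
#include <omp.h>
#include <stdio.h>

#define NUM_PHI       2    /* Xeon Phi cards used (2 in the hybrid tests)            */
#define HOST_THREADS  36   /* computing threads on the host CPUs (38 with one card)  */
#define PHI_THREADS   236  /* computing threads per Xeon Phi                         */
#define TOTAL_BATCHES 64   /* placeholder number of first vector batches             */

int main(void) {
    long next_batch = 0;                       /* shared counter: dynamic batch assignment */
    omp_set_nested(1);                         /* enable nested parallelism                */

    #pragma omp parallel num_threads(1 + NUM_PHI)   /* 1 host manager + 1 manager per card */
    {
        int manager = omp_get_thread_num();
        for (;;) {
            long batch;
            #pragma omp atomic capture          /* idle manager fetches the next batch     */
            batch = next_batch++;
            if (batch >= TOTAL_BATCHES) break;

            if (manager == 0) {
                /* host CPUs: nested parallel region with computing threads */
                #pragma omp parallel num_threads(HOST_THREADS)
                {
                    #pragma omp for schedule(dynamic)
                    for (int i = 0; i < 1000; i++) {
                        /* placeholder for comparing vectors of this batch on the host */
                    }
                }
            } else {
                double partial[1000];           /* distinct result array per card          */
                #pragma offload target(mic : manager - 1) out(partial)
                {
                    #pragma omp parallel num_threads(PHI_THREADS)
                    {
                        #pragma omp for schedule(dynamic)
                        for (int i = 0; i < 1000; i++)
                            partial[i] = (double)(batch + i);  /* placeholder work */
                    }
                }
            }
        }
    }
    printf("all %d batches processed\n", TOTAL_BATCHES);
    return 0;
}
```

With this structure, whichever managing thread becomes idle first fetches the next batch, which is what balances load between the host CPUs and the Xeon Phi cards.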
Tests of various first and second vector batch sizes were run with the following settings: export MIC_ENV_PREFIX=PHI, export PHI_KMP_AFFINITY=granularity=fine,compact, export MIC_USE_2MB_BUFFERS=64k. It can be seen very clearly that there is an area of best settings, with the second vector batch size around 100 and the first vector batch size around 1000–2000, for the best execution times. Either smaller or larger settings result in higher overall execution times, because smaller values incur a larger overhead from the number of packets that need to be handled, while larger sizes do not allow good load balancing. Tests revealed that these settings are also appropriate for several other combinations of the total number of vectors and dimension sizes tested next. These values differ from those obtained for the initial version discussed in Sect. 3.1 due to the consideration of first vector batches in the improved version. However, tests revealed that the value batch_size_threshold = 78 is still suitable for the improved version, as evidenced by the results shown in Fig. 8; various settings of batch_size_threshold gave only marginally different execution times.
Results for various affinity settings and for MIC_USE_2MB_BUFFERS=64k are shown in Figure 7. It can be seen clearly that, while the affinity settings do not impact performance much (note that 236 threads per Xeon Phi are used here), setting MIC_USE_2MB_BUFFERS=64k brings a noticeable gain in performance.
Scalability was tested using the num_threads(x) clause to occupy a given number of logical processors on the Xeons or a Xeon Phi. Figures 9 and 10 present execution times and speed-ups for these configurations for 10,000 vectors with the dimension size of 10,000. It can be noted that in work [25] parallel execution of the BiCGSTAB solver gave very similar speed-ups of up to around 90 for larger data sizes. In our case the performance of the host CPUs and the Xeon Phi was very similar. The peak double precision performance of the Intel Xeon Phi 5000 series is given at just over 1 TFlop/s, which is higher than the theoretical performance of the CPUs used. However, real results may differ and depend on how efficiently the resources can be used by a given implementation of a specific problem. Especially in the case of the Xeon Phi, speed-up limitations and memory bottlenecks may affect performance. For instance, Fig. 10 shows that for the application tested the parallel efficiency obtained for the host CPUs (speed-up of almost 26 for a total of 40 logical processors), compared to that for the Xeon Phi (speed-up of 95 for a total of 240 logical processors), contributes to similar execution times for the two. The power requirement of the CPUs is \(2\cdot 115\) W, which is slightly higher than the 225 W given for the Intel Xeon Phi 5110P, which might be a relative benefit for the latter.