1 Introduction

- We analyze auto-vectorized implementations of the matrix profile and find that this approach is not able to fully exploit the underlying SIMD capabilities of Intel CPUs.
- We develop SCRIMPvect and SCAMPvect, two custom-vectorized versions of SCRIMP and SCAMP that exploit the AVX2 and AVX-512 vector extensions.
- We detect a problem with masked vectorization instructions on AVX2 and propose a solution that uses AVX-512 instructions with 256-bit registers.
- We perform a thread affinity and load balance exploration.
- We analyze the roofline of the algorithms before and after vectorization.
- We analyze SCRIMPvect and SCAMPvect in terms of performance and find that our approach yields an improvement of more than 4\(\times\) with respect to auto-vectorization.
2 Motivation
3 Background
3.1 Time series analysis. The matrix profile
3.1.1 Parallelization
3.2 Vectorization extensions
3.2.1 AVX
3.2.2 AVX2
AVX2 provides, among other additions, the `vpmaskmov` instruction for conditional SIMD integer packed loads and stores [22].

3.2.3 AVX-512
4 SCRIMPvect and SCAMPvect
4.1 Vectorization approach
4.2 Implementations
We load (`LOADU_PD`) packs of elements from the time series and the precomputed statistics arrays, belonging to the VPACK subsequences that we want to compare with the first subsequence in terms of distance. To do so, the time series values of the first subsequence have to be replicated in the vector (`SET1_PD`).

Comparisons (`CMP_PD`) of the obtained correlations with those already present in P are done in parallel when updating P vertically (see Fig. 7). However, when updating P horizontally we would have to compute the horizontal maximum of the vector containing the correlations (`correlation_v`) and then update the P value if applicable. Instead, we opted for doing this sequentially (see Sect. 4.3 for more information).

We change `_CMP_GT_OQ` to `_CMP_LT_OQ` in the comparison instruction `CMP_PD`.

4.3 Challenges
We use `_CMP_LT_OQ` for the comparisons. O, for Ordered, causes false to be returned if either element in the comparison is NaN, and Q, for Quiet, causes no signal to be raised if there is a division by 0 or any error in the comparison.

Another challenge is computing the horizontal maximum of the correlation vector (`correlation_v` in Fig. 12). While the maximum value itself can be obtained by taking the element-wise maximum of the vector with a shuffled version of itself until the maximum is replicated throughout the whole vector, we also have to obtain the index of the subsequence giving such a maximum. That is not a trivial task, so we decided to do it sequentially.

5 Experimental evaluation
5.1 Experimental setup and methodology
We compile with the `-mavx2` and `-mavx512f` flags, respectively, and `-mfma` to enable fused multiply-add instructions. Furthermore, we have experimented with AVX-512 instructions using 256-bit YMM registers, for which we use the `-mavx512vl` flag. All experiments are executed 10 times and the time results are averaged. Thread affinity is left to the operating system unless otherwise mentioned. The operating system is Ubuntu with Linux kernel 4.15. A study on the effect of thread affinity can be found in Sect. 5.4.

Time series | n | m | Max | Min |
---|---|---|---|---|
Audio | 20,234 | 200 (2 s) | 6.69 | −56.48 |
ECG | 180,000 | 500 (2 s) | 2.6 | 0.325 |
Power | 180,000 | 1,325 (8 h) | 140 | 0 |
Seismology | 180,000 | 50 (2.5 s) | 696.1 | −185.7 |
Human Act | 6,997 | 120 (12 s) | 2.51 | −2.9 |
Peng. Behav | 109,842 | 800 | 0.52 | −0.21 |
Time series | n | m | Max | Min |
---|---|---|---|---|
ECG | 1,800,000 | 500 (2 s) | 3.39 | −1635 |
Power | 1,859,587 | 1,325 (8 h) | 140 | 0 |
Seismology | 1,728,000 | 50 (2.5 s) | 2329.3 | −2334.6 |
5.2 Results
Table 3. Number of dTLB store misses per implementation, series, and thread count.

Impl. | Series | n (K) | m | #th = 2 | #th = 16 | #th = 32 |
---|---|---|---|---|---|---|
AVX2 | Power | 180 | 1,325 | 9,320,502,463 | 5,209,871,973 | 24,371,203 |
AVX-512vl | Power | 180 | 1,325 | 3,903,164,142 | 1,949,028,376 | 9,241,465 |
AVX-512 | Power | 180 | 1,325 | 1,874,597,618 | 744,614,596 | 1,868,692 |
AVX2 | Seismo | 180 | 50 | 144,912,286 | 60,832,193 | 13,134,275 |
AVX-512vl | Seismo | 180 | 50 | 57,375,614 | 44,190,128 | 6,439,342 |
AVX-512 | Seismo | 180 | 50 | 40,623,017 | 31,620,232 | 2,220,071 |
5.3 Masking overhead
`MASKSTOREU` is a source of high overhead. We use `_mm256_maskstore` intrinsics for the AVX2 implementation (see Fig. 8) and `_mm256_mask_storeu` intrinsics for the AVX-512 implementations (see Fig. 9), which in turn compile to `vmaskmov`/`vpmaskmov` and masked `vmovup` instructions, respectively.

AVX2 masked operations compile to instructions (`vmaskmov` and `vblendv`) which can degrade performance in certain cases. The Intel 64 and IA-32 Architectures Optimization Reference Manual [26] recommends using `vmaskmov` only in cases where `vmovup` cannot be used. Furthermore, it states the following: "32-byte AVX store instructions that span two pages require an assist that costs roughly 150 cycles". Assists are microcode invoked by hardware to help handle certain events. We use `perf` to retrieve the number of these assists triggered by our implementations to get the data in Table 4. We use `other_assists.any`, which counts the "Number of times a microcode assist is invoked by hardware other than FP-assist. Examples include AD (page Access Dirty) and AVX* related assists", as stated by `perf list`. We can see that the data is highly correlated with that in Table 3. Although the assists are not broken down, the figures in the table suggest that AVX2's `vmaskmov` implementation might trigger assists even with 0-masked addresses, which may set A and D bits and update the TLB. Intel's ISA Reference Manual [22] states for `vmaskmov` that, in cases where mask bits indicate data should not be loaded or stored, paging A and D bits will be set in an implementation-dependent way.

Table 4. Number of other assists per implementation, series, and thread count.

Impl. | Series | n (K) | m | #th = 2 | #th = 16 | #th = 32 |
---|---|---|---|---|---|---|
AVX2 | Power | 180 | 1,325 | 2,284,759,397 | 2,299,158,617 | 10,163,170 |
AVX-512vl | Power | 180 | 1,325 | 2,065,544,112 | 1,835,219,679 | 7,677,854 |
AVX-512 | Power | 180 | 1,325 | 1,059,781,425 | 638,604,656 | 25,799 |
AVX2 | Seismo | 180 | 50 | 54,138,705 | 31,766,931 | 2,284,644 |
AVX-512vl | Seismo | 180 | 50 | 35,562,400 | 33,353,984 | 2,127,670 |
AVX-512 | Seismo | 180 | 50 | 21,002,933 | 26,417,715 | 25,973 |
In the AVX-512 implementations we replace `vmaskmov` and `vblend` with masked `vmov` instructions. The performance improvement is noticeable in the numbers in Tables 3 and 4, and in the performance gains over the AVX2 implementation shown in Figs. 14 and 15, especially for large-window series up to 16 threads.
5.4 Thread affinity exploration
We set `OMP_PLACES="{0,1}:64:2"`, so that 64 places are defined with 2 threads each. This way, each place corresponds to a physical core with 2 hyperthreads.

Binding threads close reduces the synchronization cost between threads (e.g., at the implicit barriers of `#pragma omp for` loops). This effect is more noticeable with shorter series (Audio and Human). For 64 threads, the nobind policy yields slightly better results than the close policy, as the operating system has the ability to migrate threads to idle cores in case there is load from system processes. For 128 threads the results are obviously the same whatever the policy.

5.5 Load balance exploration
We use the `schedule(dynamic)` OpenMP clause to balance the load. We compare the executions with `schedule(static)` against those with `schedule(dynamic)`; lower is worse for the static versions. We can observe that a static partition of (meta-)diagonals is not a good idea for these algorithms. The worst outcome is yielded by the vectorized algorithms, as expected. No-Vect is affected as well, but to a lower extent overall. This is because a thread's work unit is a diagonal, which is finer-grained than a meta-diagonal.

5.6 Large series results
5.7 Roofline after vectorization
The subtraction of `means[0]` (see Fig. 11) is now done VPACK times, whereas SCRIMP does not need such a subtraction for the dot product computation.