Intel Xeon Phi Cores
- GFLOP/sec = 16 (SP SIMD lanes) × 2 (FMA) × 1.1 (GHz) × 60 (cores) = 2112 for single-precision arithmetic
- GFLOP/sec = 8 (DP SIMD lanes) × 2 (FMA) × 1.1 (GHz) × 60 (cores) = 1056 for double-precision arithmetic
Core Pipeline Stages
The microcode (µcode or ucode) ROM is also a source of microcode, which is muxed in with the ucodes generated by the preceding decode stage. The processor reads the general-purpose register file at the D2 stage, performs the address computation, and looks up the speculative data cache. The decoded instructions are sent down to the execution unit through the U and V pipelines. The U-pipe is taken by the first instruction of a pair; the second instruction, if pairable under the pairing rules that dictate which instructions can pair with the one sent down the U-pipe, goes down the V-pipe. At this stage, integer instructions are executed in the arithmetic logic units (ALUs). Once scalar integer instructions reach the writeback (WB) stage, they are done. A separate pipeline for x87 floating-point and vector instructions starts after the core pipeline. When vector instructions reach the WB stage, the core treats them as retired, but they are not yet complete: the vector unit keeps working on them until they finish at the end of the vector pipeline five cycles later. Past the WB stage they can no longer raise exceptions and are guaranteed to complete.

Cache and TLB Structure
Parameter | Value
---|---
Size | 32 kB
Associativity | 8-way
Line size | 64 bytes
Bank size | 8 bytes
Outstanding misses | 8
Data return | Out of order
TLB | Page Size | Entries | Associativity
---|---|---|---
L1 Data TLB | 4 KB | 64 | 4-way
L1 Data TLB | 64 KB | 32 | 4-way
L1 Data TLB | 2 MB | 8 | 4-way
L1 Instruction TLB | 4 KB | 64 | 4-way
L2 TLB | 4 KB, 64 KB, 2 MB | 64 | 4-way
Applications can use the madvise system API to control transparent huge page (THP) behavior.
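As a minimal sketch of how that works on Linux (the mmap'ed buffer and its size here are illustrative, not from the original text; MADV_HUGEPAGE requests huge-page backing for a range, and MADV_NOHUGEPAGE disables it):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

int main(void)
{
    size_t len = 4UL * 1024 * 1024;   /* illustrative 4 MB region */
    /* madvise wants a page-aligned range, which mmap guarantees */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    /* Ask the kernel to back this range with transparent huge pages */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        return 1;                     /* e.g., kernel built without THP */
    /* ... touch and use buf ... */
    munmap(buf, len);
    return 0;
}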
L2 Cache Structure
Multithreading
Performance Considerations
Address Generation Interlock
add rbx,4
mov rax,[rbx]
The address generation for [rbx] is done in a separate pipeline stage before the memory fetch happens at line 2 above. In this case, the hardware inserts a two-clock delay between the two instructions. When running more than one thread per core, you may not observe the stall, because instructions from the other threads can execute during the dead clock cycles.

Prefix Decode
Pairing Rules
Probing the Core
Measuring Peak Gflops
Note the "#pragma vector aligned" at line numbers 64 and 72 of Code Listing 4-1. If you look carefully, all threads write back to array a[] at lines 65 and 73. This causes a race condition and would not be useful in real application code, but for illustrative purposes the effect can be ignored here. The "#pragma omp parallel for" will divide up the loop iterations statically, in this default case, among the available threads set by the environment variable. The #pragma vector aligned at line 72 tells the compiler that the arrays a, b, and c are all aligned to the 64-byte boundary required by Intel Xeon Phi, so it does not need to generate the special load manipulations needed for unaligned data.

Code Listing 4-1
38 #include <stdio.h>
39 #include <stdlib.h>
40 #include <omp.h>
41
42 unsigned int const SIZE=16;
43 unsigned int const ITER=48000000;
44
45 extern double elapsedTime (void);
46
47 int main()
48 {
49 double startTime, duration;
50 int i;
51
52 __declspec(align(64)) double a[SIZE],b[SIZE],c[SIZE];
53
54
55 //initialize
56 for (i=0; i<SIZE;i++)
57 {
58 c[i]=b[i]=a[i]=(double)rand();
59 }
60
61 //warm up cache
62 #pragma omp parallel for
63 for(i=0; i<ITER;i++) {
64 #pragma vector aligned (a,b,c)
65 a[0:SIZE]=b[0:SIZE]*c[0:SIZE]+a[0:SIZE];
66 }
67
68 startTime = elapsedTime();
69
70 #pragma omp parallel for
71 for(i=0; i<ITER;i++) {
72 #pragma vector aligned (a,b,c)
73 a[0:SIZE]=b[0:SIZE]*c[0:SIZE]+a[0:SIZE];
74 }
75
76 duration = elapsedTime() - startTime;
77
78 double Gflop = 2*SIZE*ITER/1e+9;
79 double Gflops = Gflop/duration;
80
81 printf("Running %d openmp threads\n", omp_get_max_threads());
82 printf("DP GFlops = %f\n", Gflops);
83
84 return 0;
85
86 }
Command_prompt > icpc -O3 -mmic -opt-threads-per-core=2 -no-vec -openmp -vec-report3 dpflops.cpp gettime.cpp -o dpflops.out
The elapsedTime routine used above relies on the gettimeofday function to get the elapsed time in seconds, as shown in the following code segment:

#include <sys/time.h>
extern double elapsedTime (void)
{
struct timeval t;
gettimeofday(&t, 0);
return ((double)t.tv_sec + ((double)t.tv_usec / 1000000.0));
}
- The -mmic switch directs the compiler to generate cross-compiled code for the Intel Xeon Phi coprocessor.
- The -opt-threads-per-core=2 switch allows the compiler code generator to schedule the generated code assuming two threads are running on each core.
- The -no-vec switch asks the compiler not to vectorize the code, even if it can be vectorized.
- The -openmp switch allows the compiler to interpret the OpenMP pragmas at compile time and link in the appropriate OpenMP libraries.
- The -vec-report3 switch tells the compiler to print detailed information about the vectorization performed on the code as it is compiled.

command_prompt-host >scp ./dpflops.out mic0:/tmp
dpflops.out 100% 19KB 18.6KB/s 00:00
command_prompt-host >
command_prompt-host >scp /opt/intel/composerxe/lib/mic/libiomp5.so mic0:/tmp
command_prompt-host >ssh mic0
Once logged in to the card, set LD_LIBRARY_PATH to /tmp so that the loader can find the OpenMP runtime library copied to the /tmp directory on the Xeon Phi card. Set the number of threads to 1 by setting the environment variable OMP_NUM_THREADS=1. Also set KMP_AFFINITY=compact, so that the OpenMP threads with IDs 0-3 are tied to core 1, and so on.
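For example, assuming a POSIX shell on the coprocessor (the session below mirrors the scp destinations used above), the setup looks like this:

command_prompt-mic0 >export LD_LIBRARY_PATH=/tmp
command_prompt-mic0 >export OMP_NUM_THREADS=1
command_prompt-mic0 >export KMP_AFFINITY=compact
command_prompt-mic0 >/tmp/dpflops.out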
Running with two threads per core (OMP_NUM_THREADS=2) reaches 1.45 Gflops per core. Because the core uses the vector unit even for scalar arithmetic, the code generated for scalar arithmetic is very inefficient: for each FMA on a DP element, the vector unit has to broadcast the element to all the lanes, operate on the vector register under a mask, and then store the single element back to memory. Note that increasing the number of threads per core does not improve performance, since an instruction can already be issued every cycle in this case. If you extend to 240 threads utilizing all 60 cores, you can achieve 86 Gflops, as shown in Figure 4-3. Let's see whether we can get close to the 1 teraflop that Intel Xeon Phi is designed for by turning on vectorization.
icpc -O3 -opt-threads-per-core=2 -mmic -openmp -vec-report3 dpflops.cpp gettime.cpp -o dpflops.out
dpflops.cpp(58): (col. 29) remark: loop was not vectorized: statement cannot be vectorized.
dpflops.cpp(65): (col. 16) remark: LOOP WAS VECTORIZED.
dpflops.cpp(63): (col. 9) remark: loop was not vectorized: not inner loop.
dpflops.cpp(73): (col. 16) remark: LOOP WAS VECTORIZED.
dpflops.cpp(71): (col. 4) remark: loop was not vectorized: not inner loop.
2 (FMA) × 8 (DP SIMD lanes) × 1.1 (GHz) = 17.6 Gflops per core

The listing below is the single-precision version of the same benchmark: the arrays are declared as float and SIZE is doubled to 32.

38 #include <stdio.h>
39 #include <stdlib.h>
40 #include <omp.h>
41
42 unsigned int const SIZE=32;
43 unsigned int const ITER=48000000;
44
45 extern double elapsedTime (void);
46
47 int main()
48 {
49 double startTime, duration;
50 int i;
51
52 __declspec(align(64)) float a[SIZE],b[SIZE],c[SIZE];
53
54
55 //initialize
56 for (i=0; i<SIZE;i++)
57 {
58 c[i]=b[i]=a[i]=(float)rand();
59 }
60
61 //warm up cache
62 #pragma omp parallel for
63 for(i=0; i<ITER;i++) {
64 #pragma vector aligned (a,b,c)
65 a[0:SIZE]=b[0:SIZE]*c[0:SIZE]+a[0:SIZE];
66 }
67
68 startTime = elapsedTime();
69
70 #pragma omp parallel for
71 for(i=0; i<ITER;i++) {
72 #pragma vector aligned (a,b,c)
73 a[0:SIZE]=b[0:SIZE]*c[0:SIZE]+a[0:SIZE];
74 }
75
76 duration = elapsedTime() - startTime;
77
78 double Gflop = 2*SIZE*ITER/1e+9;
79 double Gflops = Gflop/duration;
80
81 printf("Running %d openmp threads\n", omp_get_max_threads());
82 printf("SP GFlops = %f\n", Gflops);
83
84 return 0;
85
86 }
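By the arithmetic used earlier, the single-precision peak per core is twice the double-precision figure, because the 512-bit vector unit holds 16 SP elements:

2 (FMA) × 16 (SP SIMD lanes) × 1.1 (GHz) = 35.2 Gflops per core

which, across 60 cores, gives the 2112 Gflops quoted at the start of this section.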
Understanding Intel Xeon Phi Cache Performance
http://www.bitmover.com/lmbench/. I downloaded lmbench3 to do the experiment. To cross-compile the benchmark for the coprocessor, the compiler-selection fragment of the build script was modified so that it sets CC to icc -mmic, as shown below:

#!/bin/sh
if [ "X$CC" != "X" ] && echo "$CC" | grep -q '`'
then
CC=
fi
if [ X$CC = X ]
then CC=cc
for p in `echo $PATH | sed 's/:/ /g'`
do if [ -f $p/gcc ]
then CC="icc -mmic"   # use the Intel cross-compiler instead of gcc
fi
done
fi
echo $CC
Running

command_prompt> make -f Makefile.mic

builds the benchmark binaries, which in my case ended up in the folder bin/x86_64-linux-gnu. The next step was to copy the benchmark binary I was interested in, lat_mem_rd, to the /tmp directory on the coprocessor micro OS using the scp command, and then run it there:

command_prompt-mic0 >./lat_mem_rd -P 1 -N 10 32 64
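lat_mem_rd measures load-to-use latency by chasing a chain of pointers, so that every load depends on the result of the previous one. The following is a minimal sketch of that idea, not the lmbench source; the footprint and stride are hypothetical parameters you would sweep, as lat_mem_rd does, and elapsedTime is the gettimeofday wrapper shown earlier:

#include <stdio.h>
#include <stdlib.h>

extern double elapsedTime(void);   /* gettimeofday wrapper shown earlier */

int main(void)
{
    size_t n = 1 << 20;            /* pointers in the working set */
    size_t stride = 8;             /* hop distance, in elements */
    size_t iters = 100000000;
    void **buf = (void **)malloc(n * sizeof(void *));
    if (!buf) return 1;

    /* Link the buffer into a strided pointer cycle */
    for (size_t i = 0; i < n; i++)
        buf[i] = &buf[(i + stride) % n];

    /* Each load depends on the previous one, so the total time divided
       by the iteration count approximates the load-to-use latency at
       this working-set size. */
    void **p = &buf[0];
    double t0 = elapsedTime();
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;
    double dt = elapsedTime() - t0;

    printf("%p avg latency: %.2f ns\n", (void *)p, dt / iters * 1e9);
    free(buf);
    return 0;
}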