Xeon Phi Vector Microarchitecture
The VPU Pipeline
- Double-precision (DP) pipeline: Executes float64 arithmetic, conversions from float64 to float32, and DP compare instructions.
- Single-precision (SP) pipeline: Executes most of the instructions, including float32/int32 arithmetic and logical operations, shuffle/broadcast, loads (including loadunpack), type conversions from float32/int32 pipelines, extended math unit (EMU) transcendental instructions, int64 loads, int64/float64 logical operations, and other instructions.
- Mask pipeline: Executes mask instructions with one-cycle latencies.
- Store pipeline: Executes the vector store operations.
- Scatter/gather pipeline: Executes the vector register reads/writes from sparse memory locations.
VPU Instruction Stalls
Pairing Rule
Type | Instruction Mnemonics |
---|---|
Vector mask instructions | JKNZ, JKZ, KAND, KANDN, KANDNR, KCONCATH, KCONCATL, KEXTRACT, KMERGE2L1H, KMERGE2L1L, KMOV, KNOT, KOR, KORTEST, KXNOR, KXOR |
Vector store instructions | VMOVAPD, VMOVAPS, VMOVDQA32, VMOVDQA64, VMOVGPS, VMOVPGPS |
Vector packstore instructions | VPACKSTOREHD, VPACKSTOREHPD, VPACKSTOREHPS, VPACKSTOREHQ, VPACKSTORELD, VPACKSTORELPD, VPACKSTORELPS, VPACKSTORELQ, VPACKSTOREHGPS, VPACKSTORELGPS |
Vector prefetch instructions | VPREFETCH0, VPREFETCH1, VPREFETCH2, VPREFETCHE0, VPREFETCHE1, VPREFETCHE2, VPREFETCHENTA, VPREFETCHNTA |
Scalar instructions | CLEVICT0, CLEVICT1, BITINTERLEAVE11, BITINTERLEAVE21, TZCNT, TZCNTI, LZCNT, LZCNTI, POPCNT, QUADMASK |
Vector Registers
The VPU state consists of thirty-two 512-bit vector registers (zmm0-zmm31), eight 16-bit mask registers (K0-K7), and the status register (MXCSR), as shown in Figure 3-2.
Vector Mask Registers
Vector mask registers control which elements of a destination register are updated. For example, when two vectors v2 and v3 are added, depending on the mask register bit values, only the v1 register elements that correspond to 1 bits in the k1 register get updated. The other elements, corresponding to bit values of 0, remain unchanged, unlike implementations where these elements can get cleared. For some operations, such as the vector blend operation (VBLEND*), the mask can be used to select the element from one of the operands to be output. When no mask is specified, the default mask 0xFFFF is implied. There is a special mask register designated k0, which represents the default value of 0xFFFF and is not allowed to be specified as a write mask, because it is implied when no mask register is used. Although k0 cannot be used as a write mask, it can be used for other mask register purposes, such as holding carry bits from integer arithmetic vector operations, comparison results, and so forth. One way to remember this restriction is that any mask register specified inside a braces {} mask specifier cannot be k0, but k0 can be used in other locations.
Extended Math Unit
Instructions | Latency (cycles) | Throughput (cycles) |
---|---|---|
Exp2 | 8 | 2 |
Log2 | 4 | 1 |
Recip | 4 | 1 |
Rsqrt | 4 | 1 |
Power | 16 | 4 |
Sqrt | 8 | 2 |
Div | 8 | 2 |
Xeon Phi Vector Instruction Set Architecture
Data Types
- Packed 32-bit integers (or dword)
- Packed 32-bit single-precision FP values
- Packed 64-bit integers (or qword)
- Packed 64-bit double-precision FP values
Memory Stored Data Type | Destination Register Data Type | | | |
---|---|---|---|---|
| float32 | float64 | int32/uint32 | int64/uint64 |
float16 | Yes | No | No | No |
float32 | Yes | Yes | Yes | No |
sint8 | Yes | No | Yes | No |
uint8 | Yes | No | Yes | No |
int16 | Yes | No | Yes | No |
uint16 | Yes | No | Yes | No |
int32 | Yes | Yes | Yes | No |
uint32 | Yes | Yes | Yes | No |
float64 | Yes | Yes | Yes | No |
int64/uint64 | No | No | No | Yes |
Vector Nomenclature
Vector Instruction Syntax
vop v0{mask}, v1, v2|mem {swizzle}
Here vop indicates the vector operator; v0, v1, and v2 are vector registers defined in the ISA; mem is a memory pointer; {mask} indicates an optional masking operation; and {swizzle} indicates an optional data element permutation or broadcast operation. The mask and swizzle operations are covered in detail later in this chapter. The register v0 in the above syntax is also the output operand of the instruction; the output may be masked with an optional mask, and the input operand may be modified by swizzle operations.

For an instruction with a single source, the syntax is:

v0 <= vop (v1|mem)
Here vop is the vector operator; v1 is a vector register; and mem represents a memory reference. The memory reference conforms to the standard Intel 64 ISA and can use direct or indirect addressing, with offset, scale, and other modifiers to calculate the address. An example is vcvtpu2ps, whose mnemonic indicates a vector (the vcvt part of the mnemonic) of unsigned integers (pu) converted to (2) a vector of floats (ps).

For an instruction with two sources:

v0 <= vop (v1, v2|mem)
Here vop operates on the inputs v1 and v2 or mem and writes the output to v0. The swizzle/broadcast modifiers may be added to v2/mem, and the mask operator can be used to select which outputs of the vector operation update the v0 register. An example is vaddps, an instruction to add two vectors of floating-point data.

Some instructions also use the destination register as a source:

v0 <= vop (v0,v1,v2/mem)

Here the operator reads v0, v1, and v2 and writes the result of the operation back to the register v0.

Xeon Phi Vector ISA by Categories
Mask Operations
vop v1[{k1}], v2, v3|mem
The bits of the mask register k1 determine which elements of the vector v1 will be written to by this operation. For 64-bit data types, only the least significant eight bits of the mask register are used as mask bits. Here k1 works as a write mask: if the mask bit corresponding to an element is zero, the corresponding element of v1 remains unchanged; otherwise it is overwritten by the corresponding element of the output of the computation. The square brackets indicate optional arguments.

The mask register k0 has all bits set to 1. It is the default mask register for all instructions that do not have a mask specified. The behavior of mask operations was described in the section of this chapter, “Vector Registers.”

Swizzle, Shuffle, Broadcast, and Convert Instructions
Swizzle
vectorop v0, v1, v2/mem{swizzle},
where v0, v1, and v2 represent vector registers. The notation {dcba} denotes the 32-bit elements that form one 128-bit block (lane) in the source, with a being the least significant element and d the most significant. {aaaa} means that the least significant element of each lane of a source register with this swizzle modifier is replicated to all four elements of the same lane. When the source is a register, this functionality is the same for both integer and floating-point instructions. The first few swap patterns in the table ({cdab}, {badc}, {dacb}) are used to shuffle elements within a lane for arithmetic manipulations such as cross products, horizontal adds, and so forth. The last four “repeat element” patterns are useful in many operations, such as scalar-vector arithmetic.

Function: 4 × 32 bits / 4 × 64 bits | Usage {swizzle} |
---|---|
No swizzle | No swizzle modifier (default) or {dcba} |
Swap inner pairs | {cdab} |
Swap with two away | {badc} |
Cross product swizzle | {dacb} |
Broadcast ‘a’ element across 4-element packets | {aaaa} |
Broadcast ‘b’ element across 4-element packets | {bbbb} |
Broadcast ‘c’ element across 4-element packets | {cccc} |
Broadcast ‘d’ element across 4-element packets | {dddd} |
An example is vorpi v0,v1,v2{aaaa}, which shows the replication of the least significant element of each 4-element lane due to the swizzle operation {aaaa}. Intel compiler intrinsics define __MM_SWIZZLE_ENUM to express these permutations.
Register Memory Swizzle
Data Broadcasts
Memory operands support three broadcast patterns, {1 to 16}, {4 to 16}, and {16 to 16}:

- In the {1 to 16} broadcast swizzle pattern, one 32-bit element pointed to by the memory pointer is read from memory and replicated 15 times, which together with the single element read in from memory creates 16 element entries.
- In the {4 to 16} broadcast, the first four elements pointed to by the memory pointer are read from memory and replicated three more times to create 16 element entries.
- The {16 to 16} broadcast is implied when no conversions are specified on memory reads, and all 16 elements load from memory into the register. {16 to 16} is the default pattern, in which no replication happens.
Data Conversions
Shuffles
Shuffle instructions can express replication patterns such as dddc, whereas swizzle cannot.

vpshufd zmm1{k}, zmm2/mem, imm8

This instruction shuffles the 32-bit elements of zmm2 using the index bits in imm8. The results are written to zmm1 after applying appropriate masking using the mask bits in k.

vpermf32x4 zmm1{k}, zmm2/mem, imm8

This instruction permutes the four 128-bit blocks of the source using the index bits in imm8. Intel compiler intrinsics define __MM_SWIZZLE_ENUM to express these permutations.

Shift Operation
Logical Shifts
vpslld zmm1{k1}, Si32(zmm2/mem), imm8
This instruction uses the imm8 value as the shift count. It performs an element-by-element logical left shift of the result of the swizzle, broadcast, or conversion of the input data zmm2/mem (indicated by the Si32 operation in the instruction mnemonic) by the shift count given by the immediate value imm8 and stores the result in zmm1 using write mask {k1}. If the shift count is more than 31, the result is set to all zeros. The write mask dictates which elements of the output register will be written to; the elements in the destination register zmm1 for which the corresponding bits in {k1} are clear retain their original values.

The corresponding right-shift instruction is vpsrld, where the r replacing the l in vpslld indicates the right shift operation. The logical right shift shifts a 0 bit into the MSB for each shift count. As with the left shift, if the shift count is greater than 31, all bits are set to zero.

vpsllvd zmm1{k1}, zmm2, Si32(zmm3/mem)
This instruction uses the result of Si32(zmm3/mem) to indicate the desired shift count for each element. It performs an element-by-element logical left shift of the 32-bit integer vector zmm2 by the int32 data computed by the swizzle, broadcast, or conversion of zmm3/mem and stores the result in zmm1 using write mask {k1}. If a shift count is more than 31, the result is set to all zeros. The write mask dictates which elements of the output register will be written to; the elements in the destination register zmm1 for which the corresponding bits in {k1} are clear retain their original values.

The corresponding right-shift instruction is vpsrlvd, where the r replacing the l in vpsllvd indicates the right shift operation. The logical right shift shifts a 0 bit into the MSB for each shift count. As with the left shift, if a shift count is greater than 31, all bits are set to zero.

Arithmetic Shifts
vpsrad zmm1{k1}, Si32(zmm2/mem), imm8
This instruction uses the imm8 value as the shift count. It performs an element-by-element arithmetic right shift of the result of the swizzle, broadcast, or conversion of the input data zmm2/mem (indicated by the Si32 operation in the instruction mnemonic) by the shift count given by the immediate value imm8 and stores the result in zmm1 using write mask {k1}. The arithmetic shift keeps the sign bit unchanged and shifts a copy of it into the MSB for each shift count. If the shift count is more than 31, the result is set to the original sign bit for all destination elements. The write mask dictates which elements of the output register will be written to; the elements in the destination register zmm1 for which the corresponding bits in {k1} are clear retain their original values.

vpsravd zmm1{k1}, zmm2, Si32(zmm3/mem)
This instruction uses the result of Si32(zmm3/mem) to indicate the desired arithmetic right shift count for each element. It performs an element-by-element arithmetic right shift of the 32-bit integer vector zmm2 by the int32 data computed by the swizzle, broadcast, or conversion of zmm3/mem and stores the results in zmm1 using write mask {k1}. The arithmetic shift keeps the sign bit unchanged and shifts a copy of it into the MSB for each shift count. If a shift count is more than 31, the result is set to the original sign bit for all destination elements. The write mask dictates which elements of the output register will be written to; the elements in the destination register zmm1 for which the corresponding bits in {k1} are clear retain their original values.

Sample Code for Swizzle and Shuffle Instructions
The sample code in Listing 3-1 uses the C++ vector class Is32vec16 provided with the Intel Xeon Phi compiler. In order to build and run this code, I went through the steps shown at the top of Listing 3-1. The middle section of Listing 3-1 shows the source code to test the shuffle instruction behavior using the C++ vector library and compiler intrinsics (these are functions you can call from C++ routines, which usually map into one assembly instruction). The intralane swizzle with the pattern cdab swaps inner pairs of each lane; the intralane shuffle with the pattern aaaa on the same input data replicates element a to each element of each lane; and, finally, the interlane shuffle with the data pattern aabc reorganizes the lanes.

Listing 3-1. Building and running shuftest.cpp on Intel Xeon Phi

//compiled the code
icpc -mmic shuftest.cpp -o shuftest
//copied output to Xeon Phi
scp shuftest mic0:/tmp
//Executed the binary
ssh mic0 "/tmp/shuftest"
shuftest.cpp
//-------------------------
//-- Program shuftest.cpp
//-- Author: Reza Rahman
//-------------------------
#define MICVEC_DEFINE_OUTPUT_OPERATORS
#include <iostream>
#include <micvec.h>
int main()
{
_MM_PERM_ENUM p32;
__declspec(align(64)) Is32vec16 inputData(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
__declspec(align(64)) Is32vec16 outputData;
std::cout << "input = " << inputData;
// swizzle input data and print
//
std::cout << "\nswizzle data for pattern 'cdab' \n" << inputData.cdab();
// swizzle input data and print
std::cout << "\n Intra lane shuffle data for pattern 'aaaa' \n";
p32 = _MM_PERM_AAAA;
//shuffle intra lane data
outputData = Is32vec16(_mm512_shuffle_epi32(__m512i(inputData), p32));
std::cout << outputData << "\n";
std::cout << " Inter lane shuffle data for pattern 'aabc' \n";
p32 = _MM_PERM_AABC;
//shuffle inter lane data
outputData = Is32vec16(_mm512_permute4f128_epi32(__m512i(inputData), p32));
std::cout << outputData << "\n";
}
Output of shuftest.cpp run on Intel Xeon Phi:

input = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}
swizzle data for pattern 'cdab'
{14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1}
Intra lane shuffle data for pattern 'aaaa'
{15, 15, 15, 15, 11, 11, 11, 11, 7, 7, 7, 7, 3, 3, 3, 3}
Inter lane shuffle data for pattern 'aabc'
{7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, 15, 14, 13, 12}
Arithmetic and Logic Operations
The arithmetic instruction mnemonics follow the patterns V*PS for SP arithmetic, V*PD for DP arithmetic, VP*D for int32, and VP*Q for int64. These instructions include nine MAX/MIN instructions: V*MAX*, V*MIN*. There are four hardware-implemented (EMU) transcendental instructions (VEXP223PS, VLOG2PS, VRCP23PS, and VRSQRT23PS). The hardware supports SP and DP floating-point denormals, and there is no performance penalty for working on denormals, so it does not assert DAZ (denormals are zero) or FTZ (flush to zero). For logical operations, the ISA contains seven compare instructions, V*CMP*, which compare vector elements and set the vector masks. There are also 15 Boolean instructions to implement logical operations.

Fused Multiply-Add
v1 = v1 vop1 v2 vop2 v3
where vop1 can be set to the multiply operation (x) and vop2 to the addition operation (+). The memory or swizzle operand is always v3. In order to simplify the coding effort for programmers, the ISA contains a series of FMA/FMS operations numbered with three digits to signify the association of the source vectors with specific operations, without having to remember the rules that tie specific sources to specific modifiers. For example, a basic SP vector FMA has three mnemonics, interpreted according to the three digits embedded in the mnemonic, as follows:

vfmadd132ps v1,v2,v3 => v1 = v1xv3 + v2
vfmadd213ps v1,v2,v3 => v1 = v2xv1 + v3
vfmadd231ps v1,v2,v3 => v1 = v2xv3 + v1

In each variant the memory or swizzle operand is always v3, but by choosing among the mnemonics the programmer can apply the swizzle operator to the appropriate source vector of interest.

A special FMA instruction, vfmadd233ps
, allows you to do a scale-and-bias transformation in one instruction, which can be useful in image-processing applications. This instruction uses four element sets (0–3, 4–7, 8–11, and 12–15) of source vector v2 and uses v3 elements as scale and bias to generate the results in vector v1. Thus vfmadd233ps v1,v2,v3 generates code equivalent to the following:

v1[3..0] = v2[3..0] x v3[1] + v3[0]
v1[7..4] = v2[7..4] x v3[5] + v3[4]
v1[11..8] = v2[11..8] x v3[9] + v3[8]
v1[15..12] = v2[15..12] x v3[13] + v3[12]
Data Access Operations (Load, Store, Prefetch, and Gather/Scatter)
The ISA includes unaligned load and store support through the V*LOADUNPACK* and V*PACKSTORE* operations, and 19 scatter or gather instructions implement the semantics for the various scatter or gather operations required by the many technical computing applications supported by this ISA. The mnemonics for these instructions are V*GATHER* and V*SCATTER*. In addition, the ISA supports eight consecutive-memory prefetch instructions, V*PREFETCH*, and six scattered-memory gather or scatter prefetch instructions to help prefetched data reach the various cache levels and to reduce data access latency when needed.

Memory Alignment
If the data accessed by a vector operation are not properly aligned, a #GP (General Protection) fault will occur. The alignment requirement is dictated by the number of data elements and the type of the data element. For example, if a vector operation needs to access 16 elements of four-byte (32-bit) SP floats, the referenced data elements must be 16 x 4 = 64 byte aligned (the number of elements times the size of a float). The Intel Xeon Phi memory alignment rules for vector operations are shown in Table 3-5.

Memory Storage Form | Number of Load/Store Elements | Needed Memory Alignment (bytes) |
---|---|---|
4 bytes (float, int32, uint32) | 1 (1 to 16 broadcast) | 4 |
 | 4 (4 to 16) | 16 |
 | 16 (16 to 16) | 64 |
2 bytes (float16, sint16, uint16) | 1 (1 to 16 broadcast) | 2 |
 | 4 (4 to 16) | 8 |
 | 16 (16 to 16) | 32 |
1 byte (sint8, uint8) | 1 (1 to 16 broadcast) | 1 |
 | 4 (4 to 16) | 4 |
 | 16 (16 to 16) | 16 |
Pack/Unpack
Unaligned vector loads are performed with the vloadunpackh*/vloadunpackl* instruction pairs. These instructions allow you to relax the memory alignment requirements by requiring alignment to the memory storage form only. As long as the address to load from is aligned to a memory-storage-form boundary, executing a pair of vloadunpackl* and vloadunpackh* will load all 16 elements with the default mask.

Non-temporal data
Streaming Stores
The streaming store instructions, VMOVNRAP* and VMOVNRNGOAP*, allow you to indicate that the data need to be written without reading the data first. In Xeon Phi the VMOVNRAPS/VMOVNRAPD instructions are able to optimize memory bandwidth in the case of a cache miss by not going through the unnecessary read step. The VMOVNRNGOAP* instructions are useful when the programmer can tolerate weak write-ordering of the application data; that is, the stores performed by these instructions are not globally ordered. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.1
Scatter/Gather
The ISA provides gather instructions vgatherd* to gather SP, DP, int32, or int64 elements into a vector register using signed dword indices for the source elements. These instructions can gather up to 16 32-bit elements located in up to 16 different cache lines. The number of elements gathered depends on the number of bits set in the mask register provided as a source to the instruction.

Prefetch Instructions
The vprefetch* instructions are implemented for performing software prefetch operations. Intel Xeon Phi also implements gather prefetch instructions, vgatherpf*. These instructions are critical where the hardware prefetcher is not able to bring the necessary data into the cache lines, causing cache-line misses and hence increased instruction retirement stalls. Prefetch instructions can be used to fetch data into L1 or L2 cache lines. Since this hardware implements an inclusive cache mechanism, data prefetched into L1 are also present in the L2 cache, but not vice versa.