Intel Xeon Phi Execution Models
-
Offload execution mode. Also known as the heterogeneous programming mode, the host system in this mode offloads part or all of the computation from one or multiple processes or threads running on the host. The application starts execution on the host. As the computation proceeds, it can decide to send data to the coprocessor and let the coprocessor work on it. The host and the coprocessor may or may not work in parallel in the offload execution model. This is the common execution model in other coprocessor operating environments. As of this writing, there is an OpenMP 4.0 TR being proposed and implemented in Intel Composer XE to provide directives to perform offload computations. Composer XE also provides some custom directives to perform offload operations. This mode of operation is available on both Linux and Windows.
-
Coprocessor native execution mode. An Intel Xeon Phi has a Linux micro OS running in it and can appear as another machine connected to the host, like another node in a cluster. This execution environment allows the users to view the coprocessor as another compute node. In order to run natively, an application has to be cross-compiled for the Xeon Phi operating environment. Intel Composer XE provides a simple switch to generate cross-compiled code.
-
Symmetric execution. In this mode the application processes run on both the host and the Intel Xeon Phi coprocessor. They usually communicate through a message-passing library such as the Message Passing Interface (MPI). This execution environment treats the Xeon Phi card as another node in a heterogeneous cluster.
Development Tools for Intel Xeon Phi Architecture
Intel Composer XE
Getting the Tools
http://software.intel.com/en-us/intel-software-evaluation-center/
Using the Compilers
Set up the compiler environment with the compilervars.sh script included with the compiler. If you installed the compiler in the default path chosen by the installer, the script to set the environment can be found at /opt/intel/composerxe/bin/compilervars.sh. To set the path, invoke:
> source /opt/intel/composerxe/bin/compilervars.sh intel64
Use icc for building C source files and icpc for building and linking C++ source files. For Fortran sources, use the ifort command for both compiling and linking. Make sure you link with the appropriate command, as each of these commands links to the proper libraries to produce the executable.
Run icc -help to figure out the appropriate options for your compiler builds. In most cases, unless asked specifically to compile only, an icc or icpc command invokes both the compiler and the linker. In fact, the commands icc, icpc, and ifort are driver programs that parse the command-line arguments and pass them to the compiler or the linker as necessary. The driver program processes the input files and calls the linker with the object files created, as well as the library files necessary to generate the final executables or libraries. That is why it is important to use the proper compiler command so that the appropriate libraries can be linked.
File extensions | Interpretation | Execution |
---|---|---|
.c | C source file | C/C++ compiler |
.C, .CC, .cc, .cpp, .cxx | C++ source file | C++ compiler |
.f, .for, .ftn, .i, .fpp, .FPP, .F, .FOR, .FTN | Fixed form Fortran | Fortran compiler |
.f90, .i90, .F90 | Free form Fortran | Fortran compiler |
.a, .so, .o | Library, object files | Linker |
.s | Assembly file | Assembler |
<compiler name> [options] file1 [file2…]
Here <compiler name> is one of the compiler names such as icc, icpc, or ifort, and [options] are options that are passed to the compiler and can control code generation, optimization, and output file names, type, and path.
If no [options] are specified, the compiler invokes some default options, such as -O2 for default optimization. If you want to modify the default options for compilation, you need to modify the corresponding configuration file, named icc.cfg, icpc.cfg, and so forth, found in the installed <compiler install path>bin/intel64_mic or similar folders. Please refer to the compiler manual for details.
Setting Up an Intel Xeon Phi System
Install the MPSS Stack
Visit the Intel developer site (http://software.intel.com/mic-developer), go to the Tools & Downloads tab, and select “Intel Many Integrated Core Architecture (Intel MIC Architecture) Platform Software Stack.” Download the version of the MPSS that matches your host OS, and also download the readme.txt from the same location.
user_prompt> ssh-keygen
user_prompt> sudo service mpss stop
user_prompt> sudo micctrl --resetconfig
user_prompt> sudo service mpss start
command_prompt> yum install --nogpgcheck --noplugins --disablerepo=* *.rpm
command_prompt> micctrl -r
command_prompt> /opt/intel/mic/bin/micinfo
The micinfo utility prints out the Intel Xeon Phi–related information, including the flash file.
The flash files are located in /opt/intel/mic/flash and should match those printed out by micinfo. If the installed version is older than the one available with the new MPSS you are installing, update the flash with the micflash utility. Please refer to the readme.txt provided with the documentation to select the proper flash file. Once you have determined the proper flash file for the revision of the card on your system, use the following command to flash:
command_prompt> /opt/intel/mic/bin/micflash -Update /opt/intel/mic/flash/<your flash file name>
The MPSS service is controlled with the service mpss start|stop|restart commands. Refer to the readme.txt if you encounter any issue starting the card.
Install the Development Tools
Code Generation for Intel Xeon Phi Architecture
Native Execution Mode
Hello World Example
Save the following source as test.c:
//Content of test.c
#include <stdio.h>
int main()
{
    printf("Hello world from Intel Xeon Phi\n");
    return 0;
}
Build the code for native execution with the -mmic switch as follows:
command_prompt> icc -mmic test.c -o test.out
This creates the output file test.out in the same folder as your source. Now copy the executable to the Intel Xeon Phi mic0 as follows:
command_prompt> scp test.out mic0:
This copies the test.out file to your home directory on the coprocessor environment.
Log in to the coprocessor with the ssh command as follows:
command_prompt> ssh mic0
[command_prompt-mic0]$ ls
test.out
The listing shows test.out on the native coprocessor environment. If you run it on the card, it prints:
command_prompt-mic0> ./test.out
Hello world from Intel Xeon Phi
command_prompt-mic0>
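One way to convince yourself that a natively built binary is running on the card rather than on the host is to print how many logical processors the operating system reports, since that count is expected to differ between the two. The following is a minimal sketch, not part of the original example. It assumes a Linux-like environment where sysconf accepts _SC_NPROCESSORS_ONLN, and the helper names online_cpus and report_environment are hypothetical.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: number of logical processors currently online,
   or -1 if the system cannot report it.
   Assumes _SC_NPROCESSORS_ONLN is supported (true on Linux/glibc). */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Print a greeting that includes the processor count. */
void report_environment(void)
{
    printf("Hello world, running with %ld logical processors\n",
           online_cpus());
}
```

Calling report_environment() from main and building with the same -mmic switch would run the check on the coprocessor; building without the switch runs it on the host for comparison.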
Language Extensions to Support Offload Computation on Intel Xeon Phi
Heterogeneous Computing Model and Offload Pragmas
Language Extensions and Execution Model
Terminology
-
device. A device may be a host or may consist of one or more coprocessors with their own memory. A host device is the device executing the main thread. A target device executes the offloaded code segment.
-
offload. The process of sending a computation from host to target.
-
data environment. The variables associated with a given execution environment.
-
device data environment. A data environment associated with target data or a target construct.
-
mapped variable. Either of the two variables involved when a variable in a data environment is mapped to a variable in a device data environment. The original and corresponding variables may share storage.
-
mappable type. A valid data type for a mapped variable.
Offload Function and Data Declaration Directives
declare target Directives
The declare target directives declare data, functions, and subroutines that should be available in a target (coprocessor) execution environment. They allow the creation of versions of the specified functions or data that can be used inside a target region executing on the coprocessor.
Syntax
C/C++
#pragma omp declare target new-line
declaration-definition-sequence
#pragma omp end declare target new-line
Fortran
!$omp declare target (list) new-line
Variables and functions declared between declare target and end declare target, or named in the list argument, are created in the device context and can be used or executed in the target region.
Restrictions
-
Threadprivate variables cannot be in a declare target directive.
-
Variables declared in a declare target directive must have a mappable type.
-
In C/C++, the variables declared in a declare target directive must be at file or namespace scope.
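The declare target rules can be illustrated with a small sketch: a file-scope variable and a function are placed between the two pragmas so that device versions exist, and a target region then calls the function. This is an assumption-laden illustration, not code from the original text; the names scale_factor, scaled, and offload_scaled are hypothetical.

```c
/* Everything between the two pragmas is also compiled for the device. */
#pragma omp declare target
int scale_factor = 3;            /* file scope, mappable type */

int scaled(int x)
{
    return scale_factor * x;
}
#pragma omp end declare target

/* Hypothetical host-side wrapper that runs scaled() in a target region. */
int offload_scaled(int x)
{
    int result = 0;
    /* x is sent to the device; result is copied back on exit. */
    #pragma omp target map(to: x) map(from: result)
    {
        result = scaled(x);
    }
    return result;
}
```

When the compiler is not invoked with OpenMP enabled, the pragmas are ignored and the same code simply runs on the host, which makes the sketch easy to check before trying it with a coprocessor.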
Function Offload and Execution Constructs
The pragma target and pragma target data directives provide the capability to offload computations to one or more coprocessors.
Target Data Directive
Syntax
C/C++
#pragma omp target data [clause [[,] clause],...] new-line
structured-block
Fortran
!$omp target data [clause[[,] clause],...]
structured-block
!$omp end target data
-
device(scalar-integer-expression)
The integer expression must evaluate to a non-negative number that identifies one of the coprocessors available on a host. If no device is specified, the default device is determined by the internal control variable (ICV) named default-device-var (set through the OMP_DEFAULT_DEVICE OpenMP environment variable). The default data environment is constructed from the enclosing device environment, the data environment of the enclosing task, and the data mapping clauses in the construct.
-
map([map-type:]list)
These are data motion clauses that allow copying and mapping of variables or common blocks between the host scope and the target device scope. The map types are:
  -
  alloc indicates that the data are allocated on the device and have an undefined initial value.
  -
  to declares that on entering the region, each new data item in the list is initialized to the original list item value.
  -
  from declares that the data elements are “out” type and are copied from the device data to the host data on exit from the region.
  -
  tofrom (default) declares that the data elements are both in and out type: values are copied to the corresponding data elements on the device on entry and back to the host on exit.
If a list item is an array section, it must specify a contiguous region.
-
if(scalar-expr)
If the scalar expression evaluates to false, the device is the host.
Restrictions
-
At most one device clause may appear on the directive. The device expression must evaluate to a non-negative integer value.
-
At most one if clause can appear on the directive.
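A target data region is typically used to keep data resident on the device across several offloaded computations. The sketch below illustrates that idea under stated assumptions: the function name scale_then_shift and its parameters are hypothetical, and the map clause is repeated on the inner target constructs on the assumption that data already present on the device is not copied again.

```c
/* Hypothetical helper: computes a[i] = a[i] * factor + offset in two passes. */
void scale_then_shift(float *a, int n, float factor, float offset)
{
    /* Copy a[0:n] to the device once on entry and back once on exit. */
    #pragma omp target data map(tofrom: a[0:n])
    {
        /* a[0:n] is already present on the device, so repeating the map
           clause here is not expected to trigger another copy. */
        #pragma omp target map(tofrom: a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= factor;

        #pragma omp target map(tofrom: a[0:n])
        for (int i = 0; i < n; i++)
            a[i] += offset;
    }
}
```

Without the enclosing target data region, each of the two target constructs would move the array to the device and back on its own.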
Target Directive
-
A target region begins as a single thread of execution and executes sequentially, as if enclosed in an implicit task region, called the initial device task region.
-
When a target construct is encountered, the target region is executed by the implicit device task.
-
The task that encounters the target construct waits at the end of the construct until execution of the region completes. If a coprocessor does not exist, is not supported by the implementation, or cannot execute the target construct, then the target region is executed by the host device.
-
The data environment is created at the time the construct is encountered, if needed. Whether a construct creates a data environment is defined in the description of the construct.
Syntax
C/C++
#pragma omp target [clause[[,] clause],...] new-line
structured-block
Fortran
!$omp target [clause[[,] clause],...]
structured-block
!$omp end target
-
device(scalar-integer-expression)
The integer expression must evaluate to a non-negative number that identifies one of the coprocessors available on a host. If no device is specified, the default device is determined by the ICV named default-device-var (set through the OMP_DEFAULT_DEVICE OpenMP environment variable). The default data environment is constructed from the enclosing device environment, the data environment of the enclosing task, and the data mapping clauses in the construct.
-
map([map-type:]list)
These are data motion clauses that allow copying and mapping of variables or common blocks between the host scope and the target device scope. The map types are:
  -
  alloc indicates that the data are allocated on the device and have an undefined initial value.
  -
  to declares that on entering the region, each new data item in the list is initialized to the original list item value.
  -
  from declares that the data elements are “out” type and are copied from the device data to the host data on exit from the region.
  -
  tofrom (default) declares that the data elements are both in and out type: values are copied to the corresponding data elements on the device on entry and back to the host on exit.
If a list item is an array section, it must specify a contiguous region.
-
if(scalar-expr)
If the scalar expression evaluates to false, the device is the host.
If an if clause is present and the logical expression inside the if clause evaluates to false, the target region is executed on the host rather than on the device.
Restrictions
-
If a target, target update, or target data construct appears within a target region, the behavior is undefined.
-
The result of an omp_set_default_device, omp_get_default_device, or omp_get_num_devices routine called within a target region is unspecified.
-
The effect of access to a threadprivate variable in a target region is unspecified.
-
A variable referenced in a target construct that is not declared in the construct is implicitly treated as if it had appeared in a map clause with a map type of tofrom.
-
A variable referenced in a target region but not declared in the target construct must appear in a declare target directive.
-
C/C++ specific: A throw executed inside a target region must cause execution to resume within the same target region, and the same thread that threw the exception must catch it.
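The if clause is useful when offloading only pays off for large enough inputs. The following sketch shows one possible use, assuming a caller-supplied threshold; the function name sum_maybe_offload and the parameter min_offload are hypothetical and do not come from the original text.

```c
/* Hypothetical helper: sums v[0..n-1], offloading only when n >= min_offload.
   When the if clause evaluates to false, the region runs on the host. */
long sum_maybe_offload(const int *v, int n, int min_offload)
{
    long sum = 0;
    #pragma omp target if(n >= min_offload) map(to: v[0:n]) map(tofrom: sum)
    {
        for (int i = 0; i < n; i++)
            sum += v[i];
    }
    return sum;
}
```

The result is the same on either path; only the place where the loop executes changes, so the threshold can be tuned without touching the computation.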
Target Update Directive
Syntax
C/C++
#pragma omp target update motion-clause [clause[[,] clause],...] new-line
Fortran
!$omp target update motion-clause [clause[[,] clause],...]
where motion-clause is one of the following:
to(list)
from(list)
Each list item in a to or from clause corresponds to a device item and a host list item. The from clause corresponds to out data from the device to the host, and the to clause corresponds to in data from the host to the device.
-
device(scalar-integer-expression): If this clause is absent, the default device is determined by the ICV default-device-var.
-
if(scalar-expr): If the scalar expression evaluates to false, the target update directive is ignored.
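A typical use of target update is to refresh the device copy of data that the host has changed while a target data region is still open. The sketch below is an illustration under assumptions rather than code from the original text; the function name sum_before_and_after is hypothetical.

```c
/* Hypothetical helper: accumulates the sum of a[0..n-1] on the device,
   doubles the array on the host, pushes the new values to the device
   with target update, and accumulates the sum again. */
int sum_before_and_after(int *a, int n)
{
    int total = 0;
    #pragma omp target data map(to: a[0:n]) map(tofrom: total)
    {
        #pragma omp target map(to: a[0:n]) map(tofrom: total)
        for (int i = 0; i < n; i++)
            total += a[i];

        /* Host-side change: the device copy is now stale. */
        for (int i = 0; i < n; i++)
            a[i] *= 2;

        /* Copy the modified host values to the existing device copy. */
        #pragma omp target update to(a[0:n])

        #pragma omp target map(to: a[0:n]) map(tofrom: total)
        for (int i = 0; i < n; i++)
            total += a[i];
    }
    return total;   /* original sum plus doubled sum */
}
```

Note that the target update directive sits inside the target data region but outside any target region, in line with the restriction on nesting it within a target region.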
Runtime Library Routines
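void omp_set_default_device(int device_num);
int omp_get_default_device();
These routines set and get the default device held in the ICV default-device-var. The corresponding environment variable is OMP_DEFAULT_DEVICE.
int omp_get_num_devices();
This routine returns the number of target devices.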
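The device routines can be combined to pick a default device at startup. The following is a minimal sketch, assuming an OpenMP 4.0 capable compiler when _OPENMP is at least 201307; the helper name select_first_device is hypothetical, and the fallback branch exists only so that the code also builds without OpenMP.

```c
#if defined(_OPENMP) && _OPENMP >= 201307
#include <omp.h>
#endif

/* Hypothetical helper: returns the number of target devices and, if any
   exist, makes device 0 the default. Returns 0 when built without
   OpenMP 4.0 support. */
int select_first_device(void)
{
#if defined(_OPENMP) && _OPENMP >= 201307
    int ndev = omp_get_num_devices();
    if (ndev > 0)
        omp_set_default_device(0);
    return ndev;
#else
    return 0;
#endif
}
```

Calling this once on the host, outside any target region, avoids the unspecified behavior of invoking these routines inside a target region.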
Offload Example
1  // Sample code reduction.cpp
2  // Example showing use of OpenMP 4.0 pragmas for offload calculation
3  // This code was compiled with Intel(R) Composer XE 2013
4
5  #include <stdio.h>
6
7  #define SIZE 1000
8  #pragma omp declare target
9  int reduce(int *inarray)
10 {
11
12     int sum=0;
13     #pragma omp target map(inarray[0:SIZE]) map(sum)
14     {
15         for(int i=0;i<SIZE;i++)
16             sum += inarray[i];
17     }
18     return sum;
19 }
20 #pragma omp end declare target
21 int main()
22 {
23     int inarray[SIZE], sum, validSum;
24
25     validSum=0;
26     for(int i=0; i<SIZE; i++){
27         inarray[i]=i;
28         validSum+=i;
29     }
30
31     sum=0;
32     sum = reduce(inarray);
33
34     printf("sum reduction = %d, validSum=%d\n",sum, validSum);
35 }
Line 13 contains the directive #pragma omp target map(inarray[0:SIZE]) map(sum), which causes the code block in lines 14 to 17 to be sent to the coprocessor for computing. In this case it computes the reduction of an array of numbers and returns the computed value to the host through the sum variable. The inarray and sum variables are copied into the coprocessor before the computation and back out after it.
The main function calls the reduce function in line 32. That function in turn offloads part of the computation to the Intel Xeon Phi to compute the sum reduction. Once the reduction is done, the result received from the coprocessor is returned to the main function (line 32) and compared against the reduction done on the host to validate the results.
Compile the code as follows:
command_prompt> icpc -openmp reduction.cpp -o test.out
Run test.out on the host environment as shown in Listing 2-2 using:
command_prompt> ./test.out
[Offload] [MIC 1] [File] reduction.cpp
[Offload] [MIC 1] [Line] 13
[Offload] [MIC 1] [Tag] Tag0
[Offload] [MIC 1] [CPU Time] 0.000000 (seconds)
[Offload] [MIC 1] [CPU->MIC Data] 4012 (bytes)
[Offload] [MIC 1] [MIC Time] 0.000177 (seconds)
[Offload] [MIC 1] [MIC->CPU Data] 4004 (bytes)
sum reduction = 499500, validSum=499500
Because inarray is only read on the coprocessor, the data transfer can be reduced by changing its map clause to map(to:inarray[0:SIZE]) in line 13. The output of such a modification is shown in Listing 2-3 below:
[Offload] [MIC 1] [File] reduction.cpp
[Offload] [MIC 1] [Line] 13
[Offload] [MIC 1] [Tag] Tag0
[Offload] [MIC 1] [CPU Time] 0.000000 (seconds)
[Offload] [MIC 1] [CPU->MIC Data] 4004 (bytes)
[Offload] [MIC 1] [MIC Time] 0.000156 (seconds)
[Offload] [MIC 1] [MIC->CPU Data] 4 (bytes)
Note the reduction in the [MIC->CPU Data] value, from 4004 bytes to 4 bytes.
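For reference, the modified offload can be sketched as a standalone function. This is an illustrative variant rather than the exact listing: the name reduce_to_only is hypothetical, and the map type of sum is written out explicitly as tofrom, which is also the default.

```c
#define SIZE 1000

/* Variant of the reduction: inarray is only sent to the device
   (map type to), while sum travels both ways (map type tofrom). */
int reduce_to_only(int *inarray)
{
    int sum = 0;
    #pragma omp target map(to: inarray[0:SIZE]) map(tofrom: sum)
    {
        for (int i = 0; i < SIZE; i++)
            sum += inarray[i];
    }
    return sum;
}
```

Only the 4-byte sum needs to come back from the coprocessor in this form, which is consistent with the smaller device-to-host transfer in the report.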