Open Access | 2013 | Book


Intel® Xeon Phi™ Coprocessor Architecture and Tools

The Guide for Application Developers


About this book

Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers provides developers a comprehensive introduction and in-depth look at the Intel Xeon Phi coprocessor architecture and the corresponding parallel data structure tools and algorithms used in the various technical computing applications for which it is suitable. It also examines the source code-level optimizations that can be performed to exploit the powerful features of the processor.

Xeon Phi is at the heart of the world's fastest commercial supercomputer, which, thanks to the massively parallel computing capabilities of Intel Xeon processors coupled with Xeon Phi coprocessors, attained 33.86 petaflops of benchmark performance in 2013. Extracting such stellar performance in real-world applications requires a sophisticated understanding of the complex interaction among hardware components, Xeon Phi cores, and the applications running on them.

In this book, Rezaur Rahman, an Intel leader in the development of the Xeon Phi coprocessor and the optimization of its applications, presents and details all the features of Xeon Phi core design that are relevant to the practice of application developers, such as its vector units, hardware multithreading, cache hierarchy, and host-to-coprocessor communication channels. Building on this foundation, he shows developers how to solve real-world technical computing problems by selecting, deploying, and optimizing the available algorithms and data structure alternatives matching Xeon Phi's hardware characteristics. From Rahman's practical descriptions and extensive code examples, the reader will gain a working knowledge of the Xeon Phi vector instruction set and the Xeon Phi microarchitecture, whereby the cores execute 512-bit-wide vector instructions in parallel.

Table of Contents

Frontmatter

Hardware Foundation: Intel Xeon Phi Architecture

Frontmatter

Open Access

Chapter 1. Introduction to Xeon Phi Architecture
Abstract
Technical computing can be defined as the application of mathematical and computational principles to solve engineering and scientific problems. It has become an integral part of the research and development of new technologies in modern civilization. It is universally relied upon in all sectors of industry and all disciplines of academia for such disparate tasks as prototyping new products, forecasting weather, enhancing geosciences exploration, performing financial modeling, and simulating car crashes and the propagation of electromagnetic field from mobile phones.
Rezaur Rahman

Open Access

Chapter 2. Programming Xeon Phi
Abstract
Viewing the Intel Xeon Phi as a black box, you can infer its architecture from its responses to the impulses you provide it: namely, the software instructions you execute on the coprocessor. The objective of this book is to introduce you to Intel Xeon Phi architecture inasmuch
Rezaur Rahman
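
The compiler-based offload model that Chapter 2 introduces for executing instructions on the coprocessor can be pictured with a small example. The sketch below is my own minimal illustration, assuming the Intel C/C++ compiler and its Language Extensions for Offload pragmas; it is not code taken from the book.

    /* Minimal offload sketch (assumption: Intel C/C++ compiler with
       Language Extensions for Offload installed alongside MPSS).
       The marked block is executed on the coprocessor when one is
       present and falls back to the host otherwise. */
    #include <stdio.h>

    int main(void)
    {
        int sum = 0;

        /* inout() copies sum to the card before the block runs and
           back to the host afterward. */
        #pragma offload target(mic) inout(sum)
        {
            for (int i = 1; i <= 100; i++)
                sum += i;
        }

        printf("sum computed via offload: %d\n", sum);
        return 0;
    }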

Open Access

Chapter 3. Xeon Phi Vector Architecture and Instruction Set
Abstract
Two key hardware features that dictate the performance of technical computing applications on Intel Xeon Phi are the vector processing unit and the instruction set implemented in this architecture. The vector processing unit (VPU) in Xeon Phi provides data parallelism at a very fine grain, working on 512 bits at a time (sixteen single-precision floats or sixteen 32-bit integers). The VPU implements a novel instruction set architecture (ISA), with 218 new instructions compared with those implemented in the Xeon family of SIMD instruction sets.
Rezaur Rahman
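
As a concrete illustration of the 512-bit VPU described above, the sketch below adds two float arrays sixteen elements per instruction using Intel's _mm512_* intrinsics. It is an example of my own; the alignment and size assumptions are noted in the comments, and the book itself may present the material through different examples.

    /* 512-bit SIMD sketch: each _mm512_* operation processes sixteen
       single-precision floats. Assumptions: arrays are 64-byte aligned,
       n is a multiple of 16, and the code is built for the coprocessor
       (e.g., with the Intel compiler's -mmic option). */
    #include <immintrin.h>

    void vec_add(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m512 va = _mm512_load_ps(&a[i]);   /* load 16 floats  */
            __m512 vb = _mm512_load_ps(&b[i]);
            __m512 vc = _mm512_add_ps(va, vb);   /* one 512-bit add */
            _mm512_store_ps(&c[i], vc);
        }
    }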

Open Access

Chapter 4. Xeon Phi Core Microarchitecture
Abstract
A processor core is the heart that determines the characteristics of a computer architecture. It is where the arithmetic and logic functions are mostly concentrated. The instruction set architecture (ISA) is implemented in this portion of the circuitry. Yet, in a modern-day architecture such as Intel Xeon Phi, less than 20 percent of the chip area is dedicated to the core. A survey of the development of the Intel Xeon Phi architecture will elucidate why its coprocessor core is designed the way it is.
Rezaur Rahman

Open Access

Chapter 5. Xeon Phi Cache and Memory Subsystem
Abstract
The preceding chapter showed how the Intel Xeon Phi coprocessor uses a two-dimensional tiled architecture approach to designing manycore coprocessors. In this architecture, the cores are replicated on the die and connected through on-die wire interconnects. The network connecting the various functional units is a critical piece that may become a bottleneck as more cores and devices are added to the network in a chip multiprocessor (CMP) design such as the one Intel Xeon Phi uses. The interconnect design choices are primarily determined by the number of cores, expected interconnect performance, chip area limitation, power limit, process technology, and manufacturing efficiencies. Although manycore interconnect technology has benefited from existing research on interconnect topologies in multiprocessor systems, the close interaction among the cores, cache subsystem, memory, and external bus makes interconnect design for coprocessors especially challenging.
Rezaur Rahman

Open Access

Chapter 6. Xeon Phi PCIe Bus Data Transfer and Power Management
Abstract
This chapter looks at how the coprocessor is configured in a Xeon-based server platform and communicates with the host. It also looks at the power management capabilities built into the coprocessor to help reduce power consumption while idle. Figure 6-1 shows a system with multiple Intel Xeon Phi coprocessors and two-socket Intel Xeon processors. The coprocessor connects to the host through a PCI Express 2.0 x16 interface. Data transfer between the host memory and the coprocessor's GDDR memory can be done through programmed I/O or through direct memory access (DMA) transfers. To optimize the data transfer bandwidth for large buffers, one needs to use the DMA transfer mechanism. This chapter explains how to use high-level language features to enable DMA transfer. The hardware also allows peer-to-peer data transfers between two Intel Xeon Phi cards. Various data transfer scenarios are shown in Figure 6-1. The two Xeon Phi coprocessors A and B in the figure connect to PCIe channels attached to the same socket and can do a local peer-to-peer data transfer. The data transfer between Xeon Phi coprocessors B and C is a remote data transfer. These configurations play a key role in determining how the cards need to be set up for optimal performance.
Rezaur Rahman
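
The abstract's point about using high-level language features to obtain DMA transfers can be sketched as follows. This is an illustrative example of my own built on the Intel offload pragma: the 64 MB buffer size and 4 KB alignment are arbitrary choices meant to show the large, well-aligned buffer case in which the runtime can move data by DMA rather than programmed I/O.

    /* Host-to-coprocessor transfer sketch (assumption: Intel C/C++
       compiler offload pragmas). Large aligned buffers allow the
       runtime to use DMA; the size and alignment below are
       illustrative, not prescriptive. */
    #include <stdlib.h>

    #define N (64 * 1024 * 1024 / sizeof(float))   /* ~64 MB of floats */

    int main(void)
    {
        float *buf;
        if (posix_memalign((void **)&buf, 4096, N * sizeof(float)) != 0)
            return 1;
        for (size_t i = 0; i < N; i++) buf[i] = 1.0f;

        /* inout(...) copies the buffer to the card's GDDR memory before
           the region runs and copies the results back afterward. */
        #pragma offload target(mic:0) inout(buf : length(N))
        {
            for (size_t i = 0; i < N; i++) buf[i] *= 2.0f;
        }

        free(buf);
        return 0;
    }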

Software Foundation: Intel Xeon Phi System Software and Tools

Frontmatter

Open Access

Chapter 7. Xeon Phi System Software
Abstract
Intel Xeon Phi needs support from system software components to operate properly and interoperate with other hardware components in a system. The system software component of the Intel Xeon Phi system, known as the Intel Many Integrated Core (MIC) Platform Software Stack (MPSS), provides this functionality. Unlike other device drivers implemented to support PCIe-based hardware, such as graphics cards, Intel Xeon Phi was designed to support the execution of technical computing applications in the familiar HPC environment through the MPI environment, as well as other offload programming usage models. Because the coprocessor core is based on the traditional Intel P5 processor core, it can execute a complete operating system like any other computer. The disk drive is simulated by a RAM drive and supports an Internet protocol (IP)-based virtual socket to provide networking communication with the host. This design choice allows the coprocessor to appear as a node to the rest of the system and allows a usage model common in the HPC programming environment. The operating system resides on the coprocessor and implements complementary functionalities provided by the driver layer on the host side to achieve its system management goals.
Rezaur Rahman

Open Access

Chapter 8. Xeon Phi Application Development Tools
Abstract
In Chapter 2 we looked at how the Intel compiler may be used to build code for the Intel Xeon Phi coprocessor. This chapter goes more deeply and broadly into the tools available for development on the Intel Xeon Phi coprocessor. The Intel tools have many features that are outside the scope of this book, which focuses only on the features relevant to Xeon Phi development. For a general introduction to the tools, please refer to the documentation installed with the tools themselves. The tools can be divided into four broad categories:
Rezaur Rahman

Applications: Technical Computing Software Development on Intel Xeon Phi

Frontmatter

Open Access

Chapter 9. Xeon Phi Application Design and Implementation Considerations
Abstract
Parallel programming on any general-purpose processor such as Intel Xeon Phi requires careful consideration of the various aspects of program organization, algorithm selection, and implementation to achieve the maximum performance on the hardware.
Rezaur Rahman

Open Access

Chapter 10. Application Performance Tuning on Xeon Phi
Abstract
This chapter explains how to tune the performance of applications developed for Xeon Phi. The work of achieving optimal performance starts with designing your application with proper consideration of design and implementation choices, as discussed in Chapter 9. Once an application has been developed, you can tune it further by optimizing the code for the Xeon Phi coprocessor architecture. The tuning process involves the use of tools such as VTune, the compiler, code structuring, and libraries, in conjunction with your understanding of the architecture, to fix the issues that cause performance bottlenecks. The "artistic" aspect of the tuning process will emerge incrementally during the course of your hands-on work with the hardware and the application as you figure out how to apply various tools efficiently to optimize the code fragments that cause the bottlenecks. This chapter provides the best-known methods (BKMs) for starting to optimize code for the Xeon Phi coprocessor. I assume in this chapter that you have already parallelized the code as part of your algorithm design, as discussed in Chapter 9.
Rezaur Rahman
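
One representative source-level tuning step of the kind this chapter covers is telling the compiler about data alignment so that loops vectorize with aligned 512-bit accesses. The sketch below is my own illustration using the Intel compiler's __assume_aligned builtin and "vector aligned" pragma; it is one technique from the general toolbox, not a specific example taken from the book.

    /* Alignment-hint sketch (assumptions: Intel compiler, C99, and
       x/y actually allocated on 64-byte boundaries by the caller).
       The hints let the compiler emit aligned vector loads/stores. */
    void scale(float *restrict x, const float *restrict y, float a, int n)
    {
        __assume_aligned(x, 64);
        __assume_aligned(y, 64);
        #pragma vector aligned
        for (int i = 0; i < n; i++)
            x[i] = a * y[i];
    }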

Open Access

Chapter 11. Algorithm and Data Structures for Xeon Phi
Abstract
Algorithms and data structures appropriate for Xeon Phi are active fields of research and deserve a book of their own. This chapter will touch only on some common algorithm and data structure optimization techniques that I have found useful for common technical computing applications. These algorithms will definitely evolve as we gain more experience with the hardware. This chapter does not derive the algorithms but rather focuses on optimization techniques to achieve good performance on Xeon Phi. For example, I assume familiarity with Monte Carlo simulation techniques and the algorithms used in financial applications and instead focus on those components of the algorithms that can be optimized to make the most effective use of Xeon Phi architecture capabilities.
Rezaur Rahman
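
Since the chapter assumes familiarity with Monte Carlo techniques, a small stand-in kernel may help fix ideas. The sketch below estimates pi with an OpenMP-parallel Monte Carlo loop; it is a deliberately simplified example of my own (rand_r is used only for brevity, where a production Xeon Phi code would more likely use a vectorized generator such as the ones in Intel MKL).

    /* Simplified parallel Monte Carlo sketch (pi estimation), standing
       in for the financial kernels the chapter discusses. Assumptions:
       OpenMP-enabled compiler and a POSIX rand_r; both the RNG and the
       problem are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long trials = 100000000L;
        long hits = 0;

        #pragma omp parallel reduction(+ : hits)
        {
            unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
            #pragma omp for
            for (long i = 0; i < trials; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x * x + y * y <= 1.0)
                    hits++;
            }
        }

        printf("pi estimate: %f\n", 4.0 * (double)hits / (double)trials);
        return 0;
    }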

Open Access

Chapter 12. Xeon Phi Application Development on Windows OS
Abstract
So far we have looked at application development on the Linux OS for the Xeon Phi coprocessor. This chapter looks at the support available on the Windows OS for developing applications for Xeon Phi. Some application domains, such as computer-aided design (CAD) and other workstation applications that can benefit from the raw compute power of Intel Xeon Phi, are mostly used on the Windows OS. In such cases, you can offload the computationally intensive code sections to the coprocessor using an offload programming model such as the one based on the OpenMP 4.0 standard. Most of the concepts in this chapter also apply to the Linux development environment on Xeon Phi.
Rezaur Rahman
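
The OpenMP 4.0-based offload model mentioned in the abstract looks roughly like the sketch below. It is a generic illustration of the standard's target and map directives rather than code from the book, and the array size is an arbitrary choice.

    /* OpenMP 4.0 target-offload sketch. The map clauses describe the
       host-to-coprocessor and coprocessor-to-host data movement; when
       no target device is available the region runs on the host. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        float a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 0.0f; }

        #pragma omp target map(to: a) map(from: b)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * a[i];

        printf("b[%d] = %f\n", N - 1, b[N - 1]);
        return 0;
    }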

Open Access

Appendix A. OpenCL on Xeon Phi
Abstract
Open Computing Language (OpenCL) standardizes the language and application programming interfaces (APIs) for programming heterogeneous parallel computing systems, such as hosts containing Xeon Phi coprocessors. It is an open standard maintained by the Khronos Group industry consortium and adopted by various leading companies to make OpenCL-based applications portable across devices.
Rezaur Rahman
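
Because OpenCL exposes the coprocessor through the standard device model, a host program typically begins by enumerating accelerator devices. The sketch below is a minimal illustration of my own of that first step; it assumes a single installed platform (such as the Intel OpenCL runtime, under which Xeon Phi appears as an accelerator-type device).

    /* Minimal OpenCL host-side sketch: locate an accelerator device
       (Xeon Phi is exposed as CL_DEVICE_TYPE_ACCELERATOR) and print
       its name. Error handling and multi-platform support omitted. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        char name[256];

        clGetPlatformIDs(1, &platform, NULL);
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR,
                           1, &device, NULL) != CL_SUCCESS) {
            printf("no accelerator device found\n");
            return 1;
        }
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("accelerator: %s\n", name);
        return 0;
    }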

Open Access

Appendix B. Virtual Shared Memory Programming on Xeon Phi
Abstract
Although nonshared memory programming is widely used in developing applications for the Xeon Phi coprocessor, the Intel C/C++ compiler also supports virtual shared memory programming. The advantage of this programming model is that it allows more complex data structures to be shared between the host and the coprocessor, removing the requirement that the data objects be bitwise copyable (such that they can be copied with a simple memcpy) between the host and the coprocessor. With virtual shared memory constructs, the data copied between the host and the coprocessor can be arbitrarily complex, including pointer-based structures such as linked lists and trees. The data must be placed in the shared virtual space before the offload computation can be performed on the shared virtual memory objects. In this model, the underlying software implements and maintains the virtual memory that is shared between the host and the coprocessor.
Rezaur Rahman
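
In the Intel C/C++ compiler this virtual shared memory model is expressed with the _Cilk_shared and _Cilk_offload keywords. The sketch below is a small illustration of my own: a shared counter updated on the coprocessor and read on the host. Real uses would share pointer-based structures, as the abstract notes.

    /* Virtual shared memory sketch (assumption: Intel C/C++ compiler
       with the _Cilk_shared/_Cilk_offload extensions). Data marked
       _Cilk_shared occupies the same virtual address range on host
       and coprocessor, so it need not be bitwise copied explicitly. */
    #include <stdio.h>

    _Cilk_shared int counter = 0;          /* visible on both sides   */

    _Cilk_shared void bump(int n)          /* callable on both sides  */
    {
        for (int i = 0; i < n; i++)
            counter++;
    }

    int main(void)
    {
        _Cilk_offload bump(1000);          /* runs on the coprocessor */
        printf("counter after offload: %d\n", counter);
        return 0;
    }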
Backmatter
Metadata
Title
Intel® Xeon Phi™ Coprocessor Architecture and Tools
Author
Rezaur Rahman
Copyright Year
2013
Publisher
Apress
Electronic ISBN
978-1-4302-5927-5
Print ISBN
978-1-4302-5926-8
DOI
https://doi.org/10.1007/978-1-4302-5927-5
