Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

https://doi.org/10.1016/j.jpdc.2014.07.003

Highlights

  • We developed a performance portable programming model (PM) for manycore devices.

  • Unifying parallel dispatch and data layout is mandatory for performance portability.

  • The Kokkos C++ library implements this PM with pthreads, OpenMP, and CUDA back-ends.

  • We demonstrate Xeon Phi and NVIDIA GPU performance portability with mini-applications.

  • We recommend a strategy for migrating legacy application codes to manycore devices.

Abstract

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and a diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly fine levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.

Introduction

The Kokkos C++ library provides scientific and engineering codes with a programming model that enables performance portability across diverse and evolving manycore devices. Our performance portability objective is to maximize the amount of user code that can be compiled for diverse devices and obtain the same (or nearly the same) performance as a variant of the code that is written specifically for that device. Performance portability is our primary objective for a high performance computing (HPC) programming model, and we address usability only within this constraint. Future usability studies will be conducted in conjunction with early adoption of Kokkos by applications and domain libraries.

The scope of Kokkos has evolved from a hidden portability layer for sparse linear algebra kernels [2] to a hierarchy of broadly usable libraries. Our earlier implementation of Kokkos’ fundamental abstractions was referred to as KokkosArray [16], [17], [15]. These fundamental abstractions have persisted to the current version of Kokkos. The semantics, syntax, and implementation of Kokkos have significantly evolved in response to new device capabilities, performance evaluations, and usability evaluations through an expanding suite of mini-applications.

Our fundamental programming model abstractions are as follows:

  1. Kokkos executes computational kernels in a fine-grain data parallel manner within an execution space.

  2. Computational kernels operate on multidimensional arrays residing in memory spaces.

  3. Kokkos provides these multidimensional arrays with polymorphic data layout, similar to the flexible storage ordering of Boost.MultiArray [18].

Kokkos enables computational kernels to be performance portable across manycore architectures (i.e., CPU and GPU) by unifying these abstractions. A data parallel computational kernel’s data access pattern can have a significant impact on its performance. On a CPU a computational kernel should have a blocked data access pattern; on a GPU it should have a coalesced data access pattern. This conflict in data access pattern requirements is commonly referred to as the array of structures (AoS) versus structure of arrays (SoA) problem. We solve the AoS versus SoA performance portability problem by controlling the data parallel execution of computational kernels on a device, providing a multidimensional array data structure for those kernels to use, and choosing the multidimensional array layout that yields the required memory access pattern. Kokkos enables performance portable user code if that code is implemented with Kokkos’ multidimensional arrays and parallel execution capabilities.
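
As a concrete illustration (a minimal sketch using the Kokkos::View syntax and LayoutRight/LayoutLeft names of current public Kokkos releases, which postdate the API described in this paper; the particle-coordinate field is a hypothetical example, not code from the paper), the same logical rank-2 array can be given an AoS-like layout for a CPU or an SoA-like layout for a GPU without changing the kernel that indexes it:

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical field of 3D coordinates for N particles, stored as a rank-2 View.
// LayoutRight keeps each particle's x,y,z contiguous (AoS-like: blocked access on a CPU);
// LayoutLeft keeps each coordinate contiguous across particles (SoA-like: coalesced access on a GPU).
using CoordsCPU = Kokkos::View<double*[3], Kokkos::LayoutRight, Kokkos::HostSpace>;
#ifdef KOKKOS_ENABLE_CUDA
using CoordsGPU = Kokkos::View<double*[3], Kokkos::LayoutLeft, Kokkos::CudaSpace>;
#endif

// The kernel is written once against the logical index space (i, j); the layout
// chosen for the target memory space determines the physical access pattern.
template <class Coords>
KOKKOS_INLINE_FUNCTION double norm2(const Coords& x, const int i) {
  return x(i, 0) * x(i, 0) + x(i, 1) * x(i, 1) + x(i, 2) * x(i, 2);
}
```

In practice Kokkos selects a device-appropriate default layout for each memory space, so user code rarely needs to name a layout explicitly.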

Many programming models control fine-grain parallel execution, as enumerated in Table 1. These programming models take a variety of implementation approaches: a library within a standard programming language, directives added to a standard language (e.g., #pragma statements), language extensions supported by source-to-source translators, or language variants supported by a compiler. Among the programming models that we surveyed (Table 1), Kokkos is unique in that (1) it is purely a library approach, (2) it enables portability to both CPUs and GPUs, and (3) it provides polymorphic data layout. These three characteristics of our programming model are essential for performance portability and maintainability of HPC applications and domain libraries that must move to diverse and evolving manycore architectures.

Kokkos has thin back-end implementations that map portable user code to lower-level, device-specialized programming models. This software design allows us to choose the most performant back-end for each target device and to optimize Kokkos’ implementation for that back-end. Our current back-end implementations include CUDA [11] for NVIDIA GPUs, and pthreads [21] or OpenMP [43] for CPUs and the Intel Xeon Phi. The pthreads and OpenMP back-ends optionally use the Portable Hardware Locality (hwloc) library [5] for explicit placement of threads on cores. We use the Intel Xeon Phi co-processor in self-hosted mode, where processes run entirely on the device rather than using the offload model.
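
A sketch of how a portable kernel reaches a particular back-end (the execution-space and macro names follow current public Kokkos releases and may differ from the version described here; the fill kernel is illustrative, and Kokkos::initialize is assumed to have been called):

```cpp
#include <Kokkos_Core.hpp>

// The same portable kernel, dispatched through whichever back-end Kokkos was
// built with.  Kokkos::Cuda, Kokkos::OpenMP, and Kokkos::Threads name the
// CUDA, OpenMP, and pthreads back-ends listed above.
void fill(Kokkos::View<double*> x, double value) {
  const int N = static_cast<int>(x.extent(0));
#if defined(KOKKOS_ENABLE_CUDA)
  using ExecSpace = Kokkos::Cuda;    // NVIDIA GPU back-end
#elif defined(KOKKOS_ENABLE_OPENMP)
  using ExecSpace = Kokkos::OpenMP;  // CPU / Xeon Phi back-end
#else
  using ExecSpace = Kokkos::DefaultExecutionSpace;
#endif
  Kokkos::parallel_for(Kokkos::RangePolicy<ExecSpace>(0, N),
                       KOKKOS_LAMBDA(const int i) { x(i) = value; });
}
```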

In this paper, we first describe Kokkos abstractions, API, and extension points. Then, we present performance results for unit-test kernels and mini-applications. Finally, we outline a strategy for legacy C++ codes to migrate to manycore devices.

Section snippets

Abstraction of a manycore device

Our abstraction of a modern HPC environment is a network of compute nodes where each compute node contains one or more manycore devices. A typical HPC application in this environment has at least two levels of parallelism: (1) distributed memory parallelism typically supported through a Message Passing Interface (MPI) library and (2) fine-grain shared memory parallelism supported through one of the many thread-level programming models.
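
A minimal skeleton of this two-level structure (the MPI calls are standard; the Kokkos initialization calls follow current releases and are our assumption about usage, not code from the paper):

```cpp
#include <mpi.h>
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);          // level 1: distributed-memory parallelism between compute nodes
  Kokkos::initialize(argc, argv);  // level 2: fine-grain parallelism on the node's manycore device(s)

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // ... each MPI process dispatches Kokkos data-parallel kernels to its device ...

  Kokkos::finalize();
  MPI_Finalize();
  return 0;
}
```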

In our abstraction, an MPI process has a single master

Multidimensional array

A Kokkos multidimensional array consists of: (1) a set of data $\{x_\iota\}$ of the same value type and residing in the same memory space, (2) an index space $X_S$ defined by the Cartesian product of integer ranges, and (3) a layout $X_L$, a bijective map between the index space and the set of data. (Note that equality of two datums’ values does not imply that they are the same datum: $x_\iota = x_\kappa \nRightarrow \iota = \kappa$.)
$$X = (\{x_\iota\},\, X_S,\, X_L), \qquad X_S = [0..N_0) \times [0..N_1) \times \cdots, \qquad X_L : X_S \leftrightarrow \{x_\iota\}.$$
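
Expressed in Kokkos syntax (a minimal sketch using the View spellings of current releases; the offset formulas are the standard row-major and column-major maps, and Kokkos::initialize is assumed to have been called elsewhere):

```cpp
#include <Kokkos_Core.hpp>

// A rank-2 array with index space X_S = [0..N0) x [0..N1).  The layout X_L is
// the bijection from (i, j) in X_S to a position in linear memory.
void layout_example() {
  const int N0 = 4, N1 = 3;

  // LayoutRight: offset(i, j) = i*N1 + j  (rightmost index has stride 1).
  Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::HostSpace> R("R", N0, N1);

  // LayoutLeft:  offset(i, j) = i + j*N0  (leftmost index has stride 1).
  Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace> L("L", N0, N1);

  // User code addresses only the logical index space; the layout decides
  // where each datum lands in memory.
  for (int i = 0; i < N0; ++i)
    for (int j = 0; j < N1; ++j) {
      R(i, j) = i * N1 + j;
      L(i, j) = i * N1 + j;
    }
}
```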

A function typically contains a sequence of nested loops over dimensions of an array X

Parallel execution

Parallel execution patterns  [28] are divided into two categories: (1) data parallel or single instruction multiple data (SIMD) and (2) task parallel or multiple instruction multiple data (MIMD). Kokkos currently implements data parallel execution with parallel_for, parallel_reduce, and parallel_scan operations. The parallel_scan operation was implemented after the initial submission of this paper and is not described here. Research and development is in progress for hierarchical task–data
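
A sketch of the functor-based dispatch interface (using the View, parallel_for, and parallel_reduce spellings of current public Kokkos releases; the AXPY and Dot kernels are illustrative and not taken from the paper):

```cpp
#include <Kokkos_Core.hpp>

// A data-parallel kernel is a C++ functor whose operator() body is one
// iteration of the parallel loop.
struct AXPY {
  Kokkos::View<double*> y;
  Kokkos::View<const double*> x;
  double alpha;
  KOKKOS_INLINE_FUNCTION void operator()(const int i) const { y(i) += alpha * x(i); }
};

// A reduction functor additionally receives a per-thread partial result.
struct Dot {
  Kokkos::View<const double*> x, y;
  using value_type = double;  // type of the reduction result
  KOKKOS_INLINE_FUNCTION void operator()(const int i, double& partial) const {
    partial += x(i) * y(i);
  }
};

double axpy_dot(Kokkos::View<double*> y, Kokkos::View<const double*> x, double alpha) {
  const int N = static_cast<int>(x.extent(0));
  Kokkos::parallel_for(N, AXPY{y, x, alpha});  // y = y + alpha*x, in parallel
  double result = 0.0;
  Kokkos::parallel_reduce(N, Dot{x, Kokkos::View<const double*>(y)}, result);  // result = x . y
  return result;
}
```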

Performance evaluation with simple kernels

We evaluate Kokkos performance with simple kernels and mini-applications (Section 6). Performance testing is carried out on our Compton and Shannon testbed clusters. Compton is used for the Intel Xeon and Intel Xeon Phi tests, and Shannon is used for the NVIDIA Kepler (K20x) tests. Testbed configuration details are given in Table 2. Note that in these configurations, device refers to a dual-socket Xeon node, a single Xeon Phi, or a single Kepler GPU, respectively.

Results presented in this paper are for

MiniFE

MiniFE is a hybrid parallel (MPI + X) finite element mini-application that (1) constructs a linear system of equations for a 3D heat diffusion problem and (2) performs 200 iterations of a conjugate gradient (CG) solver on that linear system. This mini-application is designed to capture important performance characteristics of an implicit parallel finite element code. MiniFE has been implemented in numerous programming models, some of which are available through the Mantevo suite of

Legacy code migration strategy

The legacy code migration strategy presented here was developed based upon our experience implementing Kokkos variants of miniMD and miniFE. This strategy has five steps: (1) change data structures, (2) develop functors, (3) enable dispatch (offload model) for GPU execution, (4) optimize algorithms for threading, and (5) specialize kernels for specific architectures. These steps can be carried out either for the whole legacy code at once or incrementally within its components. We described
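
A sketch of the first three steps applied to a hypothetical legacy loop (the field and functor names are invented for illustration, not taken from miniMD or miniFE; the View and parallel_for spellings follow current Kokkos releases):

```cpp
#include <Kokkos_Core.hpp>

// Step 1 -- change data structures: a legacy std::vector<double> field becomes
// a Kokkos::View so that its layout and memory space can be controlled.
//   (legacy)  std::vector<double> temperature(ncell);
//   (ported)  Kokkos::View<double*> temperature("temperature", ncell);

// Step 2 -- develop functors: the body of the legacy serial loop becomes the
// operator() of a functor.
struct ApplySource {
  Kokkos::View<double*> temperature;
  Kokkos::View<const double*> source;
  double dt;
  KOKKOS_INLINE_FUNCTION void operator()(const int cell) const {
    temperature(cell) += dt * source(cell);
  }
};

// Step 3 -- enable dispatch: the former `for (int cell = 0; ...)` loop becomes a
// parallel_for, executed by the CUDA, OpenMP, or pthreads back-end.
void apply_source(Kokkos::View<double*> temperature,
                  Kokkos::View<const double*> source, double dt) {
  Kokkos::parallel_for(temperature.extent(0), ApplySource{temperature, source, dt});
}
```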

Conclusion

The Kokkos C++ library implements our strategy for manycore performance portable HPC applications and libraries. Two foundational abstractions are implemented: (1) dispatching parallel functors to a manycore device and (2) managing the layout of multidimensional arrays so that those functors have device-appropriate memory access patterns. We defined Kokkos’ manycore parallel abstractions and summarized the C++ API.

We demonstrated performance portability with unit-test kernels and mini-applications

Acknowledgments

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the US Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This paper is cross-referenced at Sandia as SAND2013-5603J.

References (45)

  • C. Augonnet et al., StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurr. Comput.: Pract. Exp. (2011)
  • C.G. Baker et al., A light-weight API for portable multicore programming
  • S. Balay et al., Efficient management of parallelism in object oriented numerical software libraries
  • N. Bell, M. Garland, Efficient sparse matrix–vector multiplication on CUDA, NVIDIA Technical Report NVR-2008-004, ...
  • F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, R. Namyst, hwloc: a generic ...
  • A. Buttari et al., A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Comput. (2009)
  • C++ AMP Home Page, July ...
  • Charm++ Home Page, July 2013, ...
  • A. Chtchelkanova, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, R. van de Geijn, Towards usable and lean parallel ...
  • Cilk Plus Home Page, July 2013, ...
  • CUDA Home Page, June 2013, ...
  • CUDA Toolkit Thrust Documentation, June ...
  • V.V. Dimakopoulos et al., HOMPI: a hybrid programming framework for expressing and deploying task-based parallelism
  • A. Duran et al., OmpSs: a proposal for programming heterogeneous multi-core architectures, Parallel Process. Lett. (2011)
  • H.C. Edwards, D. Sunderland, Kokkos Array performance-portable manycore programming model, in: PMAM, 2012, pp. ...
  • H.C. Edwards et al., Multicore/GPGPU portable computational kernels via multidimensional arrays
  • H.C. Edwards et al., Manycore performance-portability: Kokkos multidimensional array library, Sci. Program. (2012)
  • R. Garcia, J. Siek, A. Lumsdaine, Boost.MultiArray, June ...
  • T. Gautier, J.V.F. Lima, N. Maillard, B. Raffin, et al., XKAAPI: a runtime system for data-flow task programming on ...
  • K. Gregory et al., C++ AMP, Accelerated Massive Parallelism with Microsoft Visual C++ (2012)
  • IEEE Std 1003.1, 2004 Edition, <pthread.h>, ...
  • Information Technology Industry Council, Programming Languages—C++, International Standard ISO/IEC 14882, first ed., ...
H. Carter Edwards has over three decades of experience developing software for simulations of a variety of engineering domains. He is an expert in high performance computing (HPC) and is currently focusing on thread-scalable algorithms and data structures for heterogeneous manycore architectures such as the NVIDIA GPU and Intel Xeon Phi.

He has a B.S. and M.S. in aerospace engineering from the University of Texas at Austin, and worked for ten years at the Johnson Space Center in the domain of spacecraft guidance, navigation, and control. He has a Ph.D. in computational and applied mathematics, also from the University of Texas at Austin. He has been researching and developing software for HPC algorithms and data structures for the past sixteen years at Sandia National Laboratories.

Christian R. Trott is a high performance computing expert with experience in designing and implementing software for GPU and MIC compute clusters.

He earned a Dr. rer. nat. in theoretical physics from the University of Technology Ilmenau. His prior scientific work focused on computational materials research using ab initio calculations, molecular dynamics simulations, and Monte Carlo methods to investigate ion-conducting glass materials. Since 2012 Christian has been a postdoctoral appointee at Sandia National Laboratories, where he works on developing scientific codes for future manycore architectures.

Daniel Sunderland is an expert in high performance computing who specializes in designing scalable data structures and algorithms for manycore architectures.

He earned his master’s degree in computer science from Utah State University. Since 2009 he has been employed by Sandia National Laboratories, developing and maintaining multi-physics engineering codes for current and future HPC architectures.
