Modern parallel programming for contemporary HPC systems [3] typically involves multithreading for multi- or many-core CPUs and accelerators such as GPUs, as well as efficient communication between cluster nodes. The former can be implemented with, e.g., OpenMP, OpenCL or Pthreads for CPUs and CUDA, OpenCL or OpenACC for GPUs, while the latter is typically implemented with MPI. Paper [4] presents an exemplary implementation and optimization of parallel large vector similarity computations in a hybrid CPU+GPU environment, including load balancing and selection of configuration parameters. CUDA-aware MPI implementations allow using CUDA buffers in MPI calls, which simplifies implementation.
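As an illustration (not taken from the cited works), the following minimal sketch assumes a CUDA-aware MPI build; the scale kernel, buffer size and use of two ranks are purely illustrative. A device pointer is passed directly to MPI_Send()/MPI_Recv(), without staging the data through host memory:

```cuda
// Minimal sketch of exchanging GPU data with CUDA-aware MPI.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void scale(double *data, int n, double factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;
    cudaMalloc(&d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    if (rank == 0) {
        scale<<<(n + 255) / 256, 256>>>(d_buf, n, 2.0);
        cudaDeviceSynchronize();
        // Device pointer passed directly to MPI, no host staging buffer.
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```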
Before NVIDIA’s Unified Memory was introduced, other researchers proposed solutions for making GPU programming easier, especially with respect to memory management. Paper [1] presents a compiler approach for automatic scheduling of data transfers to and from accelerators, aimed at reducing the amount of data transferred between host and device. Only data that is needed or modified is transferred at identified locations, in contrast to a naive approach in which all data is copied. Significant gains were shown for selected Rodinia benchmarks such as breadth-first search, particlefilter, speckle reducing anisotropic diffusion and Needleman–Wunsch. In paper [11], the authors proposed the design of region-based software virtual memory (RSVM), a software virtual memory layer for both GPU and CPU. It offered transparent swapping of GPU memory to main memory for multiple kernels and fetching of data from the host to the GPU. NVIDIA’s Unified Memory [9, 19], introduced in CUDA 6, simplifies implementation of a CUDA application since it basically only requires allocating memory using cudaMallocManaged() instead of cudaMalloc() and proper synchronization on the host side after invocation of a kernel function before reading results, since a kernel call is asynchronous from the point of view of the host. Unified Memory was further extended in subsequent versions of CUDA. Specifically, CUDA 8 with Pascal and later GPUs [26] extended UM with 49-bit virtual addressing and on-demand page migration. The new UM makes it possible to use the whole system memory and allows memory oversubscription. In CUDA 9 and on Volta GPUs [27], tracking of page accesses through additional counters was introduced. The frequency of accesses can influence the driver's decisions on page movements. The driver can migrate pages proactively and perform intelligent eviction [28].
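The host-side difference introduced by Unified Memory boils down to the allocation call and synchronizing before results are read on the host. A minimal sketch, with the increment kernel and problem size purely illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    // Managed allocation: the same pointer is valid on host and device,
    // no explicit cudaMemcpy() calls are needed.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; i++) data[i] = 0.0f;   // initialized on the host

    increment<<<(n + 255) / 256, 256>>>(data, n);
    // The kernel launch is asynchronous from the host's point of view,
    // so synchronize before reading the results on the host.
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```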
In the Unified Memory version of the code, it is possible to provide hints on where and how data are to be used. This can be done with the function cudaMemPrefetchAsync() for prefetching data to a GPU before a relevant kernel using that data is launched. Furthermore, it is possible to specify how data in managed memory at a given location and of a given size will be used with a call to the function cudaMemAdvise() [20, 27]. The value cudaMemAdviseSetReadMostly advises that the data will mostly be read and rarely written to, cudaMemAdviseSetPreferredLocation sets the preferred location to the memory of the given device, and cudaMemAdviseSetAccessedBy hints that the data will be accessed from a given device and causes the data to be mapped in that processor's page tables, which allows preventing page faults [31].
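A minimal sketch of combining these hints with prefetching is given below; the scale kernel, array sizes and the choice of which advice to apply to which buffer are illustrative assumptions rather than a prescription:

```cuda
#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f * in[i];
}

int main() {
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);
    int device = 0;
    cudaGetDevice(&device);

    float *in, *out;
    cudaMallocManaged(&in, bytes);
    cudaMallocManaged(&out, bytes);
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    // The input will mostly be read and rarely written to.
    cudaMemAdvise(in, bytes, cudaMemAdviseSetReadMostly, device);
    // Prefer keeping the output pages in this GPU's memory...
    cudaMemAdvise(out, bytes, cudaMemAdviseSetPreferredLocation, device);
    // ...and map them into the GPU's page tables to avoid page faults on access.
    cudaMemAdvise(out, bytes, cudaMemAdviseSetAccessedBy, device);

    // Move the input to the GPU before the kernel that uses it is launched.
    cudaMemPrefetchAsync(in, bytes, device, 0);
    scale<<<(n + 255) / 256, 256>>>(in, out, n, 2.0f);
    // Prefetch the results back to the host (cudaCpuDeviceId) before reading them.
    cudaMemPrefetchAsync(out, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```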
In non-UM versions of the code, it is possible to overlap computations and communication by launching sequences of copy, kernel execution and copy in several streams, such that copy and execute operations in different streams can be overlapped. Streams can also be used with prefetching in UM-enabled code [27].
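A minimal non-UM sketch of this pattern follows; the number of streams, chunk size and process kernel are illustrative. Each chunk's host-to-device copy, kernel and device-to-host copy are issued in its own stream so that copies of one chunk can overlap with computation on another (pinned host memory is used so that the asynchronous copies can actually overlap):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i];
}

int main() {
    const int nStreams = 4;
    const int chunk = 1 << 20;
    const int n = nStreams * chunk;

    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory for async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    // Copy-in, kernel and copy-out of each chunk go to a separate stream.
    for (int s = 0; s < nStreams; s++) {
        const size_t off = (size_t)s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

In UM-enabled code, the analogous pattern replaces the cudaMemcpyAsync() calls with cudaMemPrefetchAsync() calls issued in the corresponding streams.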
In paper [14], the authors investigated Unified Memory access performance in CUDA. The experiments used custom as well as Rodinia microbenchmarks run on a system with Xeon E5530 CPUs and NVIDIA K20c GPUs. It was demonstrated that in order to see performance benefits from using UM, kernels should operate on subsets of the output data at a time, allowing the paging subsystem to come into play. UM is also advantageous for very complex data structures. Other scenarios result in better performance for the regular memory management approach. In work [22], the authors assessed the relative performance of UM and non-UM versions of code for computation of scalar products. A Kepler generation NVIDIA GeForce GTX 680 was used. The UM version resulted in 35% longer execution times. In paper [15], the authors demonstrate an average 10% loss of performance when using versions of benchmarks such as CUDA SDK's Diffusion3D Benchmark, Parboil Benchmark Suite and Matrix Multiplication ported to Unified Memory. Tests were performed on an NVIDIA Kepler K40 and the TK1. This seems to be a reasonable performance penalty for a potentially easier programming model. In paper [10], we investigated the performance of Unified Memory versions versus standard memory implementations with cudaMalloc() for three applications with control and data flow characteristic of the SPMD, geometric SPMD and divide-and-conquer paradigms. Parallel implementations of verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration were used. For the heat simulation, depending on the number of iterations between data transfers performed for visualization purposes, UM resulted in performance worse by about 1.5% for 50 iterations up to 8.7% for 10 iterations per visualization. For integration, UM versions also resulted in worse performance, by approximately 30% for integration of 100,000 subranges. Finally, for the implementation of verification of Goldbach's conjecture, performance of the UM versions was only up to 2% worse than that of the standard version.
Unified Memory can also be used in programs written using OpenACC [17, 25]. Article [25] presents a 2D Jacobi iteration code used along with Unified Memory. A 7x performance improvement over an OpenMP multicore CPU version running on an Intel Xeon E5-2698 v3 is presented, with no performance loss compared to a standard OpenACC version running on an NVIDIA K40. For LULESH, OpenACC+Unified Memory offered a 3.14x performance increase over OpenMP run on the multicore CPU and 92% of the performance of a standard OpenACC version. Article [26] shows a further performance increase of the OpenACC+Unified Memory code run on an NVIDIA Tesla Pascal P100, 8.57x better than the CPU version.
Since OpenMP version 4.x, support for GPU offloading has become available [5]. The authors of paper [18] investigated the usage and impact of Unified Memory when used at the level of OpenMP with the latest offloading features. They modified the OpenMP runtime in the LLVM framework so that it allocates data in Unified Memory. Tests were performed on SummitDev, which has 54 nodes, each with 2 POWER8 CPUs and 4 Tesla P100 GPUs, using backpropagation, breadth-first search, CFD solver, K-means, kNN and speckle reducing anisotropic diffusion benchmarks. For applications with little data reuse, performance without Unified Memory is slightly better, as opposed to benchmarks with significant data reuse, for which performance with Unified Memory is better to various degrees depending on input data size. Memory oversubscription is possible as an additional benefit over the standard implementation. Sometimes, though, as for BFS, CFD and SRAD, it results in worse performance than a CPU version. CPU codes for backpropagation and NN offered lower execution times than the GPU versions. Usage of Unified Memory with OpenMP 4.5, along with technical solutions, was described in paper [7].
Support for GPU Unified Memory in the OmpSs model and the Nanos runtime is discussed in work [30], along with its design and implementation as well as an evaluation using microbenchmarks and Rodinia. For Rodinia benchmarks, relative speedups of up to approximately 0.85 compared to pure CUDA versions were obtained.
Memory oversubscription has been studied in several papers in the literature. Paper [33] investigates a low-level implementation of paged GPU memory. It proposes ways to improve the performance of such a solution by combining replayable far-faults with demand prioritized prefetching. The results are within 15% of the best version with overlapped communication and execution. For oversubscription, the authors claim that in general a random eviction algorithm performs very well compared to more complex strategies, considering the overheads of the latter. In paper [13], the authors introduce GPUswap, which allows relocation of application data from the GPU to system RAM and thus memory oversubscription. At the time of its development and comparison with CUDA 6, the latter did not support memory oversubscription. Memory relocation delays for allocating memory or giving up memory when another application is allocating memory are presented, as are allocation delays for various chunk sizes.
Work [32] discusses the design and results of running HPGMG (high-performance geometric multigrid methods) on Pascal GPUs, including usage of Unified Memory. Specifically, throughput from using the oversubscription feature on an NVIDIA P100 (16 GB of memory) is compared to the throughput for configurations fitting within the GPU memory. For an NVLink P100 and a basic version, a performance drop from 160 to approx. 55-75 MDOF/s was observed for large memory sizes, which was still 2.5x higher than for a 2-socket Intel E5-2630 v3 system. After applying optimizations such as data prefetching using cudaMemPrefetchAsync() and user hints, results ranged from around 175 MDOF/s when fitting within GPU memory down to 55-135 MDOF/s for memory sizes exceeding the GPU memory size.
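As a simple illustration of the oversubscription feature itself (not of the HPGMG code), the following sketch assumes a Pascal or newer GPU with CUDA 8+; the 1.5x allocation factor, slice size and touch kernel are illustrative. A managed allocation larger than the GPU memory is processed slice by slice, with each slice prefetched so that pages are migrated (and evicted) in bulk rather than faulted one by one:

```cuda
#include <cuda_runtime.h>

__global__ void touch(char *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = (char)(i & 0xFF);
}

int main() {
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Oversubscribe: allocate 1.5x the physical GPU memory as managed memory.
    const size_t bytes = prop.totalGlobalMem + prop.totalGlobalMem / 2;
    char *data;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return 1;

    // Process the allocation slice by slice, prefetching each slice so that
    // the driver migrates and evicts pages in bulk instead of per-page faulting.
    const size_t slice = 1ULL << 30;   // 1 GiB slices (illustrative)
    for (size_t off = 0; off < bytes; off += slice) {
        const size_t len = (off + slice <= bytes) ? slice : bytes - off;
        cudaMemPrefetchAsync(data + off, len, device, 0);
        touch<<<(unsigned)((len + 255) / 256), 256>>>(data + off, len);
    }
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```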
In paper [2], it was shown how using multiple CUDA streams impacts the performance of a GPU application that processes a stream of data chunks sent from the host. Specifically, considerable gains are shown for 2 streams compared to 1 stream, around 20–30% for compute intensities below 1 and up to 50% for larger compute intensities, among the tested GPUs: Tesla K20m, GTX 1060, GeForce 940MX and Tesla V100. A further increase of between 3.3 and 4.9% is shown for 4 streams. Such data management optimization techniques can be especially profitable for frameworks for parallel data processing, such as KernelHive [23], which is able to process part of the data on CPUs and part on GPUs within a cluster.
Paper [6] proposes CRUM (Checkpoint-Restart for Unified Memory), a mechanism that allows forked checkpointing for Unified Memory and overlaps writing a checkpoint with application execution. The work shows little overhead, 6% on average, for running parallel hybrid MPI+CUDA applications such as HPGMG-FV and HYPRE.
In paper [12], the authors assessed execution times with the use of UM on an NVIDIA Tegra K1. The work analyzed UM-aware versions of Rodinia benchmarks such as pathfinder, needle, srad v2, Gaussian and lud, as well as Gauss–Seidel relaxation. The authors concluded that when the kernel time percentage is lower than 60%, UM exhibited gains on the K1 platform.
The motivation of this work is to assess preferable ways of programming efficient parallel codes running on modern GPUs, especially to compare the relative performance and ease of programming of standard and Unified Memory-based approaches. Importantly, the assessment is also performed for computations on very large data sets which do not fit into the memory of a single GPU. The relatively new memory oversubscription feature has not yet been assessed thoroughly in the literature. Results of the relevant experiments are of high importance because they are applicable to codes from many domains falling into the same processing paradigms as the tested applications.
5 Summary and future work
Within this paper, the performance of Unified Memory was assessed using the following applications: image processing including the Sobel filter and image rotation, streamed image processing and a fluid dynamic simulation. We compared the performance of standard Unified Memory implementations as well as Unified Memory with data prefetching to a standard implementation with explicit memory copying. Furthermore, we investigated the performance of applications using Unified Memory for multi-stream codes as well as Unified Memory with memory oversubscription. We have presented conclusions on the relative performance of the implementations for particular application types and have shown differences for two modern GPU models: the desktop Pascal series NVIDIA GTX 1080 and the server Volta series NVIDIA V100 GPUs.
We have concluded that Unified Memory generally results in worse performance than the standard memory management approach, while offering the benefit of easier programming. A Unified Memory implementation also allows performance improvements through programmer-assisted data prefetching; however, in our experiments prefetching gave better results than standard Unified Memory only for selected applications, namely multi-image processing with streams and the fluid dynamic simulation. Unified Memory can bring better results than explicit memory copying for simulation codes, such as fluid dynamic implementations, with a relatively large compute-to-communication ratio.
In the future, we plan to extend this research toward systems with more GPUs as well as incorporation of UM into our previous hybrid CPU+GPU implementations [4]. Another direction of research will be testing the impact of various host architectures on the performance of GPU processing. Recently, we tested the impact of IBM's AC922 system architecture, compared to NVIDIA's DGX Station, both with 4 NVIDIA V100 GPUs, on the performance of training deep neural networks [24]. We plan to make a similar comparison using various host architectures for the applications presented in this work, using Unified Memory.