ABSTRACT
Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
Index Terms
- AN5D: automated stencil framework for high-degree temporal blocking on GPUs
Recommendations
Toward accelerated stencil computation by adapting tensor core unit on GPU
ICS '22: Proceedings of the 36th ACM International Conference on SupercomputingThe Tensor Core Unit (TCU) has been increasingly adopted on modern high performance processors, specialized in boosting the performance of general matrix multiplication (GEMM). Due to its highly optimized hardware design, TCU can significantly ...
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
FPGA '18: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate ArraysRecent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for ...
Revisiting Temporal Blocking Stencil Optimizations
ICS '23: Proceedings of the 37th International Conference on SupercomputingIterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data ...
Comments