AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Authors:
Kazuaki Matsumura (Barcelona Supercomputing Center, Spain)
Hamid Reza Zohouri (Edgecortix, Japan)
Mohamed Wahib (AIST, Japan)
Toshio Endo (Tokyo Institute of Technology, Japan)
Satoshi Matsuoka (RIKEN CCS, Japan)

CGO 2020: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, February 2020, Pages 199–211, https://doi.org/10.1145/3368826.3377904

Published: 22 February 2020

Related Artifact: AN5D-Artifact January 2020 software https://doi.org/10.1145/3373127

ABSTRACT

Stencil computation is one of the most widely used compute patterns in high-performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, implementing these optimizations correctly and with high performance is difficult, given the complexity of the GPU architecture and memory hierarchy. We propose AN5D, an automated stencil framework that transforms and optimizes stencil patterns in given C source code and generates the corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance to scale up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
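
To make the abstract's terms concrete, the following is a minimal sketch in plain C of a 1D three-point Jacobi stencil, computed once step by step and once with overlapped temporal blocking. It is an illustration only: it is not AN5D's generated CUDA code and makes no claim about the blocking scheme the framework uses. The function names (`sweep`, `stencil_naive`, `stencil_blocked`, `max_abs_diff`), the averaging stencil, the fixed boundary cells, and the parameters `tile` and `bt` (temporal blocking degree) are all assumptions chosen for the example. The local buffers stand in for on-chip memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One Jacobi sweep of a 3-point averaging stencil (hypothetical example).
   The two end cells are treated as fixed boundaries and copied through. */
static void sweep(const float *in, float *out, int n) {
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* Baseline: every time step streams the whole array through memory. */
void stencil_naive(float *a, int n, int steps) {
    assert(n >= 3 && steps >= 0);
    float *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp != NULL);
    for (int t = 0; t < steps; t++) {
        sweep(a, tmp, n);
        memcpy(a, tmp, (size_t)n * sizeof *a);
    }
    free(tmp);
}

/* Overlapped temporal blocking: each tile is loaded together with a halo
   of up to bt cells per side, advanced up to bt steps in a small local
   buffer, and only the tile's own cells are written back. Halo cells go
   stale by one cell per step, so they are recomputed by the neighbours. */
void stencil_blocked(float *a, int n, int steps, int tile, int bt) {
    assert(n >= 3 && steps >= 0 && tile >= 1 && bt >= 1);
    size_t width = (size_t)tile + 2 * (size_t)bt;
    float *next = malloc((size_t)n * sizeof *next);
    float *buf0 = malloc(width * sizeof *buf0);
    float *buf1 = malloc(width * sizeof *buf1);
    assert(next != NULL && buf0 != NULL && buf1 != NULL);

    for (int t0 = 0; t0 < steps; t0 += bt) {
        int k = (steps - t0 < bt) ? steps - t0 : bt;  /* steps in this block */
        memcpy(next, a, (size_t)n * sizeof *a);       /* keeps boundary cells */
        for (int lo = 1; lo < n - 1; lo += tile) {
            int hi  = (lo + tile < n - 1) ? lo + tile : n - 1; /* exclusive */
            int glo = (lo - k > 0) ? lo - k : 0;      /* halo, clamped to domain */
            int ghi = (hi + k < n) ? hi + k : n;      /* exclusive */
            int len = ghi - glo;
            float *cur = buf0, *nxt = buf1;
            memcpy(cur, a + glo, (size_t)len * sizeof *a);
            for (int s = 0; s < k; s++) {
                sweep(cur, nxt, len);
                float *swap = cur; cur = nxt; nxt = swap;
            }
            /* Only cells [lo, hi) are guaranteed valid after k local steps. */
            memcpy(next + lo, cur + (lo - glo), (size_t)(hi - lo) * sizeof *a);
        }
        memcpy(a, next, (size_t)n * sizeof *a);
    }
    free(next);
    free(buf0);
    free(buf1);
}

/* Largest element-wise absolute difference between two arrays. */
float max_abs_diff(const float *x, const float *y, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > m) m = d;
    }
    return m;
}
```

Running `stencil_naive` and `stencil_blocked` on copies of the same array should give matching results. In the blocked version each tile is read from the main array once per `bt` steps rather than once per step, at the cost of redundantly recomputing the halo cells. That trade-off is what makes larger blocking degrees progressively more expensive in buffer space and redundant work.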

Index Terms

Software and its engineering > Software notations and tools > Compilers > Source code generation

Published in
CGO 2020: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization
February 2020
329 pages
ISBN: 9781450370479
DOI: 10.1145/3368826
General Chairs: Jason Mars (University of Michigan, USA), Lingjia Tang (University of Michigan, USA)
Program Chairs: Jingling Xue (UNSW, Australia), Peng Wu (Futurewei Technologies, USA)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available
Author Tags
Automatic Code Generation
GPU
Stencil Computation
Temporal Blocking
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 312 of 1,061 submissions, 29%