ABSTRACT
GPUs are a class of specialized parallel architectures with tremendous computational power. The Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates the programming of general-purpose applications on its GPUs. However, manual development of high-performance parallel code for GPUs remains very challenging. This paper addresses a number of issues toward the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) an approach to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. Experimental results on a number of kernels demonstrate the effectiveness of the compiler optimization approaches developed.
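To make the first two optimizations concrete, the sketch below shows a hand-written tiled matrix transpose in CUDA. This is an illustration of the target code patterns, not code generated by the paper's framework; the kernel name, the tile size TILE, and the padding factor of 1 are assumptions chosen for exposition, whereas the framework derives such transformations and padding factors automatically.

```cuda
// Illustrative sketch (assumed names and parameters, not the paper's
// generated code): a tiled matrix transpose combining coalesced global
// memory access with a padded shared-memory tile.
#include <cuda_runtime.h>

#define TILE 16  // assumed tile size; the paper tunes such parameters

__global__ void transpose(const float *in, float *out, int n)
{
    // +1 padding column: without it, the column-wise reads below would
    // map the threads of a half-warp to the same shared-memory bank,
    // serializing the accesses.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced global load: consecutive threadIdx.x values read
    // consecutive global addresses.
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Swap block coordinates so the global store is also coalesced;
    // the tile is read column-wise, which the padding keeps conflict-free.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

A typical launch would use TILE x TILE thread blocks over an (n/TILE) x (n/TILE) grid, e.g. transpose<<<dim3((n + TILE - 1) / TILE, (n + TILE - 1) / TILE), dim3(TILE, TILE)>>>(d_in, d_out, n).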