research-article

Hierarchical overlapped tiling

Authors:
Xing Zhou

University of Illinois at Urbana-Champaign

University of Illinois at Urbana-Champaign
View Profile

,
Jean-Pierre Giacalone

Intel Corporation

Intel Corporation
View Profile

,
María Jesús Garzarán

University of Illinois at Urbana-Champaign

University of Illinois at Urbana-Champaign
View Profile

,
Robert H. Kuhn

Intel Corporation

Intel Corporation
View Profile

,
Yang Ni

Intel Corporation

Intel Corporation
View Profile

,
David Padua

University of Illinois at Urbana-Champaign

University of Illinois at Urbana-Champaign
View Profile

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and OptimizationMarch 2012Pages 207–218https://doi.org/10.1145/2259016.2259044

Published:31 March 2012Publication History

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

Pages 207–218

ABSTRACT

This paper introduces hierarchical overlapped tiling, a transformation that applies loop tiling and fusion to conventional loops. Overlapped tiling is a useful transformation to reduce communication overhead, but it may also generate a significant amount of redundant computation. Hierarchical overlapped tiling performs overlapped tiling hierarchically to balance communication overhead and redundant computation, and thus has the potential to provide better performance.

In this paper, we describe the hierarchical overlapped tiling optimization and its implementation in an OpenCL compiler. We also evaluate the effectiveness of this optimization using 8 programs that implement different forms of stencil computation. Our results show that hierarchical overlapped tiling achieves an average 37% speedup over traditional tiling on a 32-core workstation.

References

W. Abu-Sufah, D. Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. Computers, IEEE Transactions on, C-30(5):341--356, may 1981. Google ScholarDigital Library
N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. In Supercomputing, ACM/IEEE 2000 Conference, page 31, nov. 2000. Google ScholarDigital Library
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, 2001. Google ScholarDigital Library
M. Alpert. Not just Fun and Games. Scientific American, 4, 1999.Google Scholar
W. F. Ames. Numerical Methods for Partial Differential Equations. Academic, San Diego, CA, sencond edition, 1977.Google Scholar
R. Andonov, S. Balev, S. Rajopadhye, and N. Yanev. Optimal semi-oblique tiling. In Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '01, pages 153--162, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
W. Blume and R. Eigenmann. Symbolic range propagation. In Proceedings. of 9th International Parallel Processing Symposium, 1995., pages 357--363, apr 1995. Google ScholarDigital Library
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI '08, pages 101--113, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation, PLDI '95, pages 279--290, New York, NY, USA, 1995. ACM. Google ScholarDigital Library
K. Högstedt, L. Carter, and J. Ferrante. Selecting tile shape for minimal execution time. In Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, SPAA '99, pages 201--211, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusam. Compact thermal modeling for temperature-aware design. In Proceedings of the 41st annual Design Automation Conference, DAC '04, pages 878--883, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
T. Johnson, S.-I. Lee, L. Fei, A. Basumallik, G. Upadhyaya, R. Eigenmann, and S. Midkiff. Experiences in using cetus for source-to-source transformations. In Languages and Compilers for High Performance Computing, volume 3602 of Lecture Notes in Computer Science, pages 922--922. 2005. Google ScholarDigital Library
W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The omega library interface guide. Technical report, 1995. Google ScholarDigital Library
Khronos OpenCL Working Group. The OpenCL Specification, version 1.0.29, 8 December 2008.Google Scholar
G. Kreisel and J.-L. Krivine. Elements of mathematical logic. North-Holland Pub. Co., 1967.Google Scholar
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI '07, pages 235--244, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '81, pages 207--218, New York, NY, USA, 1981. ACM. Google ScholarDigital Library
A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin. A mapping path for multi-gpgpu accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pages 51--61, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
J. Liu, Y. Zhang, W. Ding, and M. Kandemir. On-chip cache hierarchy-aware tile scheduling for multicore machines. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 161--170, april 2011. Google ScholarDigital Library
D. B. Loveman. Program improvement by source-to-source transformation. Journal of the ACM, 24:121--145, 1977. Google ScholarDigital Library
J. Meng and K. Skadron. Performance modeling and automatic ghost zone optimization for iterative stencil loops on gpus. In Proceedings of the 23rd international conference on Supercomputing, ICS '09, pages 256--265, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for nonshared memory machines. In Proceedings of the 1991 ACM/IEEE conference on Supercomputing, Supercomputing '91, pages 111--120, New York, NY, USA, 1991. ACM. Google ScholarDigital Library
M. Ripeanu, A. Iamnitchi, and I. Foster. Cactus application: Performance predictions in grid environments. In Euro-Par 2001 Parallel Processing, volume 2150 of Lecture Notes in Computer Science, pages 807--816, 2001. Google ScholarDigital Library
Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, PLDI '99, pages 215--228, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 117--128, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
M. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452--471, oct 1991. Google ScholarDigital Library
M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, PLDI '91, pages 30--44, New York, NY, USA, 1991. ACM. Google ScholarDigital Library
M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, pages 357--361, Philadelphia, PA, USA, 1989. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
M. Wolfe. More iteration space tiling. In Proceedings of the 1989 ACM/IEEE conference on Supercomputing, Supercomputing '89, pages 655--664, New York, NY, USA, 1989. ACM. Google ScholarDigital Library
D. Wonnacott. Time skewing for parallel computers. In Languages and Compilers for Parallel Computing, volume 1863, pages 477--480, 2000. Google ScholarDigital Library

Index Terms

Hierarchical overlapped tiling
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

Intel® Xeon Phi™ coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, which is an innovative new processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting SIMD ...
Read More
Optimal Parallelogram Selection for Hierarchical Tiling

Loop tiling is an effective optimization to improve performance of multiply nested loops, which are the most time-consuming parts in many programs. Most massively parallel systems today are organized hierarchically, and different levels of the hierarchy ...
Read More
Writing productive stencil codes with overlapped tiling
Compilers for Parallel Computers 2007 Workshop (CPC 2007)

Stencil computations constitute the kernel of many scientific applications. Tiling is often used to improve the performance of stencil codes for data locality and parallelism. However, tiled stencil codes typically require shadow regions, whose management ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization
March 2012
285 pages
ISBN:9781450312066
DOI:10.1145/2259016
General Chairs:
Carol Eidt
Microsoft
,
Anne Holler
VMware
,
Program Chairs:
Uma Srinivasan
Intel
,
Saman Amarasinghe
MIT
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 31 March 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
compiler optimization
loop tiling and fusion
stencil computation
Qualifiers
- research-article
Conference

Acceptance Rates
CGO '12 Paper Acceptance Rate26of90submissions,29%Overall Acceptance Rate312of1,061submissions,29%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 77
  Total Citations
  View Citations
- 485
  Total Downloads
- Downloads (Last 12 months)15
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Hierarchical overlapped tiling

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

ABSTRACT

References

Cited By

Index Terms

Recommendations

Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors

Optimal Parallelogram Selection for Hierarchical Tiling

Writing productive stencil codes with overlapped tiling