ABSTRACT
This paper introduces hierarchical overlapped tiling, a transformation that applies loop tiling and fusion to conventional loops. Overlapped tiling is a useful transformation to reduce communication overhead, but it may also generate a significant amount of redundant computation. Hierarchical overlapped tiling performs overlapped tiling hierarchically to balance communication overhead and redundant computation, and thus has the potential to provide better performance.
In this paper, we describe the hierarchical overlapped tiling optimization and its implementation in an OpenCL compiler. We also evaluate the effectiveness of this optimization using 8 programs that implement different forms of stencil computation. Our results show that hierarchical overlapped tiling achieves an average 37% speedup over traditional tiling on a 32-core workstation.
- W. Abu-Sufah, D. Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. Computers, IEEE Transactions on, C-30(5):341--356, may 1981. Google ScholarDigital Library
- N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. In Supercomputing, ACM/IEEE 2000 Conference, page 31, nov. 2000. Google ScholarDigital Library
- R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, 2001. Google ScholarDigital Library
- M. Alpert. Not just Fun and Games. Scientific American, 4, 1999.Google Scholar
- W. F. Ames. Numerical Methods for Partial Differential Equations. Academic, San Diego, CA, sencond edition, 1977.Google Scholar
- R. Andonov, S. Balev, S. Rajopadhye, and N. Yanev. Optimal semi-oblique tiling. In Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '01, pages 153--162, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
- W. Blume and R. Eigenmann. Symbolic range propagation. In Proceedings. of 9th International Parallel Processing Symposium, 1995., pages 357--363, apr 1995. Google ScholarDigital Library
- U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI '08, pages 101--113, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation, PLDI '95, pages 279--290, New York, NY, USA, 1995. ACM. Google ScholarDigital Library
- K. Högstedt, L. Carter, and J. Ferrante. Selecting tile shape for minimal execution time. In Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, SPAA '99, pages 201--211, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
- W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusam. Compact thermal modeling for temperature-aware design. In Proceedings of the 41st annual Design Automation Conference, DAC '04, pages 878--883, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
- T. Johnson, S.-I. Lee, L. Fei, A. Basumallik, G. Upadhyaya, R. Eigenmann, and S. Midkiff. Experiences in using cetus for source-to-source transformations. In Languages and Compilers for High Performance Computing, volume 3602 of Lecture Notes in Computer Science, pages 922--922. 2005. Google ScholarDigital Library
- W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The omega library interface guide. Technical report, 1995. Google ScholarDigital Library
- Khronos OpenCL Working Group. The OpenCL Specification, version 1.0.29, 8 December 2008.Google Scholar
- G. Kreisel and J.-L. Krivine. Elements of mathematical logic. North-Holland Pub. Co., 1967.Google Scholar
- S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI '07, pages 235--244, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '81, pages 207--218, New York, NY, USA, 1981. ACM. Google ScholarDigital Library
- A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin. A mapping path for multi-gpgpu accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pages 51--61, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- J. Liu, Y. Zhang, W. Ding, and M. Kandemir. On-chip cache hierarchy-aware tile scheduling for multicore machines. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 161--170, april 2011. Google ScholarDigital Library
- D. B. Loveman. Program improvement by source-to-source transformation. Journal of the ACM, 24:121--145, 1977. Google ScholarDigital Library
- J. Meng and K. Skadron. Performance modeling and automatic ghost zone optimization for iterative stencil loops on gpus. In Proceedings of the 23rd international conference on Supercomputing, ICS '09, pages 256--265, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for nonshared memory machines. In Proceedings of the 1991 ACM/IEEE conference on Supercomputing, Supercomputing '91, pages 111--120, New York, NY, USA, 1991. ACM. Google ScholarDigital Library
- M. Ripeanu, A. Iamnitchi, and I. Foster. Cactus application: Performance predictions in grid environments. In Euro-Par 2001 Parallel Processing, volume 2150 of Lecture Notes in Computer Science, pages 807--816, 2001. Google ScholarDigital Library
- Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, PLDI '99, pages 215--228, New York, NY, USA, 1999. ACM. Google ScholarDigital Library
- Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 117--128, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- M. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452--471, oct 1991. Google ScholarDigital Library
- M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, PLDI '91, pages 30--44, New York, NY, USA, 1991. ACM. Google ScholarDigital Library
- M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, pages 357--361, Philadelphia, PA, USA, 1989. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
- M. Wolfe. More iteration space tiling. In Proceedings of the 1989 ACM/IEEE conference on Supercomputing, Supercomputing '89, pages 655--664, New York, NY, USA, 1989. ACM. Google ScholarDigital Library
- D. Wonnacott. Time skewing for parallel computers. In Languages and Compilers for Parallel Computing, volume 1863, pages 477--480, 2000. Google ScholarDigital Library
Index Terms
- Hierarchical overlapped tiling
Recommendations
Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD ForumIntel® Xeon Phi™ coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, which is an innovative new processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting SIMD ...
Optimal Parallelogram Selection for Hierarchical Tiling
Loop tiling is an effective optimization to improve performance of multiply nested loops, which are the most time-consuming parts in many programs. Most massively parallel systems today are organized hierarchically, and different levels of the hierarchy ...
Writing productive stencil codes with overlapped tiling
Compilers for Parallel Computers 2007 Workshop (CPC 2007)Stencil computations constitute the kernel of many scientific applications. Tiling is often used to improve the performance of stencil codes for data locality and parallelism. However, tiled stencil codes typically require shadow regions, whose management ...
Comments